System Overview

The detextive library consolidates MIME detection, charset inference, text decoding, and line-separator utilities behind a unified functional API.

Major Components

Public API

The public API comprises confidence-aware detection functions, inference orchestration functions, high-level decode functions, text predicates, and line-separator utilities:

  • detect_charset / detect_charset_confidence

  • detect_mimetype / detect_mimetype_confidence

  • infer_charset / infer_charset_confidence

  • infer_mimetype_charset / infer_mimetype_charset_confidence

  • decode

  • decode_inform

  • is_textual_mimetype

  • is_valid_text

  • LineSeparators utilities

Core Types and Configuration

  • Behaviors - policy object controlling parse/detect/trial/validation behaviors and confidence thresholds.

  • BehaviorTristate - execution mode for selected behavior paths (Never/AsNeeded/Always).

  • DetectFailureActions - fallback policy on detector failure (Default/Error).

  • CodecSpecifiers - dynamic trial codec slots (FromInference/OsDefault/PythonDefault/UserSupplement).

  • CharsetResult - charset with confidence score.

  • MimetypeResult - MIME type with confidence score.

  • DecodeInformResult - decoded text plus charset/mimetype/line-separator metadata.
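
As a rough illustration of how a few of these types might relate, the sketch below models them with the standard library. Only the type names, the enum members, and the attribute names text_validate and text_validate_confidence come from this document. The field names charset and confidence, the default values, the helper should_validate, and the reading of AsNeeded as "validate when confidence is below the threshold" are all assumptions, not the library's actual definitions.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class BehaviorTristate(Enum):
    Never = auto()
    AsNeeded = auto()
    Always = auto()


class DetectFailureActions(Enum):
    Default = auto()  # fall back to a caller-supplied default
    Error = auto()    # raise instead of falling back


@dataclass(frozen=True)
class CharsetResult:
    charset: Optional[str]  # assumed field name
    confidence: float       # assumed field name, 0.0 to 1.0


@dataclass(frozen=True)
class Behaviors:
    # Attribute names appear in the decoder flow; defaults are invented.
    text_validate: BehaviorTristate = BehaviorTristate.AsNeeded
    text_validate_confidence: float = 0.8


def should_validate(behaviors: Behaviors, result: CharsetResult) -> bool:
    """Hypothetical helper: decide whether text validation should run."""
    mode = behaviors.text_validate
    if mode is BehaviorTristate.Always:
        return True
    if mode is BehaviorTristate.Never:
        return False
    # AsNeeded (assumed meaning): validate only low-confidence results.
    return result.confidence < behaviors.text_validate_confidence
```

Under these assumptions, a tristate plus a threshold lets one policy object express "always check", "never check", and "check only when the detector is unsure" without extra flags.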

Layered Runtime Architecture

┌──────────────────────────────────────────────────────┐
│ Public API (__init__.py re-exports)                  │
└──────────────────────────────────────────────────────┘
                      │
┌──────────────────────────────────────────────────────┐
│ Decoding Layer (decoders.py)                         │
│ decode(), decode_inform()                            │
│ - HTTP Content-Type parse + charset-first attempt    │
│ - detector-assisted trial decode + text validation   │
│ - optional MIME/line-separator metadata              │
└──────────────────────────────────────────────────────┘
                      │
┌──────────────────────────────────────────────────────┐
│ Inference Layer (inference.py)                       │
│ infer_*() orchestration + header/location context    │
└──────────────────────────────────────────────────────┘
                      │
┌──────────────────────────────────────────────────────┐
│ Detection Layer (detectors.py)                       │
│ detector registries + confidence results             │
└──────────────────────────────────────────────────────┘
                      │
┌──────────────────────────────────────────────────────┐
│ Support Layer                                        │
│ charsets.py, mimetypes.py, validation.py,            │
│ lineseparators.py                                    │
└──────────────────────────────────────────────────────┘

Decoder Flow (v3)

decode and decode_inform share the same decoding core:

  1. Parse http_content_type when provided.

  2. If header MIME is non-textual, raise ContentDecodeImpossibility.

  3. If the header MIME type is textual and the header charset can decode the content, decode with that charset first.

  4. Otherwise, run detector-assisted trial decodes in configured codec order.

  5. Apply text validation according to Behaviors.text_validate and Behaviors.text_validate_confidence.

  6. Return text (decode) or structured metadata (decode_inform).
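
The six steps above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: decode_sketch, is_textual_stub, looks_like_text, and the default trial codec tuple are hypothetical names and values, the local ContentDecodeImpossibility class is only a stand-in for the exception named in step 2, and the validation heuristic (printable characters plus common whitespace) is an assumption.

```python
from email.message import Message


class ContentDecodeImpossibility(Exception):
    """Stand-in for the exception named in step 2."""


def parse_content_type(header):
    # Let the stdlib email parser split MIME type and charset parameter.
    msg = Message()
    msg['Content-Type'] = header
    return msg.get_content_type(), msg.get_content_charset()


def is_textual_stub(mimetype):
    # Assumed rule of thumb, not the library's is_textual_mimetype.
    return mimetype.startswith('text/') or mimetype in (
        'application/json', 'application/xml')


def looks_like_text(text):
    # Assumed validation: printable characters plus tab/newline/return.
    return all(ch in '\t\n\r' or ch.isprintable() for ch in text)


def decode_sketch(content, http_content_type=None,
                  trial_codecs=('utf-8', 'cp1252', 'latin-1')):
    header_charset = None
    if http_content_type:                                # step 1
        mimetype, header_charset = parse_content_type(http_content_type)
        if not is_textual_stub(mimetype):                # step 2
            raise ContentDecodeImpossibility(mimetype)
    candidates = [header_charset] if header_charset else []   # step 3
    for codec in trial_codecs:                           # step 4
        if codec not in candidates:
            candidates.append(codec)
    for codec in candidates:
        try:
            text = content.decode(codec)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown codec: try the next one
        if looks_like_text(text):                        # step 5
            return text, codec                           # step 6
    raise ContentDecodeImpossibility('no trial codec produced valid text')
```

Returning the codec alongside the text loosely mirrors the difference between decode (text only) and decode_inform (text plus metadata); a caller that wants only the text can discard the second element.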

Inference Flow

infer_* functions use contextual hints and detection orchestration:

  1. Optionally parse http_content_type depending on behavior settings.

  2. Consider location-based MIME hints.

  3. Run registered detectors for MIME and charset as configured.

  4. Use *_default values only as fallback return values, not as inputs to detection.

  5. Use *_supplement values as hints to guide detection/validation.
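
To make the distinction between defaults and supplements concrete, here is a small sketch of the inference steps using only the standard library. It is illustrative rather than a copy of the library's logic: infer_mimetype_charset_sketch and detect_charset_stub are hypothetical names, the location parameter name is assumed, and the detector is a trivial BOM/UTF-8 check standing in for the registered detectors.

```python
import codecs
import mimetypes
from email.message import Message


def detect_charset_stub(content):
    # Trivial stand-in for a registered charset detector.
    if content.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    try:
        content.decode('utf-8')
    except UnicodeDecodeError:
        return None
    return 'utf-8'


def infer_mimetype_charset_sketch(
    content, *, http_content_type=None, location=None,
    mimetype_default='application/octet-stream', charset_default=None,
    charset_supplement=None,
):
    mimetype = charset = None
    if http_content_type:                        # step 1: header context
        msg = Message()
        msg['Content-Type'] = http_content_type
        mimetype = msg.get_content_type()
        charset = msg.get_content_charset()
    if mimetype is None and location:            # step 2: location hint
        mimetype, _ = mimetypes.guess_type(location)
    if charset is None:                          # step 3: detection
        charset = detect_charset_stub(content)
    if charset is None and charset_supplement:   # step 5: hint guides detection
        try:
            content.decode(charset_supplement)
            charset = charset_supplement
        except (UnicodeDecodeError, LookupError):
            pass
    # Step 4: defaults affect only what is returned, never the detection.
    return mimetype or mimetype_default, charset or charset_default
```

In this sketch a supplement is actually tested against the content before it is accepted, whereas a default is returned untested when nothing else is known.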

Integration Notes

  • decode is authoritative for byte-to-text conversion and raises on irrecoverable decode failure.

  • decode_inform is intended for callers that need text plus consistent decode metadata in one call.

  • Detector registries are pluggable and backend-optional by design.

  • Trial codec ordering is behavior-driven and can be overridden by callers.
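
One way to picture behavior-driven trial ordering is a resolver that turns the CodecSpecifiers slots into concrete codec names. The sketch below is an assumption about how such slots could be resolved, not the library's code: resolve_trial_codecs and its inferred and supplement parameters are hypothetical, and mapping OsDefault to locale.getpreferredencoding and PythonDefault to sys.getdefaultencoding is a guess based on the slot names.

```python
import codecs
import locale
import sys
from enum import Enum, auto


class CodecSpecifiers(Enum):
    FromInference = auto()
    OsDefault = auto()
    PythonDefault = auto()
    UserSupplement = auto()


def resolve_trial_codecs(order, *, inferred=None, supplement=None):
    """Hypothetical resolver: map slots and literal names to codec names."""
    resolved = []
    for item in order:
        if item is CodecSpecifiers.FromInference:
            name = inferred
        elif item is CodecSpecifiers.OsDefault:
            name = locale.getpreferredencoding(False)
        elif item is CodecSpecifiers.PythonDefault:
            name = sys.getdefaultencoding()
        elif item is CodecSpecifiers.UserSupplement:
            name = supplement
        else:
            name = item  # assume a literal codec name such as 'cp1252'
        if not name:
            continue  # slot has nothing to contribute
        try:
            name = codecs.lookup(name).name  # normalize aliases
        except LookupError:
            continue  # unknown codec: skip rather than fail
        if name not in resolved:
            resolved.append(name)
    return resolved
```

A caller overriding the order would pass a different sequence, for example placing UserSupplement first so that a known legacy encoding is tried before anything inferred.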