System Overview
The detextive library consolidates MIME detection, charset inference, text decoding, and line-separator utilities behind a unified functional API.
Major Components
Public API
The public API is composed of confidence-aware detection functions, inference orchestration functions, and high-level decode functions:
- detect_charset / detect_charset_confidence
- detect_mimetype / detect_mimetype_confidence
- infer_charset / infer_charset_confidence
- infer_mimetype_charset / infer_mimetype_charset_confidence
- decode
- decode_inform
- is_textual_mimetype
- is_valid_text
- LineSeparators utilities
Core Types and Configuration
- Behaviors - policy object controlling parse/detect/trial/validation behaviors and confidence thresholds.
- BehaviorTristate - execution mode for selected behavior paths (Never/AsNeeded/Always).
- DetectFailureActions - fallback policy on detector failure (Default/Error).
- CodecSpecifiers - dynamic trial codec slots (FromInference/OsDefault/PythonDefault/UserSupplement).
- CharsetResult - charset with confidence score.
- MimetypeResult - MIME type with confidence score.
- DecodeInformResult - decoded text plus charset/mimetype/line-separator metadata.
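To make the relationship between these types concrete, the sketch below models a few of them with plain enums and dataclasses. It is not the library's code: the type names come from the list above, but the field names on Behaviors (other than text_validate and text_validate_confidence), the default threshold of 0.80, and the reading of AsNeeded as "validate only when confidence is below the threshold" are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import NamedTuple, Optional


class BehaviorTristate(Enum):
    """When an optional behavior path should run."""
    Never = auto()
    AsNeeded = auto()
    Always = auto()


class DetectFailureActions(Enum):
    """What to do when a detector cannot produce a result."""
    Default = auto()  # fall back to a caller-supplied default
    Error = auto()    # raise instead of falling back


class CharsetResult(NamedTuple):
    """Charset plus a confidence score in [0.0, 1.0]."""
    charset: Optional[str]
    confidence: float


@dataclass(frozen=True)
class Behaviors:
    # Hypothetical fields; only the two text_validate names appear in this document.
    charset_detect: BehaviorTristate = BehaviorTristate.AsNeeded
    charset_on_detect_failure: DetectFailureActions = DetectFailureActions.Default
    text_validate: BehaviorTristate = BehaviorTristate.AsNeeded
    text_validate_confidence: float = 0.80  # assumed default


def should_validate(behaviors: Behaviors, result: CharsetResult) -> bool:
    """Decide whether text validation runs for a given detection result."""
    mode = behaviors.text_validate
    if mode is BehaviorTristate.Never:
        return False
    if mode is BehaviorTristate.Always:
        return True
    # AsNeeded (assumed meaning): validate only when confidence is too low.
    return result.confidence < behaviors.text_validate_confidence
```

With this modeling, a caller tightens or relaxes validation by constructing a different Behaviors value, for example Behaviors(text_validate=BehaviorTristate.Always), rather than by passing extra flags to each function.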
Layered Runtime Architecture
┌──────────────────────────────────────────────────────┐
│ Public API (__init__.py re-exports)                  │
└──────────────────────────────────────────────────────┘
                           │
┌──────────────────────────────────────────────────────┐
│ Decoding Layer (decoders.py)                         │
│   decode(), decode_inform()                          │
│   - HTTP Content-Type parse + charset-first attempt  │
│   - detector-assisted trial decode + text validation │
│   - optional MIME/line-separator metadata            │
└──────────────────────────────────────────────────────┘
                           │
┌──────────────────────────────────────────────────────┐
│ Inference Layer (inference.py)                       │
│   infer_*() orchestration + header/location context  │
└──────────────────────────────────────────────────────┘
                           │
┌──────────────────────────────────────────────────────┐
│ Detection Layer (detectors.py)                       │
│   detector registries + confidence results           │
└──────────────────────────────────────────────────────┘
                           │
┌──────────────────────────────────────────────────────┐
│ Support Layer                                        │
│   charsets.py, mimetypes.py, validation.py,          │
│   lineseparators.py                                  │
└──────────────────────────────────────────────────────┘
Decoder Flow (v3)
decode and decode_inform share the same decoding core:
1. Parse http_content_type when provided.
2. If header MIME is non-textual, raise ContentDecodeImpossibility.
3. If header charset is textual and decodable, decode with that charset first.
4. Otherwise, run detector-assisted trial decodes in configured codec order.
5. Apply text validation according to Behaviors.text_validate and Behaviors.text_validate_confidence.
6. Return text (decode) or structured metadata (decode_inform).
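The flow above can be approximated with the standard library alone. The sketch below is a simplified stand-in, not the library's decoder: decode_sketch, parse_content_type, is_textual, and looks_like_text are hypothetical helpers, the fixed trial_codecs tuple replaces detector-assisted codec selection, and the exception class is redefined locally only so the example runs on its own.

```python
import unicodedata
from email.message import Message
from typing import Optional, Sequence, Tuple


class ContentDecodeImpossibility(Exception):
    """Local stand-in for the exception named in the flow."""


def parse_content_type(header: str) -> Tuple[str, Optional[str]]:
    """Split an HTTP Content-Type value into (mimetype, charset)."""
    message = Message()
    message["Content-Type"] = header
    return message.get_content_type(), message.get_content_charset()


def is_textual(mimetype: str) -> bool:
    """Simplified textual-MIME check (assumption: a small allow-list)."""
    return mimetype.startswith("text/") or mimetype in {
        "application/json", "application/xml"}


def looks_like_text(text: str) -> bool:
    """Reject decodes containing control characters other than whitespace."""
    return not any(
        unicodedata.category(ch) == "Cc" and ch not in "\t\n\r\f"
        for ch in text)


def decode_sketch(
    content: bytes,
    http_content_type: Optional[str] = None,
    trial_codecs: Sequence[str] = ("utf-8", "cp1252"),
) -> str:
    candidates = list(trial_codecs)
    if http_content_type:
        mimetype, charset = parse_content_type(http_content_type)
        if not is_textual(mimetype):
            raise ContentDecodeImpossibility(mimetype)
        if charset:
            # Header charset gets the first attempt.
            candidates.insert(0, charset)
    for codec in candidates:
        try:
            text = content.decode(codec)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec or undecodable bytes: try the next one
        if looks_like_text(text):
            return text
    raise ContentDecodeImpossibility("no trial codec produced valid text")
```

For instance, decode_sketch(b"caf\xe9") fails the UTF-8 attempt and succeeds with cp1252, while passing "image/png" as the header raises before any decode is tried.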
Inference Flow
infer_* functions use contextual hints and detection orchestration:
- Optionally parse http_content_type depending on behavior settings.
- Consider location-based MIME hints.
- Run registered detectors for MIME and charset as configured.
- Apply *_default values only for fallback return semantics.
- Use *_supplement values as hints to guide detection/validation.
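One way to picture how these inputs combine is shown below for the MIME half of inference. Everything here is an illustrative assumption rather than the library's algorithm: infer_mimetype_sketch and signature_detector are hypothetical names, the confidence numbers are invented, and treating the supplement as a fixed confidence boost is only one possible reading of "hint". The location hint uses the standard library's mimetypes.guess_type.

```python
import mimetypes
from typing import Callable, List, NamedTuple, Optional, Sequence


class MimetypeResult(NamedTuple):
    mimetype: str
    confidence: float


Detector = Callable[[bytes], Optional[MimetypeResult]]


def signature_detector(content: bytes) -> Optional[MimetypeResult]:
    """Toy content detector that recognises two signatures."""
    if content.startswith(b"%PDF-"):
        return MimetypeResult("application/pdf", 0.99)
    if content.lstrip().startswith((b"{", b"[")):
        return MimetypeResult("application/json", 0.4)
    return None


def infer_mimetype_sketch(
    content: bytes,
    *,
    http_content_type: Optional[str] = None,
    location: Optional[str] = None,
    mimetype_default: str = "application/octet-stream",
    mimetype_supplement: Optional[str] = None,
    detectors: Sequence[Detector] = (signature_detector,),
) -> MimetypeResult:
    candidates: List[MimetypeResult] = []
    if http_content_type:
        declared = http_content_type.split(";", 1)[0].strip().lower()
        if declared:
            candidates.append(MimetypeResult(declared, 0.9))
    if location:
        guessed, _encoding = mimetypes.guess_type(location)
        if guessed:
            candidates.append(MimetypeResult(guessed, 0.5))
    for detector in detectors:
        found = detector(content)
        if found is not None:
            candidates.append(found)
    if mimetype_supplement:
        # Hint only: favor candidates that agree with the supplement.
        candidates = [
            MimetypeResult(c.mimetype, min(1.0, c.confidence + 0.2))
            if c.mimetype == mimetype_supplement else c
            for c in candidates
        ]
    if not candidates:
        # The default is used purely as the fallback return value.
        return MimetypeResult(mimetype_default, 0.0)
    return max(candidates, key=lambda c: c.confidence)
```

The distinction the sketch tries to preserve is that the default never competes with real evidence, whereas the supplement can tip the balance between candidates that detection already produced.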
Integration Notes
- decode is authoritative for byte-to-text conversion and raises on irrecoverable decode failure.
- decode_inform is intended for callers that need text plus consistent decode metadata in one call.
- Detector registries are pluggable and backend-optional by design.
- Trial codec ordering is behavior-driven and can be overridden by callers.
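The idea of a pluggable, backend-optional registry with caller-controlled ordering can be sketched as follows. This is a generic pattern, not the library's registry: charset_detectors, register, and detect_charset_sketch are hypothetical names, and the confidence values are invented. The chardet backend is registered only if that package is importable, which is what "backend-optional" is taken to mean here.

```python
import importlib.util
from typing import Callable, Dict, Optional, Sequence, Tuple

CharsetGuess = Tuple[str, float]
CharsetDetector = Callable[[bytes], Optional[CharsetGuess]]

# Name -> detector callable; callers may add their own entries.
charset_detectors: Dict[str, CharsetDetector] = {}


def register(name: str) -> Callable[[CharsetDetector], CharsetDetector]:
    """Decorator that adds a detector to the registry under a name."""
    def decorator(detector: CharsetDetector) -> CharsetDetector:
        charset_detectors[name] = detector
        return detector
    return decorator


@register("bom")
def detect_bom(content: bytes) -> Optional[CharsetGuess]:
    if content.startswith(b"\xef\xbb\xbf"):
        return ("utf-8-sig", 1.0)
    return None


@register("utf8-trial")
def detect_utf8(content: bytes) -> Optional[CharsetGuess]:
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return ("utf-8", 0.8)


# Optional backend: registered only when the package is installed.
if importlib.util.find_spec("chardet") is not None:
    @register("chardet")
    def detect_chardet(content: bytes) -> Optional[CharsetGuess]:
        import chardet
        guess = chardet.detect(content)
        if guess["encoding"] is None:
            return None
        return (guess["encoding"], guess["confidence"])


def detect_charset_sketch(
    content: bytes,
    order: Sequence[str] = ("bom", "utf8-trial", "chardet"),
) -> Optional[CharsetGuess]:
    """Run detectors in the given order; skip names that are not registered."""
    for name in order:
        detector = charset_detectors.get(name)
        if detector is None:
            continue  # backend missing or never registered
        result = detector(content)
        if result is not None:
            return result
    return None
```

Because the order is just a sequence of names, a caller can reorder or narrow it per call, and a missing backend degrades to a skipped entry instead of an import error.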