System Overview

The detextive library consolidates text detection capabilities from multiple packages into a single faithful functional reproduction of their behavior. The first iteration prioritizes behavioral fidelity and minimal migration effort over architectural sophistication.

Major Components

Core Detection Functions

Public Functional API

Core detection and inference functions with confidence-aware behavior:

detect_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Character encoding detection
detect_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Charset detection with confidence scoring
detect_mimetype(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - MIME type detection
detect_mimetype_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - MIME type detection with confidence scoring
infer_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Charset inference with validation
infer_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Charset inference with confidence scoring
infer_mimetype_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Combined MIME type and charset inference
infer_mimetype_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Combined detection with confidence scoring
decode(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - High-level bytes-to-text decoding with validation
is_textual_mimetype(mimetype) - Textual MIME type validation
is_valid_text(text, profile=PROFILE_TEXTUAL) - Unicode-aware text validation

Core Types and Configuration

Shared data structures for confidence-aware behavior:

CharsetResult(charset, confidence) - Charset detection results with confidence scoring (0.0-1.0)
MimetypeResult(mimetype, confidence) - MIME type detection results with confidence scoring (0.0-1.0)
Behaviors - Configurable detection behavior with confidence thresholds and failure handling
BehaviorTristate - When to apply behaviors (Never/AsNeeded/Always)
CodecSpecifiers - Dynamic codec resolution (FromInference/OsDefault/UserSupplement/etc.)
DetectFailureActions - Failure handling strategy (Default/Error) for graceful degradation
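To make the shape of these types concrete, here is a minimal sketch, not the library's actual definitions. The type names and the `charset_detectors_order` field come from this document; every other field name, the default values, and the `should_apply` helper are assumptions made for illustration.

```python
import enum
from dataclasses import dataclass
from typing import NamedTuple, Optional


class CharsetResult(NamedTuple):
    charset: Optional[str]  # assumption: None when nothing was detected
    confidence: float       # 0.0 (no confidence) to 1.0 (certain)


class MimetypeResult(NamedTuple):
    mimetype: str
    confidence: float       # 0.0 to 1.0


class BehaviorTristate(enum.Enum):
    Never = enum.auto()
    AsNeeded = enum.auto()
    Always = enum.auto()


class DetectFailureActions(enum.Enum):
    Default = enum.auto()   # return a default value (graceful degradation)
    Error = enum.auto()     # raise an exception


@dataclass(frozen=True)
class Behaviors:
    # Hypothetical field names and defaults, except charset_detectors_order,
    # which this document names.
    trial_decode: BehaviorTristate = BehaviorTristate.AsNeeded
    trial_decode_confidence: float = 0.80
    charset_on_detect_failure: DetectFailureActions = DetectFailureActions.Default
    charset_detectors_order: tuple = ('chardet', 'charset-normalizer')


def should_apply(tristate, confidence, threshold):
    """Hypothetical helper: decide whether an optional step runs."""
    if tristate is BehaviorTristate.Always:
        return True
    if tristate is BehaviorTristate.Never:
        return False
    # AsNeeded: run the step only when detection was not confident enough.
    return confidence < threshold
```

Under these assumptions, `should_apply(BehaviorTristate.AsNeeded, 0.5, 0.8)` is true, so a low-confidence detection would trigger the extra step, while a 0.9-confidence result would skip it.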

Text Validation System

Unicode-aware text validation with configurable profiles:

TextValidationProfile - Validation rules and character acceptance policies
PROFILE_TEXTUAL - General textuality validation (lenient)
PROFILE_TERMINAL_SAFE - Terminal output safety (strict)
PROFILE_PRINTER_SAFE - Printer output safety (form feed allowed)
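One way profile-driven validation could work is sketched below. This is an illustration under stated assumptions, not the library's implementation: the `ProfileSketch` fields, the three `*_SKETCH` constants, the rejection ratios, and `is_valid_text_sketch` are all hypothetical stand-ins for `TextValidationProfile`, the `PROFILE_*` constants, and `is_valid_text`.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSketch:
    # Hypothetical fields; the real TextValidationProfile may differ.
    acceptable_controls: frozenset  # control characters that are allowed
    max_rejectable_ratio: float     # tolerated fraction of rejected characters


# Lenient: common whitespace controls, a few stray characters tolerated.
TEXTUAL_SKETCH = ProfileSketch(frozenset('\t\n\r'), 0.05)
# Strict: nothing that could drive a terminal (e.g. ESC sequences).
TERMINAL_SAFE_SKETCH = ProfileSketch(frozenset('\t\n'), 0.0)
# Strict, but form feed is allowed for page breaks.
PRINTER_SAFE_SKETCH = ProfileSketch(frozenset('\t\n\f'), 0.0)

# Unicode general categories rejected in this sketch:
# Cc = control, Cs = surrogate, Co = private use, Cn = unassigned.
_REJECTED_CATEGORIES = ('Cc', 'Cs', 'Co', 'Cn')


def is_valid_text_sketch(text, profile=TEXTUAL_SKETCH):
    """Return True if the share of rejected characters is within the profile."""
    if not text:
        return True  # assumption: empty text counts as valid
    rejected = 0
    for char in text:
        if char in profile.acceptable_controls:
            continue
        if unicodedata.category(char) in _REJECTED_CATEGORIES:
            rejected += 1
    return rejected / len(text) <= profile.max_rejectable_ratio
```

With these definitions, a string containing a form feed passes `PRINTER_SAFE_SKETCH` but fails `TERMINAL_SAFE_SKETCH`, which mirrors the distinction drawn in the list above.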

Line Separator Processing

Direct migration of the existing, proven enumeration and its utilities:

LineSeparators enum - Detection, normalization, and nativization methods
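The three responsibilities named above (detection, normalization, nativization) might look roughly like the following sketch. The class name `LineSeparatorsSketch`, its method names, and the 1024-byte sampling limit are assumptions; the real `LineSeparators` enum may expose a different interface.

```python
import enum


class LineSeparatorsSketch(enum.Enum):
    CR = '\r'      # classic Mac OS
    CRLF = '\r\n'  # Windows, many network protocols
    LF = '\n'      # Unix

    @classmethod
    def detect_bytes(cls, content, limit=1024):
        """Return the first separator style found in the leading bytes, or None."""
        sample = content[:limit]
        for index, byte in enumerate(sample):
            if byte == 0x0A:
                return cls.LF
            if byte == 0x0D:
                # A CR immediately followed by LF is a CRLF pair.
                if sample[index + 1:index + 2] == b'\n':
                    return cls.CRLF
                return cls.CR
        return None

    def normalize(self, text):
        """Convert text that uses this separator style to LF."""
        if self is LineSeparatorsSketch.LF:
            return text
        return text.replace(self.value, '\n')

    def nativize(self, text):
        """Convert LF-normalized text to this separator style."""
        if self is LineSeparatorsSketch.LF:
            return text
        return text.replace('\n', self.value)
```

A typical round trip would detect the style from the raw bytes, normalize the decoded text to LF for processing, and nativize it again before writing it back.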

Component Relationships

v2.0 Layered Architecture

┌─────────────────────────────────────────────────┐
│        Public API Layer (decoders.py)         │
│  decode() - High-level bytes-to-text function  │
└─────────────────────────────────────────────────┘
                        │
┌─────────────────────────────────────────────────┐
│     Inference Layer (inference.py)            │
│  infer_charset_confidence()  infer_mimetype()   │
│  Context-aware orchestration + HTTP parsing    │
└─────────────────────────────────────────────────┘
                        │
┌─────────────────────────────────────────────────┐
│   Detection Layer (detectors.py)              │
│  detect_charset_confidence()  detect_mimetype() │
│  Core detection with confidence scoring        │
└─────────────────────────────────────────────────┘
                        │
┌─────────────────────────────────────────────────┐
│  Support Modules (charsets.py, validation.py) │
│  Trial decoding + Text validation + MIME utils │
└─────────────────────────────────────────────────┘
                        │
┌─────────────────────────────────────────────────┐
│            External Dependencies               │
│  chardet  charset-normalizer  puremagic        │
│  python-magic  mimetypes (stdlib) [optional]   │
└─────────────────────────────────────────────────┘

v2.0 Data Flow

Input Processing: Functions receive byte content, behaviors configuration, optional default values, and HTTP/location context
Registry-Based Detection: Core detectors iterate through configured backends (chardet, charset-normalizer, puremagic, python-magic) returning CharsetResult/MimetypeResult objects with confidence scores
Smart Decision Making: Confidence thresholds drive AsNeeded behavior for trial decode and text validation
Failure Handling: DetectFailureActions configuration determines whether to return default values (graceful degradation) or raise exceptions
Layered Inference: Higher-level functions orchestrate detection, validation, and configurable error handling
Validated Output: Text validation ensures decoded content meets specified profiles for safety/quality
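The flow above can be sketched in a few lines. This is a simplified illustration, not the library's code: the two toy backends, the `DetectionFailure` exception, the parameter names, and the 0.8 threshold are all assumptions, and `CharsetResult` here is a local stand-in for the type described earlier.

```python
from typing import NamedTuple, Optional


class CharsetResult(NamedTuple):  # local stand-in for the documented type
    charset: Optional[str]
    confidence: float


class DetectionFailure(Exception):
    """Hypothetical exception raised for the 'Error' failure action."""


def utf8_probe(content):
    """Toy backend: full confidence for valid UTF-8, otherwise no opinion."""
    try:
        content.decode('utf-8')
    except UnicodeDecodeError:
        return None
    return CharsetResult('utf-8', 1.0)


def low_confidence_guess(content):
    """Toy backend: always guesses cp1252 with low confidence."""
    return CharsetResult('cp1252', 0.4)


def detect_charset_sketch(content, *, detectors, threshold=0.8,
                          default='utf-8', error_on_failure=False):
    # Registry-based detection: try each backend in the configured order.
    for detector in detectors:
        result = detector(content)
        if result is None:
            continue
        # AsNeeded behavior: only low-confidence results pay for a trial decode.
        if result.confidence < threshold:
            try:
                content.decode(result.charset)
            except (UnicodeDecodeError, LookupError):
                continue  # guess did not survive validation; try next backend
        return result
    # Failure handling: raise, or degrade gracefully to the default.
    if error_on_failure:
        raise DetectionFailure('no charset detected')
    return CharsetResult(default, 0.0)
```

In this sketch a confident result is returned immediately, a low-confidence guess must survive a trial decode, and exhausting every backend either raises or returns the default with zero confidence.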

Integration Patterns

Drop-in Replacement Strategy

Existing code can migrate by replacing its imports, with minimal other changes:

# Before: from mimeogram.acquirers import _detect_charset
# After:  from detextive import detect_charset
charset = detect_charset(content_bytes)

Behavioral Fidelity

The consolidated functions preserve existing behavior exactly:

UTF-8 bias with validation from mimeogram charset detection
Extensible textual MIME type patterns from all implementations
Fallback chains (puremagic → mimetypes) from mimeogram
Complex parameter handling from detect_mimetype_and_charset
Heuristic validation from is_reasonable_text_content
Error handling and exception types maintained
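As an illustration of the fallback-chain idea (content-based detection first, extension-based lookup second), here is a small sketch. It does not call puremagic: a tiny hand-written signature table stands in for it, and the function names and the `application/octet-stream` default are assumptions. Only `mimetypes.guess_type` is a real standard-library call.

```python
import mimetypes

# A tiny magic-number table standing in for puremagic (illustrative only).
_MAGIC_SIGNATURES = (
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'%PDF-', 'application/pdf'),
    (b'GIF89a', 'image/gif'),
)


def detect_by_magic(content):
    """Return a MIME type from leading magic bytes, or None if unrecognized."""
    for signature, mimetype in _MAGIC_SIGNATURES:
        if content.startswith(signature):
            return mimetype
    return None


def detect_by_extension(location):
    """Return a MIME type guessed from the file name, or None."""
    if location is None:
        return None
    mimetype, _encoding = mimetypes.guess_type(str(location))
    return mimetype


def detect_mimetype_sketch(content, location=None,
                           default='application/octet-stream'):
    # Content evidence wins; the extension is consulted only as a fallback.
    return detect_by_magic(content) or detect_by_extension(location) or default
```

Ordering the chain this way means a mislabeled file (for example, PDF bytes saved with a `.txt` name) is still classified by its content.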

Implementation Strategy

Direct consolidation of proven function logic
Minimal abstraction to preserve performance characteristics
Same dependencies and detection libraries as existing implementations

Architectural Patterns

Faithful Functional Reproduction

Direct consolidation of existing functional implementations without architectural changes (see ADR-001).

Consolidation Pattern

Multiple implementations merged into single functions:

chardet: Statistical charset detection with UTF-8 bias
puremagic: Pure Python magic byte detection (primary)
mimetypes: Standard library extension-based fallback
LineSeparators: Byte-level line ending detection and normalization

v2.0 Evolution

ADR-003 and ADR-006 document the context-aware detection architecture for v2.0 that addresses real-world integration challenges:

Context-driven detection utilizing HTTP headers, location, and content analysis
Confidence-based result types with specific CharsetResult/MimetypeResult objects
Configurable validation behaviors for performance and security requirements
Default return behavior pattern enabling graceful degradation for detection failures
Enhanced function interfaces maintaining backward compatibility
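To illustrate the context-driven idea, the sketch below extracts a MIME type and charset from an HTTP `Content-Type` header using the standard library's `email.message.Message`, then trusts the declared charset only if the content actually decodes with it. The function names and the fallback-to-default policy are assumptions; in a fuller pipeline the fallback would presumably be content-based detection rather than a fixed default.

```python
from email.message import Message


def parse_http_content_type(header_value):
    """Split a Content-Type header value into (mimetype, charset or None)."""
    message = Message()
    message['Content-Type'] = header_value
    return message.get_content_type(), message.get_content_charset()


def charset_from_context(content, http_content_type=None, default='utf-8'):
    """Prefer a charset declared by HTTP context, if the content agrees with it."""
    if http_content_type:
        _mimetype, charset = parse_http_content_type(http_content_type)
        if charset:
            try:
                content.decode(charset)  # validate the declaration
                return charset
            except (UnicodeDecodeError, LookupError):
                pass  # declared charset is wrong or unknown; ignore it
    return default
```

Validating the declared charset guards against servers that send an incorrect header, which is one of the integration problems context-aware detection has to tolerate.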

Detector Registry Architecture

ADR-002 documents the implemented pluggable backend system:

Dynamic detector registration with type aliases for CharsetDetector/MimetypeDetector functions
Configurable detector precedence via Behaviors.charset_detectors_order and mimetype_detectors_order
Graceful degradation with NotImplemented return pattern for missing optional dependencies
Registry dictionaries (charset_detectors, mimetype_detectors) enabling runtime backend selection