System Overview
The detextive library consolidates text detection capabilities from multiple packages as a faithful functional reproduction of the existing implementations. The first iteration prioritizes behavioral fidelity and minimal migration effort over architectural sophistication.
Major Components
Core Detection Functions
- Public Functional API
Core detection and inference functions with confidence-aware behavior:
detect_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Character encoding detection
detect_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Charset detection with confidence scoring
detect_mimetype(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - MIME type detection
detect_mimetype_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - MIME type detection with confidence scoring
infer_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Charset inference with validation
infer_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Charset inference with confidence scoring
infer_mimetype_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Combined MIME type and charset inference
infer_mimetype_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - Combined detection with confidence scoring
decode(content, *, behaviors=BEHAVIORS_DEFAULT, ...) - High-level bytes-to-text decoding with validation
is_textual_mimetype(mimetype) - Textual MIME type validation
is_valid_text(text, profile=PROFILE_TEXTUAL) - Unicode-aware text validation
- Core Types and Configuration
Shared data structures for confidence-aware behavior:
CharsetResult(charset, confidence) - Charset detection results with confidence scoring (0.0-1.0)
MimetypeResult(mimetype, confidence) - MIME type detection results with confidence scoring (0.0-1.0)
Behaviors - Configurable detection behavior with confidence thresholds and failure handling
BehaviorTristate - When to apply behaviors (Never/AsNeeded/Always)
CodecSpecifiers - Dynamic codec resolution (FromInference/OsDefault/UserSupplement/etc.)
DetectFailureActions - Failure handling strategy (Default/Error) for graceful degradation
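To make these shapes concrete, the following is a minimal sketch of how such result and configuration types could be modeled with the standard library. Only the names `CharsetResult`, `MimetypeResult`, `Behaviors`, `BehaviorTristate`, `DetectFailureActions` and the `charset`/`mimetype`/`confidence` fields come from the list above; the `Behaviors` field names, the 0.8 threshold, and the `should_trial_decode` helper are assumptions made for illustration, not the library's definitions.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import NamedTuple


class CharsetResult(NamedTuple):
    charset: str
    confidence: float  # 0.0 (no confidence) through 1.0 (certain)


class MimetypeResult(NamedTuple):
    mimetype: str
    confidence: float  # 0.0 (no confidence) through 1.0 (certain)


class BehaviorTristate(Enum):
    Never = auto()
    AsNeeded = auto()
    Always = auto()


class DetectFailureActions(Enum):
    Default = auto()  # return a caller-supplied default value
    Error = auto()    # raise an exception


@dataclass(frozen=True)
class Behaviors:
    # Hypothetical field names and default values, for illustration only.
    trial_decode: BehaviorTristate = BehaviorTristate.AsNeeded
    trial_decode_confidence: float = 0.8
    on_detect_failure: DetectFailureActions = DetectFailureActions.Default


def should_trial_decode(result: CharsetResult, behaviors: Behaviors) -> bool:
    """Decides whether a detected charset needs verification by decoding."""
    if behaviors.trial_decode is BehaviorTristate.Always:
        return True
    if behaviors.trial_decode is BehaviorTristate.Never:
        return False
    # AsNeeded: verify only results that fall below the threshold.
    return result.confidence < behaviors.trial_decode_confidence
```

An immutable configuration object keeps each call's behavior explicit, which is one way a tristate such as AsNeeded can be resolved per result rather than globally.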
- Text Validation System
Unicode-aware text validation with configurable profiles:
TextValidationProfile - Validation rules and character acceptance policies
PROFILE_TEXTUAL - General textuality validation (lenient)
PROFILE_TERMINAL_SAFE - Terminal output safety (strict)
PROFILE_PRINTER_SAFE - Printer output safety (form feed allowed)
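As an illustration of what profile-driven validation might look like, the sketch below classifies characters by Unicode general category and lets each profile decide which control characters are acceptable. The `ValidationProfile` class, its two fields, the three profile constants, and the 5% tolerance are hypothetical stand-ins chosen to mirror the lenient, strict, and form-feed-tolerant profiles described above; they are not the library's actual rules.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationProfile:
    """Hypothetical stand-in for a text validation profile."""
    allowed_controls: frozenset = frozenset('\t\n\r')
    max_rejectable_ratio: float = 0.0  # fraction of characters allowed to fail


# Illustrative profiles; names and values are assumptions.
TERMINAL_SAFE = ValidationProfile()
PRINTER_SAFE = ValidationProfile(allowed_controls=frozenset('\t\n\r\f'))
LENIENT = ValidationProfile(max_rejectable_ratio=0.05)

# Control, surrogate, and unassigned code points count as suspicious.
_REJECTABLE_CATEGORIES = frozenset(('Cc', 'Cs', 'Cn'))


def is_valid_text_sketch(text: str, profile: ValidationProfile) -> bool:
    """Returns True when the share of rejected characters is acceptable."""
    if not text:
        return True  # assumption: empty text is treated as valid
    rejected = sum(
        1 for character in text
        if unicodedata.category(character) in _REJECTABLE_CATEGORIES
        and character not in profile.allowed_controls)
    return rejected / len(text) <= profile.max_rejectable_ratio
```

Under these assumptions, an ANSI escape sequence fails the terminal-safe profile because the escape character is a control character outside the allowed set, while a form feed passes only the printer-safe profile.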
- Line Separator Processing
Direct migration of proven enumeration and utilities:
LineSeparators enum - Detection, normalization, and nativization methods
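The three responsibilities named here (detection, normalization, nativization) can be sketched as an enumeration like the one below. The class name `LineSeparatorsSketch`, the method names, and the 1024-byte sampling limit are assumptions for illustration; the real enumeration may differ in members and signatures.

```python
from enum import Enum


class LineSeparatorsSketch(Enum):
    """Illustrative line separator enumeration (not the library's class)."""
    CR = '\r'
    CRLF = '\r\n'
    LF = '\n'

    @classmethod
    def detect(cls, content: bytes, limit: int = 1024):
        """Returns the first separator style found in the sample, or None."""
        sample = content[:limit]
        for index, byte in enumerate(sample):
            if byte == 0x0A:
                return cls.LF
            if byte == 0x0D:
                # A CR immediately followed by LF is a CRLF pair.
                if sample[index + 1:index + 2] == b'\n':
                    return cls.CRLF
                return cls.CR
        return None

    def normalize(self, text: str) -> str:
        """Converts text using this separator style to LF separators."""
        if self is LineSeparatorsSketch.LF:
            return text
        return text.replace(self.value, '\n')

    def nativize(self, text: str) -> str:
        """Converts LF-separated text to this separator style."""
        if self is LineSeparatorsSketch.LF:
            return text
        return text.replace('\n', self.value)
```

For example, `LineSeparatorsSketch.detect(b'a\r\nb')` yields the `CRLF` member, whose `normalize` and `nativize` methods then round-trip text between CRLF and LF forms.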
Component Relationships
v2.0 Layered Architecture
┌─────────────────────────────────────────────────┐
│  Public API Layer (decoders.py)                 │
│  decode() - High-level bytes-to-text function   │
└─────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────┐
│  Inference Layer (inference.py)                 │
│  infer_charset_confidence()  infer_mimetype()   │
│  Context-aware orchestration + HTTP parsing     │
└─────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────┐
│  Detection Layer (detectors.py)                 │
│  detect_charset_confidence()  detect_mimetype() │
│  Core detection with confidence scoring         │
└─────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────┐
│  Support Modules (charsets.py, validation.py)   │
│  Trial decoding + Text validation + MIME utils  │
└─────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────┐
│  External Dependencies                          │
│  chardet  charset-normalizer  puremagic         │
│  python-magic  mimetypes (stdlib)  [optional]   │
└─────────────────────────────────────────────────┘
v2.0 Data Flow
Input Processing: Functions receive byte content, behaviors configuration, optional default values, and HTTP/location context
Registry-Based Detection: Core detectors iterate through configured backends (chardet, charset-normalizer, puremagic, python-magic) returning CharsetResult/MimetypeResult objects with confidence scores
Smart Decision Making: Confidence thresholds drive AsNeeded behavior for trial decode and text validation
Failure Handling: DetectFailureActions configuration determines whether to return default values (graceful degradation) or raise exceptions
Layered Inference: Higher-level functions orchestrate detection, validation, and configurable error handling
Validated Output: Text validation ensures decoded content meets specified profiles for safety/quality
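The flow above can be sketched end to end with toy backends. In the sketch below, the detector functions, the `DetectionFailure` exception, the 0.8 threshold, and the convention that a detector returns `None` when it has no opinion are all assumptions for illustration; the sketch only shows how a confidence threshold can gate trial decoding and how a default can stand in for a failed detection.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class CharsetResult(NamedTuple):
    charset: str
    confidence: float


Detector = Callable[[bytes], Optional[CharsetResult]]


class DetectionFailure(Exception):
    """Hypothetical exception raised when no charset can be inferred."""


def ascii_detector(content: bytes) -> Optional[CharsetResult]:
    # Toy backend: certain about pure ASCII, silent otherwise.
    return CharsetResult('ascii', 1.0) if content.isascii() else None


def guessing_detector(content: bytes) -> Optional[CharsetResult]:
    # Toy backend: always guesses UTF-8 with low confidence.
    return CharsetResult('utf-8', 0.5)


def trial_decodes(content: bytes, charset: str) -> bool:
    """Verifies a candidate charset by attempting a strict decode."""
    try:
        content.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


def infer_charset_sketch(
    content: bytes,
    detectors: Sequence[Detector],
    *,
    threshold: float = 0.8,
    default: Optional[str] = None,
) -> str:
    for detector in detectors:
        result = detector(content)
        if result is None:
            continue
        if result.confidence >= threshold:
            return result.charset  # confident enough: skip trial decode
        if trial_decodes(content, result.charset):
            return result.charset  # low confidence, but verified
    # Failure handling: degrade to the default or raise.
    if default is not None:
        return default
    raise DetectionFailure('no charset could be inferred')
```

With these toy backends, ASCII input is accepted without a trial decode, valid UTF-8 is accepted after one, and undecodable bytes fall through to the default or the exception.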
Integration Patterns
- Drop-in Replacement Strategy
Existing code can replace imports with minimal changes:
# Before:
from mimeogram.acquirers import _detect_charset
# After:
from detextive import detect_charset
charset = detect_charset(content_bytes)
- Behavioral Fidelity
Preserves exact existing behavior:
UTF-8 bias with validation from mimeogram charset detection
Extensible textual MIME type patterns from all implementations
Fallback chains (puremagic → mimetypes) from mimeogram
Complex parameter handling from detect_mimetype_and_charset
Heuristic validation from is_reasonable_text_content
Error handling and exception types maintained
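The fallback chain named in this list (content-based detection first, extension-based lookup second) can be illustrated with the standard library alone. In the sketch below, a small table of magic-byte prefixes stands in for puremagic so that the example runs without third-party packages; the table, the function names, and the `application/octet-stream` default are assumptions, while `mimetypes.guess_type` is the real standard library call.

```python
import mimetypes
from typing import Optional

# Tiny stand-in for a magic-byte backend such as puremagic.
_MAGIC_PREFIXES = (
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'%PDF-', 'application/pdf'),
    (b'\xff\xd8\xff', 'image/jpeg'),
)


def detect_by_magic(content: bytes) -> Optional[str]:
    """Returns a MIME type when the content starts with a known signature."""
    for prefix, mimetype in _MAGIC_PREFIXES:
        if content.startswith(prefix):
            return mimetype
    return None


def detect_mimetype_sketch(
    content: bytes,
    location: str,
    default: str = 'application/octet-stream',
) -> str:
    # First link in the chain: inspect the content itself.
    mimetype = detect_by_magic(content)
    if mimetype is None:
        # Second link: fall back to the file extension.
        mimetype, _encoding = mimetypes.guess_type(location)
    return mimetype or default
```

Content signatures take precedence here because a file extension can be missing or misleading, whereas plain text usually has no signature and so depends on the extension fallback.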
- Implementation Strategy
Direct consolidation of proven function logic
Minimal abstraction to preserve performance characteristics
Same dependencies and detection libraries as existing implementations
Architectural Patterns
- Faithful Functional Reproduction
Direct consolidation of existing functional implementations without architectural changes (see ADR-001).
- Consolidation Pattern
Multiple implementations merged into single functions:
chardet: Statistical charset detection with UTF-8 bias
puremagic: Pure Python magic byte detection (primary)
mimetypes: Standard library extension-based fallback
LineSeparators: Byte-level line ending detection and normalization
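One way to read "UTF-8 bias" is that a clean UTF-8 decode overrides whatever a statistical detector suggests. The sketch below shows that reading; the function name and the choice to return `None` when nothing decodes are assumptions, and the detector's suggestion is passed in as a plain string so the example does not depend on chardet being installed.

```python
from typing import Optional


def charset_with_utf8_bias(
    content: bytes, detected: Optional[str]
) -> Optional[str]:
    """Prefers UTF-8 whenever the content decodes cleanly as UTF-8."""
    try:
        content.decode('utf-8')
    except UnicodeDecodeError:
        pass
    else:
        return 'utf-8'
    if detected is None:
        return None
    # Validate the detector's suggestion before trusting it.
    try:
        content.decode(detected)
    except (UnicodeDecodeError, LookupError):
        return None
    return detected
```

For instance, UTF-8 bytes that a detector mislabels as ISO-8859-1 are still reported as UTF-8, while genuine ISO-8859-1 bytes that are invalid as UTF-8 keep the detector's answer.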
- v2.0 Evolution
ADR-003 and ADR-006 document the context-aware detection architecture for v2.0 that addresses real-world integration challenges:
Context-driven detection utilizing HTTP headers, location, and content analysis
Confidence-based result types with specific CharsetResult/MimetypeResult objects
Configurable validation behaviors for performance and security requirements
Default return behavior pattern enabling graceful degradation for detection failures
Enhanced function interfaces maintaining backward compatibility
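As a small illustration of the HTTP side of context-driven detection, a `Content-Type` header value can be split into MIME type and charset with the standard library's `email.message.Message`. The two function names below and the rule that a declared, recognized charset takes precedence over a content-based guess are assumptions for this sketch, not a description of the library's precedence rules.

```python
import codecs
from email.message import Message
from typing import Optional, Tuple


def parse_http_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Splits a Content-Type header value into (mimetype, charset)."""
    message = Message()
    message['Content-Type'] = header_value
    # get_content_type() falls back to 'text/plain' for malformed values.
    return message.get_content_type(), message.get_content_charset()


def charset_from_context(
    http_content_type: Optional[str], detected: Optional[str]
) -> Optional[str]:
    """Prefers a declared, recognized charset over a detected one."""
    if http_content_type:
        _mimetype, declared = parse_http_content_type(http_content_type)
        if declared:
            try:
                return codecs.lookup(declared).name
            except LookupError:
                pass  # unknown codec name: ignore the declaration
    return detected
```

A header such as `text/html; charset=UTF-8` parses to the pair `('text/html', 'utf-8')`, and an unrecognized charset name in the header leaves the content-based result in place.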
- Detector Registry Architecture
ADR-002 documents the implemented pluggable backend system:
Dynamic detector registration with type aliases for CharsetDetector/MimetypeDetector functions
Configurable detector precedence via Behaviors.charset_detectors_order and Behaviors.mimetype_detectors_order
Graceful degradation with NotImplemented return pattern for missing optional dependencies
Registry dictionaries (charset_detectors, mimetype_detectors) enabling runtime backend selection