System Overview
The detextive library consolidates text detection capabilities from multiple packages as a faithful functional reproduction of their existing behavior. The first iteration prioritizes behavioral fidelity and minimal migration effort over architectural sophistication.
Major Components
Core Detection Functions
- Public Functional API
Direct consolidation of proven functions providing drop-in compatibility:
detect_charset(content) - Character encoding with UTF-8 bias
detect_mimetype(content, location) - MIME type with fallback chains
detect_mimetype_and_charset(content, location, *, mimetype=absent, charset=absent) - Complex parameter handling from mimeogram
is_textual_mimetype(mimetype) - Textual MIME type validation
is_reasonable_text_content(content) - Heuristic text vs binary
- Line Separator Processing
Direct migration of proven enumeration and utilities:
LineSeparators enum - Detection, normalization, and nativization methods
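The three responsibilities named for the LineSeparators enum (detection, normalization, nativization) can be illustrated with a small sketch. This is not the library's implementation: the class name `LineSeparatorsSketch`, its method names, and the `limit` parameter are assumptions made for illustration only.

```python
import enum


class LineSeparatorsSketch(enum.Enum):
    """Hypothetical stand-in for a LineSeparators-style enum."""

    CR = b'\r'      # classic Mac OS
    CRLF = b'\r\n'  # Windows
    LF = b'\n'      # Unix

    @classmethod
    def detect(cls, content: bytes, limit: int = 1024):
        """Return the first separator style found, or None if there is none."""
        for index, byte in enumerate(content[:limit]):
            if byte == 0x0A:
                return cls.LF
            if byte == 0x0D:
                # A CR immediately followed by LF is a single CRLF separator.
                if content[index + 1:index + 2] == b'\n':
                    return cls.CRLF
                return cls.CR
        return None

    def normalize(self, text: str) -> str:
        """Convert this separator style to LF."""
        if self is LineSeparatorsSketch.LF:
            return text
        return text.replace(self.value.decode('ascii'), '\n')

    def nativize(self, text: str) -> str:
        """Convert LF in normalized text back to this separator style."""
        if self is LineSeparatorsSketch.LF:
            return text
        return text.replace('\n', self.value.decode('ascii'))
```

In this sketch, detection works on bytes while normalization and nativization work on decoded text; whether the real enum draws the boundary the same way is not stated here.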
Component Relationships
Functional Architecture
┌─────────────────────────────────────────────────┐
│                Public Functions                 │
│  detect_mimetype()   detect_charset()   etc...  │
└─────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────┐
│          Consolidated Detection Logic           │
│     Faithful reproduction of existing logic     │
└─────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────┐
│              External Dependencies              │
│    chardet    puremagic    mimetypes (stdlib)   │
└─────────────────────────────────────────────────┘
Data Flow
Input Processing: Functions receive byte content and optional metadata
Direct Analysis: Functions apply statistical analysis, pattern matching, and heuristics using consolidated logic from existing implementations
Validated Logic: All detection behavior is reproduced exactly from the proven mimeogram, cache proxy, and ai-experiments implementations
Output: Return values and types are identical to those of the existing implementations
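The flow from byte content and an optional location to a detected MIME type can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the consolidated logic itself: the function name `detect_mimetype_sketch`, the tiny signature table standing in for puremagic, and the `application/octet-stream` default are all hypothetical.

```python
import mimetypes

# Hypothetical, deliberately tiny signature table standing in for a
# magic-byte library such as puremagic.
_MAGIC_SIGNATURES = (
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'%PDF-', 'application/pdf'),
    (b'GIF87a', 'image/gif'),
    (b'GIF89a', 'image/gif'),
)


def detect_mimetype_sketch(content: bytes, location: str) -> str:
    """Illustrative fallback chain: magic bytes, then extension, then default."""
    # Step 1: content-based detection from leading magic bytes.
    for signature, mimetype in _MAGIC_SIGNATURES:
        if content.startswith(signature):
            return mimetype
    # Step 2: extension-based fallback via the standard library.
    guessed, _encoding = mimetypes.guess_type(location)
    if guessed is not None:
        return guessed
    # Step 3: assumed last-resort default for unidentified content.
    return 'application/octet-stream'
```

Checking content before the file extension means a misnamed file (for example, PNG bytes saved as `misnamed.txt`) is still reported by what it contains.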
Integration Patterns
- Drop-in Replacement Strategy
Existing code can replace imports with minimal changes:
# Before: from mimeogram.acquirers import _detect_charset
# After:
from detextive import detect_charset
charset = detect_charset(content_bytes)
- Behavioral Fidelity
Preserves exact existing behavior:
UTF-8 bias with validation from mimeogram charset detection
Extensible textual MIME type patterns from all implementations
Fallback chains (puremagic → mimetypes) from mimeogram
Complex parameter handling from detect_mimetype_and_charset
Heuristic validation from is_reasonable_text_content
Error handling and exception types maintained
- Implementation Strategy
Direct consolidation of proven function logic
Minimal abstraction to preserve performance characteristics
Same dependencies and detection libraries as existing implementations
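Three of the preserved behaviors listed under Behavioral Fidelity (UTF-8 bias, heuristic text validation, and `absent` defaults that differ from `None`) can be shown together in a rough sketch. Everything here is an assumption for illustration: the `_Absent` sentinel class, the `*_sketch` function names, the 0.9 printable-character threshold, and the `text/plain` and `application/octet-stream` results are hypothetical and are not taken from mimeogram or detextive.

```python
class _Absent:
    """Hypothetical sentinel type meaning 'argument not supplied'."""

    def __repr__(self) -> str:
        return 'absent'


absent = _Absent()


def detect_charset_sketch(content: bytes):
    """UTF-8 bias: accept UTF-8 whenever the bytes decode cleanly."""
    try:
        content.decode('utf-8')
    except UnicodeDecodeError:
        # A statistical detector such as chardet would be consulted here;
        # this sketch simply reports that no charset was determined.
        return None
    return 'utf-8'


def is_reasonable_text_sketch(content: bytes, charset: str) -> bool:
    """Heuristic check: decodable, non-empty, and mostly printable."""
    try:
        text = content.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    if not text:
        return False  # assumption: empty content is not treated as text
    printable = sum(1 for ch in text if ch.isprintable() or ch in '\t\r\n')
    return printable / len(text) >= 0.9  # assumed threshold


def detect_pair_sketch(content: bytes, *, mimetype=absent, charset=absent):
    """Detect only the values the caller did not supply."""
    if charset is absent:
        charset = detect_charset_sketch(content)
    if mimetype is absent:
        textual = (
            charset is not None
            and is_reasonable_text_sketch(content, charset))
        mimetype = 'text/plain' if textual else 'application/octet-stream'
    return mimetype, charset
```

A dedicated sentinel lets a caller pass `charset=None` to mean "known to have no charset" without triggering detection, which a plain `None` default could not express.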
Architectural Patterns
- Faithful Functional Reproduction
Direct consolidation of existing functional implementations without architectural changes (see ADR-001).
- Consolidation Pattern
Multiple implementations merged into single functions:
chardet: Statistical charset detection with UTF-8 bias
puremagic: Pure Python magic byte detection (primary)
mimetypes: Standard library extension-based fallback
LineSeparators: Byte-level line ending detection and normalization
- Future Extensibility
ADR-002 documents deferred architectural enhancements for future iterations:
Internal detector classes for configuration and testing
Consolidated result objects for multi-value operations
Plugin architecture for alternative detection backends
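To make the deferred ideas more concrete, the following sketch shows one possible shape for a consolidated result object and a pluggable backend interface. It is speculative: ADR-002 is not reproduced here, and `DetectionResultSketch`, `DetectorBackend`, `Utf8OnlyBackend`, and `detect_with_backends` are hypothetical names that do not exist in the library.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class DetectionResultSketch:
    """Hypothetical consolidated result for multi-value operations."""

    mimetype: str
    charset: Optional[str]


class DetectorBackend(Protocol):
    """Hypothetical plugin interface for alternative detection backends."""

    def detect(
        self, content: bytes, location: str
    ) -> Optional[DetectionResultSketch]:
        ...


class Utf8OnlyBackend:
    """Toy backend: recognizes only content that decodes as UTF-8."""

    def detect(self, content: bytes, location: str):
        try:
            content.decode('utf-8')
        except UnicodeDecodeError:
            return None
        return DetectionResultSketch('text/plain', 'utf-8')


def detect_with_backends(
    content: bytes, location: str, backends: Iterable[DetectorBackend]
) -> DetectionResultSketch:
    """Return the first backend result, else an assumed binary default."""
    for backend in backends:
        result = backend.detect(content, location)
        if result is not None:
            return result
    return DetectionResultSketch('application/octet-stream', None)
```

Keeping backends behind a small protocol would also let tests substitute a deterministic detector, which is one of the motivations the list above gives for internal detector classes.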