.. vim: set fileencoding=utf-8:
.. -*- coding: utf-8 -*-
.. +--------------------------------------------------------------------------+
   |                                                                          |
   | Licensed under the Apache License, Version 2.0 (the "License");          |
   | you may not use this file except in compliance with the License.         |
   | You may obtain a copy of the License at                                  |
   |                                                                          |
   |     http://www.apache.org/licenses/LICENSE-2.0                           |
   |                                                                          |
   | Unless required by applicable law or agreed to in writing, software      |
   | distributed under the License is distributed on an "AS IS" BASIS,        |
   | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
   | See the License for the specific language governing permissions and      |
   | limitations under the License.                                           |
   |                                                                          |
   +--------------------------------------------------------------------------+


*******************************************************************************
System Overview
*******************************************************************************

The **detextive** library consolidates text detection capabilities from
multiple packages into a faithful functional reproduction of their existing
behavior. The first iteration prioritizes behavioral fidelity and minimal
migration effort over architectural sophistication.

Major Components
===============================================================================

Core Detection Functions
-------------------------------------------------------------------------------

**Public Functional API**

Core detection and inference functions with confidence-aware behavior:

* ``detect_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  Character encoding detection
* ``detect_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  Charset detection with confidence scoring
* ``detect_mimetype(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  MIME type detection
* ``detect_mimetype_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  MIME type detection with confidence scoring
* ``infer_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  Charset inference with validation
* ``infer_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  Charset inference with confidence scoring
* ``infer_mimetype_charset(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  Combined MIME type and charset inference
* ``infer_mimetype_charset_confidence(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  Combined detection with confidence scoring
* ``decode(content, *, behaviors=BEHAVIORS_DEFAULT, ...)`` -
  High-level bytes-to-text decoding with validation
* ``is_textual_mimetype(mimetype)`` - Textual MIME type validation
* ``is_valid_text(text, profile=PROFILE_TEXTUAL)`` - Unicode-aware text
  validation

**Core Types and Configuration**

Shared data structures for confidence-aware behavior:

* ``CharsetResult(charset, confidence)`` - Charset detection results with
  confidence scoring (0.0-1.0)
* ``MimetypeResult(mimetype, confidence)`` - MIME type detection results with
  confidence scoring (0.0-1.0)
* ``Behaviors`` - Configurable detection behavior with confidence thresholds
  and failure handling
* ``BehaviorTristate`` - When to apply behaviors (Never/AsNeeded/Always)
* ``CodecSpecifiers`` - Dynamic codec resolution
  (FromInference/OsDefault/UserSupplement/etc.)
* ``DetectFailureActions`` - Failure handling strategy (Default/Error) for
  graceful degradation

**Text Validation System**

Unicode-aware text validation with configurable profiles:

* ``TextValidationProfile`` - Validation rules and character acceptance
  policies
* ``PROFILE_TEXTUAL`` - General textuality validation (lenient)
* ``PROFILE_TERMINAL_SAFE`` - Terminal output safety (strict)
* ``PROFILE_PRINTER_SAFE`` - Printer output safety (form feed allowed)

**Line Separator Processing**

Direct migration of proven enumeration and utilities:

* ``LineSeparators`` enum - Detection, normalization, and nativization methods

Component Relationships
===============================================================================

**v2.0 Layered Architecture**

.. code-block::

    ┌─────────────────────────────────────────────────┐
    │  Public API Layer (decoders.py)                 │
    │  decode() - High-level bytes-to-text function   │
    └─────────────────────────────────────────────────┘
                             │
    ┌─────────────────────────────────────────────────┐
    │  Inference Layer (inference.py)                 │
    │  infer_charset_confidence()  infer_mimetype()   │
    │  Context-aware orchestration + HTTP parsing     │
    └─────────────────────────────────────────────────┘
                             │
    ┌─────────────────────────────────────────────────┐
    │  Detection Layer (detectors.py)                 │
    │  detect_charset_confidence()  detect_mimetype() │
    │  Core detection with confidence scoring         │
    └─────────────────────────────────────────────────┘
                             │
    ┌─────────────────────────────────────────────────┐
    │  Support Modules (charsets.py, validation.py)   │
    │  Trial decoding + Text validation + MIME utils  │
    └─────────────────────────────────────────────────┘
                             │
    ┌─────────────────────────────────────────────────┐
    │  External Dependencies                          │
    │  chardet  charset-normalizer  puremagic         │
    │  python-magic  mimetypes (stdlib)  [optional]   │
    └─────────────────────────────────────────────────┘

**v2.0 Data Flow**

1. **Input Processing**: Functions receive byte content, behaviors
   configuration, optional default values, and HTTP/location context
2. **Registry-Based Detection**: Core detectors iterate through configured
   backends (chardet, charset-normalizer, puremagic, python-magic) returning
   CharsetResult/MimetypeResult objects with confidence scores
3. **Smart Decision Making**: Confidence thresholds drive AsNeeded behavior
   for trial decode and text validation
4. **Failure Handling**: DetectFailureActions configuration determines whether
   to return default values (graceful degradation) or raise exceptions
5. **Layered Inference**: Higher-level functions orchestrate detection,
   validation, and configurable error handling
6. **Validated Output**: Text validation ensures decoded content meets
   specified profiles for safety/quality

Integration Patterns
===============================================================================

**Drop-in Replacement Strategy**

Existing code can replace imports with minimal changes:

.. code-block:: python

    # Before:
    # from mimeogram.acquirers import _detect_charset
    # After:
    from detextive import detect_charset

    charset = detect_charset(content_bytes)

**Behavioral Fidelity**

Preserves exact existing behavior:

* UTF-8 bias with validation from mimeogram charset detection
* Extensible textual MIME type patterns from all implementations
* Fallback chains (puremagic → mimetypes) from mimeogram
* Complex parameter handling from ``detect_mimetype_and_charset``
* Heuristic validation from ``is_reasonable_text_content``
* Error handling and exception types maintained

**Implementation Strategy**

* Direct consolidation of proven function logic
* Minimal abstraction to preserve performance characteristics
* Same dependencies and detection libraries as existing implementations

Architectural Patterns
===============================================================================

**Faithful Functional Reproduction**

Direct consolidation of existing functional implementations without
architectural changes (see ADR-001).
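
As an illustration of the kind of behavior being reproduced, the
content-first, extension-second fallback chain listed under Behavioral
Fidelity might look roughly like the sketch below. It is not detextive's
implementation: the signature table is a hypothetical stand-in for a
magic-byte backend such as puremagic, and ``detect_mimetype_sketch`` and its
parameters are names invented for this example. Only the standard library
``mimetypes.guess_type`` call is real.

```python
import mimetypes
from typing import Optional

# Hypothetical stand-in for a magic-byte backend such as puremagic:
# a tiny table of well-known file signatures.
_MAGIC_SIGNATURES = (
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'%PDF-', 'application/pdf'),
    (b'GIF89a', 'image/gif'),
)


def _detect_by_content(content: bytes) -> Optional[str]:
    ''' Primary step: matches leading bytes against known signatures. '''
    for signature, mimetype in _MAGIC_SIGNATURES:
        if content.startswith(signature):
            return mimetype
    return None


def _detect_by_location(location: Optional[str]) -> Optional[str]:
    ''' Fallback step: guesses from the file extension via the stdlib. '''
    if location is None:
        return None
    mimetype, _encoding = mimetypes.guess_type(location)
    return mimetype


def detect_mimetype_sketch(
    content: bytes,
    location: Optional[str] = None,
    default: str = 'application/octet-stream',
) -> str:
    ''' Tries content detection, then the extension, then the default. '''
    return (
        _detect_by_content(content)
        or _detect_by_location(location)
        or default)
```

Content evidence wins over the file name in this sketch, so a PDF saved as
``notes.txt`` is still reported as ``application/pdf``; the extension is only
consulted when the bytes are inconclusive.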

**Consolidation Pattern**

Multiple implementations merged into single functions:

* **chardet**: Statistical charset detection with UTF-8 bias
* **puremagic**: Pure Python magic byte detection (primary)
* **mimetypes**: Standard library extension-based fallback
* **LineSeparators**: Byte-level line ending detection and normalization

**v2.0 Evolution**

ADR-003 and ADR-006 document the context-aware detection architecture for v2.0
that addresses real-world integration challenges:

* Context-driven detection utilizing HTTP headers, location, and content
  analysis
* Confidence-based result types with specific CharsetResult/MimetypeResult
  objects
* Configurable validation behaviors for performance and security requirements
* Default return behavior pattern enabling graceful degradation for detection
  failures
* Enhanced function interfaces maintaining backward compatibility

**Detector Registry Architecture**

ADR-002 documents the implemented pluggable backend system:

* Dynamic detector registration with type aliases for
  CharsetDetector/MimetypeDetector functions
* Configurable detector precedence via Behaviors.charset_detectors_order and
  mimetype_detectors_order
* Graceful degradation with NotImplemented return pattern for missing optional
  dependencies
* Registry dictionaries (charset_detectors, mimetype_detectors) enabling
  runtime backend selection