002. Detector Registry Specification

Overview

This document specifies the detector registry architecture for pluggable backend support in the detextive library. The registry system enables configurable detector precedence, graceful degradation with optional dependencies, and dynamic fallback strategies for robust detection across diverse environments.

The design follows established project practices for type aliases, interface contracts, and module organization while providing extensibility for third-party detection backends.

Registry Architecture

Core Registry Types

Detector Function Signatures

CharsetDetector: __.typx.TypeAlias = __.cabc.Callable[
    [ Content, Behaviors ],
    CharsetResult | __.types.NotImplementedType
]

MimetypeDetector: __.typx.TypeAlias = __.cabc.Callable[
    [ Content, Behaviors ],
    MimetypeResult | __.types.NotImplementedType
]

Registry Container Types

charset_detectors: __.accret.Dictionary[ str, CharsetDetector ]
mimetype_detectors: __.accret.Dictionary[ str, MimetypeDetector ]

Registry Contract Specifications: - Detectors return specific result types with confidence scoring - NotImplemented return value indicates missing optional dependency - Registry keys provide user-configurable detector ordering - Detector functions accept standardized parameters for consistent interfaces

Registry Registration Pattern

Dynamic Registration System

def _detect_via_chardet(
    content: Content, behaviors: Behaviors
) -> CharsetResult | __.types.NotImplementedType:
    ''' Detects charset using chardet library. '''
    try:
        from chardet import detect as _chardet_detect
    except ImportError:
        return NotImplemented

    # Detection implementation would follow here

def _detect_via_charset_normalizer(
    content: Content, behaviors: Behaviors
) -> CharsetResult | __.types.NotImplementedType:
    ''' Detects charset using charset-normalizer library. '''
    try:
        from charset_normalizer import from_bytes
    except ImportError:
        return NotImplemented

    # Detection implementation would follow here

# Registration at module initialization
charset_detectors[ 'chardet' ] = _detect_via_chardet
charset_detectors[ 'charset-normalizer' ] = _detect_via_charset_normalizer

Registration Design Principles: - Lazy import strategy with graceful ImportError handling - Consistent function signature across all detector implementations - Registry key naming matches common library names for intuitive configuration - Module-level registration enables import-time detector discovery

Optional Dependency Strategy

Graceful Degradation Pattern

NotImplemented Return Protocol

The registry system implements graceful degradation where: - Detectors return NotImplemented for missing optional dependencies - Registry iteration continues until successful detection - Exception raising occurs only when all configured detectors fail - User-configurable detector ordering enables fallback preferences

Configuration Integration

Behavior-Driven Detector Selection

class Behaviors( __.immut.DataclassObject ):
    ''' Configuration for detector registry usage. '''

    charset_detectors_order: __.typx.Annotated[
        __.cabc.Sequence[ str ],
        __.ddoc.Doc( ''' Order in which charset detectors are applied. ''' ),
    ] = ( 'chardet', 'charset-normalizer' )

    mimetype_detectors_order: __.typx.Annotated[
        __.cabc.Sequence[ str ],
        __.ddoc.Doc( ''' Order in which MIME type detectors are applied. ''' ),
    ] = ( 'magic', 'puremagic' )

Configuration Design Features: - User-configurable detector precedence through sequence ordering - Default ordering based on library reliability and performance characteristics - Runtime modification support for dynamic behavior adjustment - Validation ensures only registered detectors attempted

Multiple Backend Support

Charset Detection Backends

Supported Charset Libraries

# Standard charset detection backends
charset_detectors[ 'chardet' ]              # Statistical analysis, UTF-8 bias
charset_detectors[ 'charset-normalizer' ]   # Enhanced heuristics, multiple algorithms

Backend Characteristics: - chardet: Mature statistical analysis with proven UTF-8 bias handling - charset-normalizer: Enhanced detection algorithms with multiple confidence scoring

Registration Strategy: - Both libraries registered with graceful ImportError handling - Default ordering prioritizes chardet for proven reliability - User configuration enables alternative precedence based on use case requirements

MIME Type Detection Backends

Supported MIME Type Libraries

# MIME type detection backends
mimetype_detectors[ 'magic' ]      # python-magic (libmagic bindings)
mimetype_detectors[ 'puremagic' ]  # Pure Python magic byte detection

Backend Selection Strategy: - python-magic: Comprehensive magic byte database via libmagic - puremagic: Pure Python implementation for deployment simplicity - Fallback ordering ensures detection capability across diverse environments

Detection Priority Logic: - Primary detection via content analysis (magic bytes) - Secondary detection via filename extension analysis - Default MIME type assignment based on available context

Interface Contract Design

Detector Function Contracts

Standardized Parameters

def detector_function(
    content: Content,           # Raw byte content for analysis
    behaviors: Behaviors        # Configuration object with detection preferences
) -> DetectionResult | __.types.NotImplementedType:
    ''' Standard detector function signature. '''

Return Value Specifications: - Successful detection returns structured result with confidence scoring - Missing dependencies indicated by NotImplemented return value - Exception raising reserved for genuine detection failures - Result types provide consistent interface across all detection backends

Parameter Design Principles: - Wide parameter acceptance for maximum backend flexibility - Behavior-driven configuration enables detector-specific optimization - Content parameter accepts any bytes-like input for broad compatibility

Result Type Integration

Registry Return Value Contracts: - Successful detection returns CharsetResult or MimetypeResult (defined in API design) - Missing dependencies indicated by NotImplemented return value - Exception raising reserved for genuine detection failures - Confidence scoring enables quality-based selection among multiple results

Registry Architecture Summary

Key Design Features: - Pluggable backend system with standardized detector function signatures - Graceful degradation through NotImplemented return protocol - User-configurable detector precedence via Behaviors configuration - Support for multiple optional dependencies per detection type

Implementation Architecture: - Registry containers in detectors.py module - Type aliases for detector function signatures - Dynamic registration with import-time discovery - Registry-based dispatch in core detection functions