003. Context-Aware Detection Architecture v2.0

Status

Accepted

Context

Real-world integration analysis from downstream packages (librovore) revealed fundamental limitations in the v1.x functional API that create significant integration burden. The primary integration pain points identified include:

Redundant Detection Operations: Comprehensive detection workflows require multiple function calls that perform overlapping content analysis (detect_mimetype_and_charset + is_textual_content), adding both performance overhead and code complexity.

Context Loss: The current API cannot use available HTTP headers, forcing downstream packages to implement custom fallback logic that duplicates detection functionality.

Validation Rigidity: Callers cannot control which validations run or when they run, leading to unnecessary computational work and inappropriate error handling for specific use cases.

These limitations violate the core product requirement (REQ-005) of providing drop-in replacement interfaces that minimize migration effort. The current functional reproduction approach successfully consolidated duplicate implementations but created new integration friction for context-rich environments.

Decision

For v2.0, we will implement a Context-Aware Detection Architecture that addresses real-world integration challenges while maintaining backward compatibility through enhanced implementations of the existing functions.

Core Architectural Components:

Enhanced Function Interface:

  • detect_charset(content, /, *, behaviors=default, default=absent, mimetype=absent, location=absent) - Enhanced charset detection with configurable behaviors

  • infer_mimetype_charset(content, /, *, behaviors=default, http_content_type=absent, location=absent, ...) - Primary combined detection with HTTP context support

  • detect_mimetype(content, /, *, behaviors=default, charset=absent, location=absent) - Focused MIME type detection

Context-Driven Detection Strategy:

  • HTTP Content-Type headers processed first when available via http_content_type parameter

  • Location/filename extension analysis as secondary fallback

  • Magic bytes content analysis as final fallback

  • Detection methods selected automatically based on available context and Behaviors configuration
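The fallback order above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: infer_mimetype_sketch, parse_http_content_type, and the _MAGIC_PREFIXES table are hypothetical names, None stands in for the absent sentinel, and the Behaviors configuration is left out.

```python
import mimetypes
from email.message import Message
from typing import Optional, Tuple

# Hypothetical magic-byte table; a real implementation would delegate to a
# dedicated detection library rather than a hand-written list.
_MAGIC_PREFIXES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
)


def parse_http_content_type(header: str) -> Optional[str]:
    """Return the MIME type from a Content-Type header, or None if unusable."""
    if "/" not in header.split(";", 1)[0]:
        return None
    message = Message()
    message["Content-Type"] = header
    return message.get_content_type()


def infer_mimetype_sketch(
    content: bytes,
    /,
    *,
    http_content_type: Optional[str] = None,
    location: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Return (mimetype, source), trying the richest context first."""
    # 1. HTTP Content-Type header, when the caller has one.
    if http_content_type:
        mimetype = parse_http_content_type(http_content_type)
        if mimetype:
            return mimetype, "http"
    # 2. Filename or URL extension.
    if location:
        guessed, _encoding = mimetypes.guess_type(location)
        if guessed:
            return guessed, "location"
    # 3. Magic bytes at the start of the content.
    for prefix, mimetype in _MAGIC_PREFIXES:
        if content.startswith(prefix):
            return mimetype, "magic"
    return None, "none"
```

Returning the source alongside the MIME type is only a device for this sketch, to show which stage resolved the answer; the header wins even when the content bytes would suggest a different type.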

Configurable Validation Behaviors:

  • Behaviors dataclass controls validation execution (trial_decode, validate_printable)

  • printable_threshold parameter for character validation tolerance

  • Conditional execution prevents unnecessary validation overhead
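One way such a configuration object could gate validation is sketched below. The field names trial_decode, validate_printable, and printable_threshold come from this record; the When enumeration, its members other than the as-needed mode, the 0.8 confidence cutoff, and the reading of printable_threshold as a maximum fraction of non-printable characters are all assumptions made for illustration.

```python
import enum
from dataclasses import dataclass


class When(enum.Enum):
    """Hypothetical tristate controlling whether a validation runs."""
    NEVER = "never"
    AS_NEEDED = "as-needed"
    ALWAYS = "always"


@dataclass(frozen=True)
class Behaviors:
    trial_decode: When = When.AS_NEEDED
    validate_printable: When = When.AS_NEEDED
    # Assumed meaning: largest tolerated fraction of non-printable characters.
    printable_threshold: float = 0.0


def _should_run(when: When, confidence: float, cutoff: float = 0.8) -> bool:
    """AS_NEEDED runs a check only when detection confidence is low."""
    if when is When.ALWAYS:
        return True
    if when is When.NEVER:
        return False
    return confidence < cutoff


def passes_validation(
    content: bytes,
    charset: str,
    confidence: float,
    behaviors: Behaviors = Behaviors(),
) -> bool:
    text = None
    if _should_run(behaviors.trial_decode, confidence):
        try:
            text = content.decode(charset)
        except (UnicodeDecodeError, LookupError):
            return False
    if _should_run(behaviors.validate_printable, confidence):
        if text is None:
            try:
                text = content.decode(charset, errors="replace")
            except LookupError:
                return False
        if text:
            bad = sum(
                1 for ch in text if not (ch.isprintable() or ch in "\t\r\n")
            )
            if bad / len(text) > behaviors.printable_threshold:
                return False
    return True
```

With the defaults, a high-confidence detection skips both checks entirely, which is the conditional execution described above; a caller who wants strict checking can pass Behaviors(trial_decode=When.ALWAYS, validate_printable=When.ALWAYS).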

Confidence-Based Result Types:

  • CharsetResult(charset, confidence) for charset detection results

  • MimetypeResult(mimetype, confidence) for MIME type detection results

  • Confidence scoring enables AsNeeded behavior and quality assessment
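A lightweight way to model these results is a pair of named tuples, which keeps plain tuple unpacking available to callers. The sketch below assumes that representation; the helpers most_confident and is_trustworthy, the 0.8 cutoff, and the optional charset field are hypothetical and only illustrate how a confidence score might be consumed.

```python
from typing import NamedTuple, Optional


class CharsetResult(NamedTuple):
    charset: Optional[str]  # assumed optional, e.g. for binary content
    confidence: float


class MimetypeResult(NamedTuple):
    mimetype: str
    confidence: float


def most_confident(*candidates):
    """Pick the candidate result with the highest confidence."""
    return max(candidates, key=lambda result: result.confidence)


def is_trustworthy(result, cutoff: float = 0.8) -> bool:
    """Hypothetical quality gate: skip extra validation above the cutoff."""
    return result.confidence >= cutoff


# Results still unpack like ordinary tuples.
charset, confidence = CharsetResult("utf-8", 0.93)
```

A caller combining several detectors could pass their results to most_confident and run further validation only when is_trustworthy returns False.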

Backward Compatibility Strategy:

  • Existing v1.x functions enhanced with new capabilities while preserving signatures

  • No breaking changes to current function behavior

  • Enhanced capabilities available through optional parameters

Alternatives

Comprehensive Detection Result Object

  • Benefits: Single detection call returns structured result with metadata

  • Drawbacks: Heavy-weight object for simple use cases, complex field interpretation

  • Rejection Reason: Over-engineering for typical workflows requiring simple tuple returns

Plugin Architecture in v2.0

  • Benefits: Maximum extensibility, support for alternative detection backends

  • Drawbacks: Significant complexity increase, premature optimization

  • Rejection Reason: Architectural scope too large, deferred to future iteration

Separate v2.0 Package

  • Benefits: Clean API design without backward compatibility constraints

  • Drawbacks: Ecosystem fragmentation, migration complexity

  • Rejection Reason: Violates consolidation goal, creates maintenance burden

Function Overload Pattern

  • Benefits: Multiple function signatures for different use cases

  • Drawbacks: Python typing complexity, unclear function selection

  • Rejection Reason: Less maintainable than optional parameters with clear defaults

Consequences

Positive Consequences

  • Unified Detection: Single function calls provide comprehensive detection with confidence scoring

  • Context Fusion: Single detection call leverages all available context (HTTP headers, location, content)

  • Performance Optimization: Conditional validation prevents unnecessary computational overhead

  • Backward Compatibility: Existing code continues working with enhanced capabilities

  • Integration Simplification: Common integration patterns require minimal code

Negative Consequences

  • Interface Complexity: Additional optional parameters increase cognitive load

  • Implementation Complexity: Context-driven detection requires sophisticated internal logic

  • Testing Matrix: Behaviors combinations create large test space requiring systematic coverage

  • Documentation Overhead: Enhanced capabilities require comprehensive usage documentation

Neutral Consequences

  • Migration Timeline: v2.0 represents significant architectural evolution requiring careful migration planning

  • Dependency Evolution: May enable future upgrade of detection libraries (charset-normalizer)

  • Plugin Foundation: Architecture provides foundation for future plugin system without committing to implementation

Implementation Implications

  • Focus on context-driven detection logic that automatically selects appropriate methods

  • Implement detector registry system with configurable backend precedence

  • Design Behaviors dataclass for intuitive validation control and detector ordering

  • Maintain strict backward compatibility through enhanced function implementations

  • Create comprehensive test suite covering behavior combinations and context scenarios

  • Document migration patterns for common integration scenarios

Integration with Existing Architecture

This decision resolves the limitations identified in ADR-002 by providing a concrete v2.0 architecture that meets real-world integration needs while maintaining the functional API paradigm established in ADR-001. The context-aware approach extends the faithful reproduction principle to include context utilization and configurable behaviors without breaking existing usage patterns.