Processor Detection System Design

Overview

The processor detection system provides automated selection of appropriate inventory and structure processors for documentation sources. The design implements confidence-based scoring with TTL-based caching to balance performance with accuracy and data freshness.

Architecture

Design Principles

Genus-Based Separation

Inventory processors and structure processors operate in separate detection pipelines, allowing independent evolution and different selection criteria. Each genus maintains its own cache and processor registry.

Confidence-Based Selection

Processors return numerical confidence scores (0.0-1.0). Only processors exceeding CONFIDENCE_THRESHOLD_MINIMUM (0.5) are considered; the highest-confidence processor is selected, with registration order as a stable tiebreaker.

Immutable Data Structures

All detection results use immutable containers (__.immut.Dictionary, tuple) following project practices for thread safety and predictable behavior.

Wide Parameter, Narrow Return Pattern

Public functions accept abstract base classes for parameters and return specific concrete types, following established project practices.

Component Structure

Detection Orchestration (detection.py)

Central coordination of processor selection across inventory and structure genera. Provides both high-level convenience functions and low-level extensible functions for custom processor mappings.

Cache Management

TTL-based caching system with lazy expiration cleanup. Separate cache instances per processor genus enable different configuration and evolution patterns.

Processor Integration

Abstract base classes in processors.py define detection contracts. Format-specific implementations in inventories/ and structures/ subpackages provide concrete detection logic.

Interface Specifications

Primary Detection Functions

async def detect(
    auxdata: _state.Globals,
    source: str, /,
    genus: _interfaces.ProcessorGenera, *,
    processor_name: __.Absential[ str ] = __.absent,
) -> _processors.Detection

async def detect_inventory(
    auxdata: _state.Globals,
    source: str, /, *,
    processor_name: __.Absential[ str ] = __.absent,
) -> _processors.InventoryDetection

async def detect_structure(
    auxdata: _state.Globals,
    source: str, /, *,
    processor_name: __.Absential[ str ] = __.absent,
) -> _processors.StructureDetection

Contract:

- Returns highest-confidence processor detection above threshold
- Raises ProcessorInavailability if no suitable processor found
- Bypasses detection when specific processor_name provided
- Maintains detection results in genus-specific cache

Cache Access Functions

async def access_detections(
    auxdata: _state.Globals,
    source: str, /, *,
    genus: _interfaces.ProcessorGenera
) -> tuple[
    _processors.DetectionsByProcessor,
    __.Absential[ _processors.Detection ]
]

async def access_detections_ll(
    auxdata: _state.Globals,
    source: str, /, *,
    cache: DetectionsCache,
    processors: __.cabc.Mapping[ str, _processors.Processor ],
) -> tuple[
    _processors.DetectionsByProcessor,
    __.Absential[ _processors.Detection ]
]

Contract:

- Returns all processor detections plus optimal selection
- Executes fresh detection if cache miss or expiration
- Low-level variant accepts arbitrary processor mapping for extensibility
- Never raises exceptions; returns __.absent for missing optimal detection

Data Structures

Detection Cache Design

class DetectionsCacheEntry( __.immut.DataclassObject ):
    detections: __.cabc.Mapping[ str, _processors.Detection ]
    timestamp: float
    ttl: int

    @property
    def detection_optimal( self ) -> __.Absential[ _processors.Detection ]

    def invalid( self, current_time: float ) -> bool

class DetectionsCache( __.immut.DataclassObject ):
    ttl: int = 3600
    _entries: dict[ str, DetectionsCacheEntry ] = __.dcls.field(
        default_factory = dict[ str, DetectionsCacheEntry ] )

    def access_detections(
        self, source: str
    ) -> __.Absential[ _processors.DetectionsByProcessor ]

    def access_detection_optimal(
        self, source: str
    ) -> __.Absential[ _processors.Detection ]

    def add_entry(
        self, source: str, detections: _processors.DetectionsByProcessor
    ) -> __.typx.Self

Design Features:

- TTL-based expiration with configurable timeouts per cache instance
- Lazy cleanup on access operations to minimize overhead
- Pre-computed optimal selection stored in cache entries
- Method chaining support through __.typx.Self returns
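The cache behavior listed above can be illustrated with a minimal sketch. It is not the project implementation: it uses standard-library dataclasses in place of __.immut.DataclassObject, None in place of __.absent, and an explicit `now` argument (an assumption made so the example is deterministic). The names `CacheEntry`, `Cache`, and the simplified `Detection` are hypothetical stand-ins, and the `>=` threshold comparison is assumed from the selection algorithm.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Mapping, Optional

CONFIDENCE_THRESHOLD_MINIMUM = 0.5


@dataclass(frozen=True)
class Detection:
    ''' Simplified stand-in for _processors.Detection. '''
    processor_name: str
    confidence: float


@dataclass(frozen=True)
class CacheEntry:
    ''' Immutable snapshot of one detection run for one source. '''
    detections: Mapping[str, Detection]
    timestamp: float
    ttl: int
    optimal: Optional[Detection]  # pre-computed when the entry is created

    def invalid(self, current_time: float) -> bool:
        # Assumption: an entry expires once its age exceeds its TTL.
        return current_time - self.timestamp > self.ttl


@dataclass
class Cache:
    ttl: int = 3600
    _entries: dict[str, CacheEntry] = field(default_factory=dict)

    def add_entry(
        self, source: str, detections: Mapping[str, Detection], now: float
    ) -> Cache:
        candidates = [
            d for d in detections.values()
            if d.confidence >= CONFIDENCE_THRESHOLD_MINIMUM]
        # max() keeps the first maximal item, so mapping order breaks ties.
        optimal = max(candidates, key=lambda d: d.confidence, default=None)
        self._entries[source] = CacheEntry(
            dict(detections), now, self.ttl, optimal)
        return self  # enables method chaining

    def access_detection_optimal(
        self, source: str, now: float
    ) -> Optional[Detection]:
        entry = self._entries.get(source)
        if entry is None:
            return None
        if entry.invalid(now):
            del self._entries[source]  # lazy cleanup on access
            return None
        return entry.optimal
```

Because the optimal detection is computed once in `add_entry`, repeated lookups do no scoring work; the only per-access cost is the expiry check.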

Type Aliases

DetectionsByProcessor: __.typx.TypeAlias = __.cabc.Mapping[
    str, _processors.Detection ]

Purpose: Provides semantic clarity for function signatures and return types while maintaining wide parameter acceptance patterns.

Behavioral Contracts

Processor Selection Contract

Selection Algorithm:

1. Execute all processors in genus-specific registry on source
2. Filter results to confidence >= CONFIDENCE_THRESHOLD_MINIMUM (0.5)
3. Select highest confidence; use registration order for ties
4. Return __.absent if no processors meet confidence threshold

Error Handling:

- Individual processor detection failures are logged but not propagated
- Failed processors are excluded from selection consideration
- Selection continues with remaining successful processors
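The selection steps and error handling above can be sketched as follows. This is a simplified, synchronous illustration rather than the project's async code: processors are modeled as plain callables, None stands in for __.absent, and `select_detection` and the simplified `Detection` are hypothetical names.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

CONFIDENCE_THRESHOLD_MINIMUM = 0.5
_logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class Detection:
    ''' Simplified stand-in for _processors.Detection. '''
    processor_name: str
    confidence: float


def select_detection(
    processors: Mapping[str, Callable[[str], Detection]], source: str
) -> Optional[Detection]:
    detections = []
    # Step 1: run every registered processor, in registration order.
    for name, detect in processors.items():
        try:
            detections.append(detect(source))
        except Exception:
            # Failures are logged and the processor is skipped.
            _logger.exception(
                "Detection by %r failed for %r.", name, source)
    # Step 2: keep only results that meet the minimum confidence.
    candidates = [
        d for d in detections
        if d.confidence >= CONFIDENCE_THRESHOLD_MINIMUM]
    # Step 4: nothing qualified.
    if not candidates:
        return None
    # Step 3: max() returns the first maximal item, so earlier
    # registration wins ties.
    return max(candidates, key=lambda d: d.confidence)
```

The tiebreak relies on two guarantees: mappings such as dict preserve insertion order, and `max` returns the first of several equal maxima.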

Cache Management Contract

Cache Population:

- Fresh detection triggered on cache miss or TTL expiration
- All genus processors executed on fresh detection; parallel execution is a future enhancement
- Results cached regardless of optimal selection success

Cache Access:

- Thread-safe read operations using immutable data structures
- Expired entries removed lazily on access
- Missing or expired entries trigger fresh processor execution

TTL Management:

- Configurable per-cache instance (default: 3600 seconds)
- Based on cache entry creation timestamp
- Independent expiration per source URL
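The population and access rules above amount to a get-or-refresh flow, sketched here under simplifying assumptions: the cache is a plain dict of (timestamp, detections) tuples, detections map processor names to confidence scores, and the clock is injectable for testing. `access_cached` and `detect_all` are hypothetical names, not project functions.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Mapping, Tuple

Detections = Mapping[str, float]  # processor name -> confidence


async def access_cached(
    cache: Dict[str, Tuple[float, Detections]],
    source: str,
    detect_all: Callable[[str], Awaitable[Detections]],
    ttl: float = 3600,
    clock: Callable[[], float] = time.monotonic,
) -> Detections:
    now = clock()
    entry = cache.get(source)
    if entry is not None:
        timestamp, detections = entry
        if now - timestamp <= ttl:
            return detections  # fresh hit, no processor runs
        del cache[source]  # expired: removed lazily, on access
    # Miss or expiry: run fresh detection.
    detections = await detect_all(source)
    # Cached even when empty, so a source with no suitable processor
    # is not re-probed until the TTL elapses.
    cache[source] = (now, detections)
    return detections
```

Each source carries its own timestamp, so entries expire independently of one another.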

Extension Points

Processor Genus Extension

Adding New Processor Types:

1. Extend ProcessorGenera enumeration in interfaces.py
2. Add genus-specific cache instance in detection.py
3. Update genus dispatch in access_detections function
4. Register processors in genus-specific registry

Processor Implementation Requirements:

- Implement detect method returning confidence-scored Detection
- Handle detection failures gracefully (should not raise exceptions)
- Return confidence score in range 0.0-1.0
- Provide processor capabilities metadata
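A processor meeting the implementation requirements above might look like the following sketch. The abstract base class here is a simplified assumption, not the contract defined in processors.py, and `ProbingProcessor`, its `probe` callable, and the shape of the capabilities mapping are all hypothetical.

```python
import abc
import asyncio
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Detection:
    ''' Simplified stand-in for _processors.Detection. '''
    processor_name: str
    confidence: float


class Processor(abc.ABC):
    ''' Assumed minimal detection contract. '''
    name: str

    @property
    @abc.abstractmethod
    def capabilities(self) -> Mapping[str, object]:
        ''' Describes what the processor supports. '''

    @abc.abstractmethod
    async def detect(self, source: str) -> Detection:
        ''' Scores how well this processor can handle the source. '''


class ProbingProcessor(Processor):
    ''' Wraps a scoring function that may fail or overshoot. '''

    def __init__(self, name: str, probe: Callable[[str], float]):
        self.name = name
        self._probe = probe

    @property
    def capabilities(self) -> Mapping[str, object]:
        return {'formats': ('demo',)}  # illustrative metadata only

    async def detect(self, source: str) -> Detection:
        try:
            score = self._probe(source)
        except Exception:
            # Graceful failure: report zero confidence, do not raise.
            score = 0.0
        # Clamp into the documented 0.0-1.0 range.
        return Detection(self.name, min(1.0, max(0.0, score)))
```

Reporting a failure as zero confidence keeps the processor below the selection threshold without involving the caller's error handling.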

Cache Strategy Extension

Custom Cache Implementations:

- DetectionsCache interface supports alternative implementations
- Size-based eviction strategies can be added via subclassing
- Different TTL strategies per processor type or source pattern
- External cache stores (Redis, etc.) through interface compliance
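As one illustration of size-based eviction via subclassing, the sketch below adds least-recently-used eviction to a minimal base cache. `SimpleCache` and `BoundedCache` are hypothetical classes written for this example; they mirror the method names of DetectionsCache but omit TTL handling and return None rather than __.absent.

```python
from collections import OrderedDict
from typing import Mapping, Optional


class SimpleCache:
    ''' Minimal stand-in for the cache interface (no TTL handling). '''

    def __init__(self) -> None:
        self._entries: 'OrderedDict[str, Mapping[str, float]]' = (
            OrderedDict())

    def add_entry(
        self, source: str, detections: Mapping[str, float]
    ) -> 'SimpleCache':
        self._entries[source] = detections
        return self

    def access_detections(
        self, source: str
    ) -> Optional[Mapping[str, float]]:
        return self._entries.get(source)


class BoundedCache(SimpleCache):
    ''' Evicts the least recently used source beyond a fixed capacity. '''

    def __init__(self, capacity: int = 128) -> None:
        super().__init__()
        self._capacity = capacity

    def add_entry(self, source, detections):
        super().add_entry(source, detections)
        self._entries.move_to_end(source)  # newest is most recent
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the oldest
        return self

    def access_detections(self, source):
        if source in self._entries:
            self._entries.move_to_end(source)  # a read refreshes recency
        return super().access_detections(source)
```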

Performance Optimization:

- Parallel processor execution via async fanout (marked TODO)
- Processor-specific timeout configuration
- Cache warming strategies for frequently accessed sources
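One possible shape for the async fanout with timeouts is sketched below using asyncio.gather and asyncio.wait_for from the standard library. It is an assumption about how the TODO could be approached, not the planned implementation: detectors are modeled as coroutines returning a confidence score, a single timeout applies to all of them, and `detect_all` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Mapping


async def detect_all(
    detectors: Mapping[str, Callable[[str], Awaitable[float]]],
    source: str,
    timeout: float = 5.0,
) -> Dict[str, float]:
    names = list(detectors)
    # Start every detector concurrently, each bounded by the timeout.
    results = await asyncio.gather(
        *(asyncio.wait_for(detectors[name](source), timeout)
          for name in names),
        # Failures and timeouts come back as values instead of raising.
        return_exceptions=True)
    # gather preserves argument order, so registration order survives
    # in the resulting dict.
    return {
        name: result
        for name, result in zip(names, results)
        if not isinstance(result, BaseException)}
```

With `return_exceptions=True`, one slow or failing processor cannot abort the others, which matches the rule that selection continues with the remaining successful processors.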

Error Handling Design

Exception Hierarchy

Current Exceptions:

- ProcessorInavailability: No processor found above confidence threshold
- Individual processor failures are caught and logged, not propagated

Recommended Future Enhancements:

class DetectionFailure( Omnierror, RuntimeError ):
    ''' Processor detection operation failed. '''

    def __init__(
        self, source: str, genus: str, processor_errors: __.cabc.Mapping[ str, Exception ]
    )

class ProcessorInavailability( Omnierror, RuntimeError ):
    ''' No processor found to handle source. '''

    def __init__(
        self, source: str, genus: str, attempted_processors: __.cabc.Sequence[ str ]
    )

Error Recovery Strategies

Processor Failure Recovery:

- Continue selection with remaining functional processors
- Log processor-specific errors for debugging
- Maintain detection attempts in cache for diagnostic purposes

Cache Failure Recovery:

- Fresh detection execution on cache corruption or errors
- Graceful degradation to uncached operation
- Error logging with cache rebuild capability

Design Trade-offs

Performance vs. Accuracy

Caching Trade-offs:

- Advantage: Significant performance improvement for repeated source access
- Advantage: Reduces external service load (HTTP requests, file system)
- Disadvantage: Cached results may become stale for dynamic documentation
- Mitigation: Configurable TTL values balance freshness vs. performance

Confidence Threshold Trade-offs:

- Advantage: Prevents selection of unreliable processors
- Advantage: Consistent, objective selection criteria
- Disadvantage: Fixed threshold may not suit all processor types
- Future Enhancement: Processor-specific or adaptive thresholds

Memory vs. Functionality

Cache Memory Trade-offs:

- Advantage: Fast access to detection results without re-execution
- Disadvantage: Memory usage grows with unique source URLs
- Mitigation: TTL-based expiration provides bounded memory usage
- Future Enhancement: Size-based LRU eviction strategies

Immutability Trade-offs:

- Advantage: Thread-safe cache access without locking
- Advantage: Predictable behavior and easier debugging
- Disadvantage: Higher memory usage than mutable alternatives
- Assessment: Acceptable trade-off for architectural benefits

Error Handling Evolution

Current Error Handling State

Exception Design: The current system uses a single ProcessorInavailability exception, raised when no processor exceeds the confidence threshold. The exception provides minimal context: it carries only a generic identifier (the genus name) and no detail about why detection failed.

Current Error Flow:

1. Detection functions attempt processor selection
2. Failed detection raises ProcessorInavailability( genus_name )
3. CLI and MCP interfaces format generic error messages
4. Users receive non-specific guidance regardless of failure cause

Current Error Messages:

- "No processor found to handle source: inventory"
- "No processor found to handle source: structure"
- "Cannot access documentation inventory: {source}"

Limitations:

- No distinction between genus-specific failure modes
- No actionable guidance for common URL pattern issues
- Duplicate error formatting logic across interfaces
- No automatic recovery for common documentation site patterns

Desired Error Handling State

Enhanced Exception Design: The ProcessorInavailability exception must be enhanced to provide additional context for better error messaging. The exception should indicate the processor genus (inventory vs structure), categorize the type of error encountered, and track whether URL pattern extension was attempted. These enhancements must maintain backward compatibility with existing exception handling code.
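One way to add context while keeping existing call sites working is to introduce keyword-only fields with defaults, as in this sketch. The field names `genus`, `category`, and `url_patterns_attempted` are hypothetical, the Omnierror base class is omitted so the example is self-contained, and the message format is assumed from the current error messages.

```python
from typing import Optional


class ProcessorInavailability(RuntimeError):
    ''' No processor found to handle source. '''

    def __init__(
        self,
        subject: str,
        *,
        genus: Optional[str] = None,
        category: Optional[str] = None,
        url_patterns_attempted: bool = False,
    ) -> None:
        # The positional argument and message are unchanged, so
        # existing raise sites and handlers keep working.
        super().__init__(
            f"No processor found to handle source: {subject}")
        self.subject = subject
        # New context is keyword-only and optional.
        self.genus = genus
        self.category = category
        self.url_patterns_attempted = url_patterns_attempted
```

Existing code that raises `ProcessorInavailability( genus_name )` or catches it by type is unaffected; new code can opt in to the extra fields.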

Enhanced Error Messages:

- Inventory Detection: "No compatible inventory format detected at this documentation source"
- URL Pattern Intelligence: "No inventory found - attempted common URL patterns"
- Enhanced Accessibility: Context-aware guidance based on error type
- Genus Clarity: Clear distinction between inventory and structure failures

Automatic URL Pattern Extension: The detection system must automatically attempt common documentation URL patterns when base URL detection fails. The system should try standard documentation site patterns (such as /en/latest/, /latest/, /main/) before reporting failure. This capability should be applied primarily to inventory detection, where URL patterns are more standardized.
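Candidate generation for the URL pattern extension could be sketched as follows, using urllib.parse from the standard library. The function names `candidate_urls` and `resolve`, the `probe` callable, and the rule of skipping extension when the URL already ends in a known pattern are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urlsplit, urlunsplit

# Patterns named in the design; the order is an assumption.
URL_PATTERNS = ('/en/latest/', '/latest/', '/main/')


def candidate_urls(base_url: str) -> Tuple[str, ...]:
    ''' Returns the base URL followed by pattern-extended variants. '''
    parts = urlsplit(base_url)
    root = parts.path.rstrip('/')
    # Do not extend a URL which already ends with a known pattern.
    if any(root.endswith(p.rstrip('/')) for p in URL_PATTERNS):
        return (base_url,)
    extended = tuple(
        urlunsplit(parts._replace(path=root + pattern))
        for pattern in URL_PATTERNS)
    return (base_url,) + extended


def resolve(
    base_url: str, probe: Callable[[str], bool]
) -> Optional[str]:
    ''' Returns the first candidate accepted by the probe, if any. '''
    for url in candidate_urls(base_url):
        if probe(url):
            return url
    return None
```

The base URL is always tried first, so sources that already work incur no extra probes.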

Functions Layer Error Response Design: The functions layer must catch processor detection exceptions and return structured error responses that contain pre-formatted, user-friendly error information. This eliminates the need for interface layers to interpret raw exceptions or format error messages. The structured responses should include error type, user-friendly titles, detailed messages, and actionable suggestions.

Cache Integration: When URL pattern extension discovers working URLs, detection cache entries are updated to use the successful URL for future requests, improving performance and user experience.

Interface Layer Simplification: CLI and MCP layers must be simplified to become response formatters that extract pre-formatted error information from functions layer responses. Interface layers should not interpret raw exceptions or generate error messages. Instead, they extract structured error information and apply appropriate display formatting for their respective interfaces.
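The division of labor between the functions layer and the interface layers might look like this sketch. All names (`ErrorResponse`, `detect_or_describe`, `render_cli`), the error type string, and the title and suggestion wording are hypothetical; the exception class is a local stand-in, and only the inventory message text comes from the enhanced messages listed earlier.

```python
from dataclasses import dataclass
from typing import Callable, Union


class ProcessorInavailability(RuntimeError):
    ''' Local stand-in for the project exception. '''

    def __init__(self, genus: str) -> None:
        super().__init__(f"No processor found to handle source: {genus}")
        self.genus = genus


@dataclass(frozen=True)
class ErrorResponse:
    ''' Pre-formatted error information for interface layers. '''
    error_type: str
    title: str
    message: str
    suggestion: str


def detect_or_describe(
    detect: Callable[[str], object], source: str
) -> Union[object, ErrorResponse]:
    ''' Functions layer: converts the exception into a response. '''
    try:
        return detect(source)
    except ProcessorInavailability as exc:
        return ErrorResponse(
            error_type='processor_unavailable',
            title=f"No {exc.genus} processor available",
            # Wording for genera other than inventory is extrapolated.
            message=(
                f"No compatible {exc.genus} format detected "
                "at this documentation source"),
            suggestion=f"Check that {source} is a documentation root URL.")


def render_cli(response: ErrorResponse) -> str:
    ''' Interface layer: applies display formatting only. '''
    return (
        f"{response.title}: {response.message}\n"
        f"Suggestion: {response.suggestion}")
```

An MCP formatter would consume the same `ErrorResponse` fields and differ only in output shape, which removes the duplicated formatting logic.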

Implementation Phases:

1. Functions Layer Error Response Design: Implement structured error responses in functions layer
2. Automatic URL Patterns: Implement intelligent URL extension for inventory detection
3. Enhanced Exception Context: Add contextual fields to ProcessorInavailability
4. URL Pattern Detection: Add utilities for documentation site pattern recognition

This detection system design provides robust, extensible automated processor selection while maintaining clean architectural boundaries and complying with established project practices.