Inventory Processors Architecture¶
Overview¶
Inventory processors extract and provide object inventories from documentation
sources, enabling discovery and search operations across different documentation
formats. These processors form the foundation of librovore’s inventory-based
architecture, converting format-specific inventory data into universal
InventoryObject instances.
Role in librovore architecture: Inventory processors serve as the primary interface between external documentation sources and librovore’s search and discovery operations. They enable format-agnostic inventory operations while maintaining complete source attribution and metadata preservation.
Relationship to structure processors: Inventory processors discover and enumerate documentation objects, while structure processors extract content from those objects. The two processor types work together through capability-based filtering to ensure inventory objects are only sent to compatible structure processors.
Universal object interface principles: All inventory processors return
InventoryObject instances regardless of source format, providing type safety,
consistent search operations, and multi-source aggregation capabilities. The
universal interface isolates format differences within processor implementations
while enabling uniform operations across all inventory types.
Architecture Patterns¶
Universal Inventory Object Interface¶
Decision: All inventory processors return InventoryObject instances
rather than format-specific dictionaries.
Rationale: Provides type safety, enables consistent search operations, and supports multi-source aggregation capabilities. The universal interface isolates format differences within processor implementations.
Impact: Processors become responsible for complete source attribution and metadata normalization, while search and ranking operations work uniformly across all inventory types.
The universal interface follows a consistent dataflow pattern across all processor types:
External Inventory Source
│
▼
┌─────────────────────┐
│ Detection Phase │ ◄─── Confidence scoring
└─────────────────────┘ URL derivation
│
▼
┌─────────────────────┐
│ Loading Phase │ ◄─── Raw data retrieval
└─────────────────────┘ Format validation
│
▼
┌─────────────────────┐
│ Transformation │ ◄─── Format-specific parsing
│ Phase │ Universal object creation
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Filtering Phase │ ◄─── Criteria application
└─────────────────────┘ Results ranking
│
▼
Universal InventoryObject Collection
Source Attribution Strategy¶
Decision: Every inventory object includes complete provenance information including processor type, location URL, and format-specific metadata.
Rationale: Enables debugging, caching optimization, and future multi-source operations. Complete attribution allows the system to understand object origins without maintaining separate tracking mechanisms.
Impact: Processors must provide consistent metadata extraction and URL
normalization. Format-specific details are preserved in the specifics
container without affecting universal operations.
Source attribution includes: - Processor identification: Clear identification of the processor type that created the object - Location attribution: Complete URL tracking for cache management and debugging - Format metadata preservation: Format-specific details maintained in structured containers - Provenance tracking: Full chain of custody from source to object creation
Confidence-Based Detection¶
Decision: Processor detection uses numerical confidence scores rather than boolean availability checks.
Rationale: Allows graceful handling of edge cases where multiple processors might partially support a documentation source. Provides foundation for processor precedence and quality assessment.
Impact: Detection algorithms must provide meaningful confidence differentiation. The detection system can make informed choices when multiple processors are available for a source.
Confidence scoring methodology: - High confidence (0.9+): Well-structured inventories with substantial content and clear format indicators - Medium confidence (0.7+): Valid inventories meeting minimum structural requirements - Low confidence (0.5+): Partial or potentially problematic inventories that may still be usable - Below threshold: Malformed, empty, or incompatible inventories rejected from consideration
Error Handling Patterns¶
Consistent Error Categories: All processors handle standard error types with uniform reporting and graceful degradation:
Accessibility Errors: Network failures, missing resources, permission denials
Format Errors: Invalid inventory structure, parsing failures, unsupported versions
Configuration Errors: Invalid filter parameters, unsupported operations
System Errors: Unexpected failures, resource exhaustion
Quality Assurance Patterns: Multi-stage validation from raw data through final object creation ensures data integrity and provides detailed error context for debugging inventory processing issues.
Performance Characteristics¶
Detection Caching: Detection results are cached with appropriate TTL values to avoid repeated expensive operations while maintaining data freshness for dynamic documentation sources.
Inventory Caching: Raw inventory data caching at the processor level reduces external service load while ensuring consistent object creation across multiple filter operations.
Object Caching: Formatted inventory objects may be cached when processing large inventories with repeated filter operations to improve response times.
Scalability Considerations: Processors implement streaming parsing for large inventories, pagination support for query results, and memory-efficient object creation patterns to handle documentation sites of varying sizes.
Processor-Provided Formatters System¶
Self-Contained Object Approach¶
The processor-provided formatters design implements self-contained inventory objects
where each inventory processor creates objects that provide formatting intelligence
for their own specifics fields. This approach co-locates domain knowledge with
the processors that create it, making the system truly extensible and maintainable.
Core Principle: Each processor knows best how to present its own data. Sphinx
processors understand domain, role, and priority semantics. MkDocs
processors understand content_preview and page-based organization. Other
processors have their own field semantics that cannot be predicted centrally.
Architectural Foundation: - Self-Contained Objects: Inventory objects provide their own rendering methods without external dependencies - Domain Knowledge Co-location: Objects understand their own field semantics and presentation requirements - Extensibility Without Core Changes: New inventory processors create objects that inherently know how to render themselves
Domain Knowledge Co-location¶
Domain knowledge remains with the processors and objects that understand the data:
Format Expertise: Processors understand their source format’s semantics and conventions
Presentation Logic: Objects know how to render their specific data appropriately
Evolution Together: Data structures and presentation logic evolve in tandem
No External Dependencies: Objects render themselves without requiring external formatting registries
Interface Specifications¶
The InventoryObject class provides self-formatting capabilities through methods
that each processor implements to render format-specific data. See the
results-module-design document for complete interface specifications.
class InventoryObject( __.immut.DataclassObject ):
''' Universal inventory object with self-formatting capabilities. '''
def render_specifics_markdown(
self, /, *,
show_technical: __.typx.Annotated[ bool, __.ddoc.Doc( '...' ) ] = True
) -> tuple[ str, ... ]:
''' Renders specifics as Markdown lines for CLI display. '''
def render_specifics_json( self ) -> dict[ str, __.typx.Any ]:
''' Renders specifics as JSON-serializable dictionary. '''
CLI and JSON Integration Patterns¶
The CLI layer integrates with self-formatting objects through standardized interfaces:
# CLI integration signatures
def _append_inventory_metadata(
lines: __.cabc.MutableSequence[ str ],
inventory_object: __.cabc.Mapping[ str, __.typx.Any ]
) -> None:
''' Appends inventory metadata using object self-formatting. '''
def _append_content_description(
lines: __.cabc.MutableSequence[ str ],
document: __.cabc.Mapping[ str, __.typx.Any ],
inventory_object: __.cabc.Mapping[ str, __.typx.Any ],
) -> None:
''' Appends content description with standard fallbacks. '''
Serialization supports self-formatting objects:
# Serialization signatures
def serialize_for_json( obj: __.typx.Any ) -> __.typx.Any:
''' Serialization supporting self-formatting objects. '''
def _serialize_dataclass_for_json( obj: __.typx.Any ) -> dict[ str, __.typx.Any ]:
''' Serializes dataclass objects using render_specifics_json when available. '''
Example Implementation Patterns¶
Each processor creates objects that understand format-specific rendering:
Sphinx-specific rendering: Sphinx inventory objects implement rendering that shows role and domain information directly, uses Sphinx terminology that users understand, and includes source attribution and priority when technical details are requested.
MkDocs-specific rendering: MkDocs inventory objects implement rendering that emphasizes document/page nature, shows navigation context and page hierarchy when available, and consistently displays document type and page structure.
Detection and Discovery¶
Detection Interface Contracts¶
All inventory processors implement standardized detection interfaces that provide consistent behavior across different inventory formats:
class InventoryDetection( Detection ):
''' Base class for inventory processor detection. '''
@__.typx.abc.abstractmethod
async def detect_async(
self,
location: str, /, *,
auxdata: __.state.Globals
) -> DetectionResult:
''' Detects inventory availability with confidence scoring. '''
@__.typx.abc.abstractmethod
def format_inventory_object(
self,
source_data: __.typx.Any,
location_url: str, /, *,
auxiliary_data: __.typx.Optional[ __.typx.Any ] = None,
) -> InventoryObject:
''' Formats source data into inventory object with self-formatting capabilities. '''
Detection Contract: Async detection returning confidence-scored results with optional caching of preliminary inventory data for performance optimization.
Object Creation Contract: Unified object creation interface that converts format-specific source data into universal inventory objects with complete attribution and self-formatting capabilities.
Confidence Scoring Methodology¶
Confidence scoring provides consistent assessment of inventory source quality and processor compatibility:
Scoring Factors: - Structural Validity: Well-formed inventory data matching expected format patterns - Content Quality: Sufficient object count and metadata richness for useful operations - Format Indicators: Clear markers indicating the expected inventory format - Accessibility: Reliable access to inventory data without errors or restrictions
Consistency Requirements: All processors use equivalent confidence scales and assessment criteria to ensure reliable processor selection across different inventory formats.
Calibration Standards: Regular validation against known good and problematic inventory sources ensures confidence scores remain meaningful and comparable.
Processor Selection Patterns¶
The detection system provides optimal processor selection based on confidence scores and capability matching:
Selection Algorithm: 1. Confidence Ranking: Primary selection based on detection confidence scores 2. Capability Matching: Secondary filtering based on required operation capabilities 3. Performance Characteristics: Consideration of processor performance profiles 4. Precedence Rules: Explicit precedence handling for overlapping processor capabilities
Multi-Processor Scenarios: When multiple processors detect inventory sources, the system applies consistent selection logic while maintaining user experience predictability.
Cache Integration Strategy¶
Caching strategy optimizes performance while maintaining data freshness:
Detection Result Caching: Confidence-scored detection results cached with TTL management to avoid repeated expensive detection operations.
Preliminary Data Caching: Detection processes may cache preliminary inventory data when it can be reused for subsequent processing operations.
Cache Invalidation: TTL expiration and explicit invalidation triggers ensure cached data remains current with source changes.
Memory Management: Cache size limits and LRU eviction policies prevent memory exhaustion during extended operation periods.
Error Handling for Detection Failures¶
Robust error handling ensures graceful degradation when detection fails:
Error Categories: - Network Errors: Connection failures, timeouts, DNS resolution problems - Authentication Errors: Permission denied, credential failures, access restrictions - Format Errors: Unexpected inventory structure, parsing failures, version incompatibilities - Resource Errors: Memory exhaustion, disk space issues, system resource limitations
Recovery Strategies: Automatic retry with exponential backoff, graceful degradation to alternative processors, and comprehensive error logging for debugging support.
Base Interfaces and Protocols¶
InventoryDetection Abstract Base Class¶
The InventoryDetection abstract base class provides the foundation for all
inventory processor implementations:
class InventoryDetection( Detection ):
''' Base class providing unified inventory processor interface. '''
@property
@__.typx.abc.abstractmethod
def processor_class( self ) -> type[ InventoryProcessor ]:
''' Returns the processor class for this detection result. '''
@property
@__.typx.abc.abstractmethod
def capabilities( self ) -> __.immut.Dictionary[ str, __.typx.Any ]:
''' Returns processor capability information. '''
Universal Interface Contracts¶
All inventory processors implement identical interface contracts to ensure consistent behavior and interoperability:
Detection Interface: Standardized async detection with confidence scoring, capability advertisement, and optional preliminary data caching.
Processing Interface: Consistent inventory acquisition, query operations, and filtering capabilities across all processor implementations.
Object Creation Interface: Unified object formatting method signatures that create self-formatting inventory objects with complete source attribution.
Core Processing Methods¶
All inventory processors implement standardized processing methods:
class InventoryProcessor( __.abc.ABC ):
''' Base class for inventory processors. '''
@__.typx.abc.abstractmethod
async def query_inventory(
self,
term: __.Absential[ str ] = __.absent, *,
filters: __.cabc.Mapping[ str, __.typx.Any ] = __.immut.Dictionary( ),
details: __.InventoryQueryDetails = __.InventoryQueryDetails.Documentation,
results_max: int = 1000,
) -> tuple[ InventoryObject, ... ]:
''' Returns inventory objects matching search and filter criteria.
When term is absent and filters are empty or trivial,
returns complete inventory (equivalent to acquire_inventory).
When term is present or filters contain constraints,
returns filtered subset limited by results_max.
'''
Contract Specifications:
- query_inventory serves dual purpose: complete inventory retrieval and filtering
- Absent term with empty/trivial filters returns entire inventory
- Present term or non-trivial filters return matching subset limited by results_max
- Search and filtering occur at processor level using format-specific knowledge
- Results include both structural filtering and name-based search capabilities
format_inventory_object Unified Signature¶
The unified format_inventory_object signature ensures consistent object creation
across all processor implementations:
@__.typx.abc.abstractmethod
def format_inventory_object(
self,
source_data: __.typx.Any,
location_url: str, /, *,
auxiliary_data: __.typx.Optional[ __.typx.Any ] = None,
) -> InventoryObject:
''' Formats source data into inventory object with self-formatting capabilities.
Args:
source_data: Format-specific source data (Sphinx object, MkDocs document, etc.)
location_url: Complete URL to inventory location for attribution
auxiliary_data: Additional context data (inventory metadata, etc.)
'''
Parameter Standardization: Consistent parameter names, types, and semantics across all processor implementations eliminate interface confusion.
Type Safety: Strong typing ensures compile-time validation of processor implementations and caller code.
Extensibility: Optional auxiliary data parameter provides extension point for processor-specific enhancements without breaking interface compatibility.
Capability Advertisement Patterns¶
Processors advertise their capabilities through standardized metadata:
class ProcessorCapabilities( __.immut.DataclassObject ):
''' Processor capability advertisement. '''
supported_inventory_types: frozenset[ str ]
supported_filters: frozenset[ str ]
performance_characteristics: __.immut.Dictionary[ str, __.typx.Any ]
operational_constraints: __.immut.Dictionary[ str, __.typx.Any ]
Capability Discovery: Dynamic capability discovery enables system adaptation to available processors and their operational characteristics.
Filter Advertisement: Processors advertise supported filter types, enabling validation of user requests before processing begins.
Performance Profiles: Capability information includes performance characteristics for operation planning and resource allocation.
Validation and Type Safety¶
Strong validation ensures system reliability and provides clear error feedback:
Interface Validation: Compile-time and runtime validation of processor implementations against abstract base class contracts.
Data Validation: Multi-stage validation from raw inventory data through final object creation with detailed error context.
Type Safety: Comprehensive type annotations enable static analysis and provide clear interface contracts for processor implementers.
Error Propagation: Structured error handling with detailed context information supports debugging and system monitoring.
Implementation Outline¶
Processor-Specific Data Source Handling Patterns¶
Inventory processors handle diverse data source formats through specialized parsing and validation strategies:
Data Source Diversity: Processors accommodate various inventory formats including binary files, JSON documents, XML structures, and custom text formats.
Parsing Strategies: Format-appropriate parsing techniques including streaming parsers for large files, validation schemas for structured data, and error recovery mechanisms for malformed inputs.
Performance Optimization: Memory-efficient processing techniques including lazy loading, incremental parsing, and selective data extraction based on query requirements.
Format-Specific Object Creation Strategies¶
Object creation strategies vary by inventory format while maintaining universal output consistency:
Metadata Normalization: Translation of format-specific metadata into universal object fields while preserving format-specific details in structured containers.
Attribution Strategies: Consistent source attribution patterns that capture complete provenance information including processor type, source location, and format-specific identifiers.
Self-Formatting Integration: Object creation includes formatting method implementation that understands format-specific semantics and presentation requirements.
Detection Methodology and Validation Approaches¶
Detection implementations use format-appropriate validation and confidence assessment techniques:
Probe Strategies: Sequential or parallel probing of standard and alternative inventory locations using format-specific URL patterns.
Validation Criteria: Format-appropriate structural validation including schema compliance, content quality assessment, and compatibility verification.
Confidence Calibration: Consistent confidence scoring based on validation results, content quality metrics, and format-specific quality indicators.
Content Integration and Search Patterns¶
Integration with search and content systems through standardized interfaces:
Search Integration: Universal object interfaces enable format-agnostic search operations while preserving format-specific search capabilities through metadata containers.
Content Coordination: Capability-based filtering ensures inventory objects are only processed by compatible structure processors for content extraction.
Multi-Source Coordination: Source attribution enables tracking and coordination across multiple inventory sources for comprehensive documentation coverage.
Performance Optimization Strategies¶
Performance optimization approaches tailored to inventory processing characteristics:
Caching Strategies: Multi-level caching including detection results, raw inventory data, and formatted objects with appropriate TTL management.
Lazy Loading: Deferred processing of inventory data until required by specific operations to minimize initial load times.
Batch Processing: Efficient batch operations for large inventory processing tasks with memory management and progress tracking.
Scalability and Extension Considerations¶
Design patterns support system scalability and future enhancement:
Memory Management: Bounded memory usage through streaming processing, pagination, and selective data loading based on operational requirements.
Processor Extensibility: Clear extension points for new inventory formats through abstract base class implementation and capability advertisement.
Configuration Management: Flexible configuration systems supporting processor-specific parameters and operational tuning.
Example Implementation Skeletons¶
Sphinx Processor Outline:
- objects.inv binary file handling with decompression and parsing
- Domain/role semantic understanding for object categorization
- Priority-based object ranking and presentation
- Cross-reference resolution for documentation linking
- Theme-independent inventory processing
MkDocs Processor Outline:
- search_index.json file handling with page-level extraction
- Content preview generation from embedded text
- Navigation context extraction from page hierarchy
- Alternative format support for theme-specific variations
- Hybrid content strategy coordination
Extension Points and Future Processors¶
Plugin Architecture Patterns¶
Consistent processor interfaces enable third-party inventory processors through well-defined extension patterns:
Interface Compliance: New processors implement standard abstract base classes with consistent method signatures and behavioral contracts.
Capability Integration: Processor capability advertisement enables system integration without core code modifications.
Registration Mechanisms: Dynamic processor discovery and registration through plugin management systems or configuration-based registration.
Custom Processor Development¶
Clear development patterns support custom inventory processor creation:
Development Guidelines: Comprehensive documentation of interface requirements, performance expectations, and integration patterns.
Testing Frameworks: Standardized testing patterns and validation suites for processor development and verification.
Reference Implementations: Well-documented reference processors demonstrate implementation patterns and best practices.
Capability Evolution Support¶
System design accommodates processor capability enhancement over time:
Backward Compatibility: Interface evolution strategies that maintain compatibility with existing processors while enabling enhanced functionality.
Capability Versioning: Version management for processor capabilities enabling gradual system enhancement and feature adoption.
Feature Negotiation: Dynamic feature negotiation between system components based on advertised processor capabilities.
Performance Optimization Strategies¶
Extension points support continued performance optimization:
Custom Caching: Processor-specific caching strategies optimized for particular inventory formats and access patterns.
Parallel Processing: Opportunities for parallel inventory processing with appropriate synchronization and coordination mechanisms.
Resource Management: Adaptive resource allocation based on processor characteristics and operational requirements.
This inventory processor architecture provides a comprehensive foundation for format-agnostic inventory operations while maintaining clean separation between universal interfaces and format-specific implementations. The design supports extensibility, performance optimization, and consistent user experience across diverse documentation source formats.