002. Search Architecture Separation

Status

Accepted

Context

The system requires comprehensive search capabilities across multiple documentation formats with different structural characteristics. The original design had duplicate search logic in each processor, leading to:

Code duplication across Sphinx and MkDocs processors
Inconsistent search behavior between different documentation formats
Maintenance overhead when updating search algorithms or adding new match modes
Coupling between search logic and format-specific data extraction

Key forces driving this decision:

Search quality should be consistent regardless of documentation format
Adding new processors should not require reimplementing search algorithms
Search improvements should benefit all supported formats simultaneously
Format-specific knowledge should remain in processors while search logic is universal

The system needs to support multiple search modes:

- Exact string matching for precise queries
- Regex pattern matching for complex searches
- Fuzzy matching with configurable thresholds for approximate searches
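To make the three modes concrete, here is a minimal sketch of a per-name match test. The function name `matches`, its parameters, and the 0.7 default threshold are illustrative assumptions, not the project's API. Exact mode is assumed to mean verbatim substring containment, and the standard library's `difflib.SequenceMatcher` stands in for a fuzzy scorer so the snippet runs without extra dependencies.

```python
import re
from difflib import SequenceMatcher


def matches(query: str, name: str, mode: str = "exact", threshold: float = 0.7) -> bool:
    """Return True when `name` matches `query` under the given mode (hypothetical helper)."""
    if mode == "exact":
        # Assumption: "exact" means the query appears verbatim inside the name.
        return query in name
    if mode == "regex":
        # The query is treated as a regular expression searched anywhere in the name.
        return re.search(query, name) is not None
    if mode == "fuzzy":
        # ratio() is in [0, 1]; the threshold is the configurable cut-off.
        ratio = SequenceMatcher(None, query.lower(), name.lower()).ratio()
        return ratio >= threshold
    raise ValueError(f"unknown match mode: {mode!r}")
```

A call such as `matches("jion", "join", mode="fuzzy")` tolerates the transposed letters, whereas the same query in exact mode is rejected.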

Decision

Implement a layered search architecture with clear separation between universal search logic and processor-specific data extraction:

Universal Search Layer (search.py):

- Centralized search algorithms using rapidfuzz for fuzzy matching
- Support for exact, regex, and fuzzy matching modes with unified interface
- Consistent scoring and ranking algorithms across all processors
- Structured SearchResult objects with match metadata and scoring
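The scoring and ranking role of the universal layer can be sketched as follows. This is an illustration under assumptions, not the contents of search.py: the `SearchResult` fields, the `filter_by_name` signature, the 0 to 100 score scale, and the default threshold are all hypothetical, and `difflib.SequenceMatcher` is swapped in for rapidfuzz so the example needs only the standard library.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, List


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical result record; the real field names may differ."""
    name: str
    score: float     # 0-100, higher is a closer match
    match_mode: str  # the mode that produced this match


def filter_by_name(names: Iterable[str], query: str, threshold: float = 60.0) -> List[SearchResult]:
    """Score every name against the query and return matches, best first."""
    results = []
    for name in names:
        # Stand-in scorer: difflib similarity scaled to 0-100.
        score = 100.0 * SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= threshold:
            results.append(SearchResult(name=name, score=score, match_mode="fuzzy"))
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(results, key=lambda r: (-r.score, r.name))
```

Because the scorer and the sort key live in one function, every processor that passes names through it receives the same ranking behavior.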

Processor Responsibility Separation:

- Processors handle format-specific data extraction and filtering
- Processors apply domain/role/priority filters using format-specific knowledge
- Universal search layer applies name matching and ranking
- Clear handoff points between extraction and search phases

Search Flow Architecture:

1. Functions layer receives user query with search parameters
2. Processor layer extracts and filters objects using format-specific logic
3. Universal search layer applies name matching via search.filter_by_name()
4. Processor layer fetches full documentation content for top candidates
5. Functions layer formats and returns results with consistent structure
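The five steps of the search flow can be traced in a short end-to-end sketch. Everything here other than the name `filter_by_name` is an assumption made for illustration: `DocObject`, `InMemoryProcessor`, `search_docs`, and their parameters are hypothetical, and the name matching is reduced to a case-insensitive substring test to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DocObject:
    name: str
    domain: str
    uri: str


class InMemoryProcessor:
    """Hypothetical stand-in for a Sphinx or MkDocs processor."""

    def __init__(self, objects: List[DocObject], pages: Dict[str, str]):
        self._objects = objects
        self._pages = pages

    def extract_objects(self, domain: Optional[str] = None) -> List[DocObject]:
        # Format-specific extraction and filtering happens in the processor.
        return [o for o in self._objects if domain is None or o.domain == domain]

    def fetch_content(self, obj: DocObject) -> str:
        return self._pages.get(obj.uri, "")


def filter_by_name(objects: List[DocObject], query: str) -> List[DocObject]:
    # Simplified universal matching: substring test, shorter names ranked first.
    q = query.lower()
    hits = [o for o in objects if q in o.name.lower()]
    return sorted(hits, key=lambda o: (len(o.name), o.name))


def search_docs(processor, query: str, domain: Optional[str] = None, limit: int = 5) -> List[dict]:
    """Functions layer: coordinates the processor and the search layer (step 1)."""
    candidates = processor.extract_objects(domain=domain)  # step 2
    ranked = filter_by_name(candidates, query)[:limit]     # step 3
    return [                                               # steps 4 and 5
        {"name": o.name, "uri": o.uri, "content": processor.fetch_content(o)}
        for o in ranked
    ]
```

Note that content is fetched only after ranking and truncation, so the potentially expensive step 4 runs for the top candidates rather than for every extracted object.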

Alternatives

Alternative 1: Processor-Specific Search Implementations

- Rejected because it leads to code duplication and inconsistent behavior
- Would require each new processor to reimplement search algorithms
- Makes it difficult to improve search quality across all formats
- Results in maintenance overhead when updating search logic

Alternative 2: Search-Specific Processor Wrappers

- Considered but rejected due to added complexity
- Would create additional abstraction layer without clear benefits
- Could lead to unclear responsibility boundaries
- Doesn’t address the core duplication issue

Alternative 3: Hybrid Search with Format-Specific Extensions

- Rejected as over-engineered for current requirements
- Would allow format-specific search enhancements but adds complexity
- Could be reconsidered if strong format-specific search needs emerge

Alternative 4: External Search Service

- Rejected as inappropriate for the deployment model
- Would require additional infrastructure and network dependencies
- Conflicts with goal of standalone operation

Consequences

Positive Consequences:

Consistency: Identical search behavior across all documentation formats
Maintainability: Single location for search algorithm improvements
Extensibility: New processors get full search capabilities automatically
Performance: Optimized search algorithms benefit all formats
Quality: Centralized scoring enables sophisticated relevance ranking

Negative Consequences:

Limited format-specific optimization: Cannot easily customize search for format characteristics
Abstraction overhead: Additional layer between processors and search logic
Testing complexity: Must verify search behavior across all processor types

Implementation Impacts:

Processors must implement consistent data extraction interfaces
Search layer must handle varying object metadata formats
Functions layer coordinates between processors and search components
Result formatting must accommodate different processor output structures
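One way to express the consistent extraction interface is a structural type that every processor satisfies. The sketch below is an assumption about how such a contract could look, not the project's actual interface: `DocProcessor`, `SphinxLikeProcessor`, `names_for_search`, and the method names are hypothetical.

```python
from typing import Dict, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class DocProcessor(Protocol):
    """Hypothetical contract that each format-specific processor fulfils."""

    def extract_objects(self, domain: Optional[str] = None) -> List[Dict[str, str]]: ...

    def fetch_content(self, name: str) -> str: ...


class SphinxLikeProcessor:
    """Illustrative processor backed by an in-memory inventory."""

    def __init__(self, inventory: Dict[str, Dict[str, str]]):
        self._inventory = inventory

    def extract_objects(self, domain: Optional[str] = None) -> List[Dict[str, str]]:
        # Normalise each entry to a dict that always carries a "name" key.
        return [
            {"name": name, **meta}
            for name, meta in self._inventory.items()
            if domain is None or meta.get("domain") == domain
        ]

    def fetch_content(self, name: str) -> str:
        return self._inventory[name].get("content", "")


def names_for_search(processor: DocProcessor) -> List[str]:
    """Handoff point: the search layer needs only names, whatever the format."""
    if not isinstance(processor, DocProcessor):
        raise TypeError("processor does not implement the extraction interface")
    return [obj["name"] for obj in processor.extract_objects()]
```

The `isinstance` check on a runtime-checkable protocol verifies only that the methods exist, not their signatures, so it acts as a cheap guard rather than full validation.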

Migration Benefits:

Eliminated duplicate search code from existing processors
Unified search interface simplifies adding new match modes
Consistent API enables easier testing and validation
Improved search quality through dedicated optimization focus

Future Flexibility:

Architecture supports adding new search modes without processor changes
Enables sophisticated ranking algorithms based on multiple factors
Allows for search analytics and optimization without format coupling
Provides a foundation for potential machine-learning-enhanced search
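The claim that new search modes need no processor changes can be illustrated with a small mode registry. This is a sketch under assumptions: the registry, the `register_mode` decorator, `filter_names`, and the `prefix` mode are hypothetical and chosen only to show the extension point; exact mode is again assumed to mean substring containment.

```python
import re
from typing import Callable, Dict, Iterable, List

Matcher = Callable[[str, str], bool]

# Hypothetical registry owned by the universal search layer.
MATCHERS: Dict[str, Matcher] = {}


def register_mode(mode: str) -> Callable[[Matcher], Matcher]:
    """Decorator that makes a matcher available under the given mode name."""
    def decorator(fn: Matcher) -> Matcher:
        MATCHERS[mode] = fn
        return fn
    return decorator


@register_mode("exact")
def _exact(query: str, name: str) -> bool:
    return query in name


@register_mode("regex")
def _regex(query: str, name: str) -> bool:
    return re.search(query, name) is not None


def filter_names(names: Iterable[str], query: str, mode: str) -> List[str]:
    """Single entry point that processors call, regardless of mode."""
    try:
        matcher = MATCHERS[mode]
    except KeyError:
        raise ValueError(f"unknown match mode: {mode!r}") from None
    return [n for n in names if matcher(query, n)]


# Adding a mode later touches only the search layer; callers stay unchanged.
@register_mode("prefix")
def _prefix(query: str, name: str) -> bool:
    return name.startswith(query)
```

Processors keep calling `filter_names` with a mode string, so a newly registered mode becomes available to every format at once.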