002. Search Architecture Separation

Status

Accepted

Context

The system requires comprehensive search capabilities across multiple documentation formats with different structural characteristics. The original design had duplicate search logic in each processor, leading to:

  • Code duplication across Sphinx and MkDocs processors

  • Inconsistent search behavior between different documentation formats

  • Maintenance overhead when updating search algorithms or adding new match modes

  • Coupling between search logic and format-specific data extraction

Key forces driving this decision:

  • Search quality should be consistent regardless of documentation format

  • Adding new processors should not require reimplementing search algorithms

  • Search improvements should benefit all supported formats simultaneously

  • Format-specific knowledge should remain in processors while search logic is universal

The system needs to support multiple search modes:

  • Exact string matching for precise queries

  • Regex pattern matching for complex searches

  • Fuzzy matching with configurable thresholds for approximate searches
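The three modes can be illustrated with a minimal sketch. The function name match_name, the 0-100 score scale, and the default threshold are assumptions made for illustration, not the project's API. The standard-library difflib stands in for rapidfuzz so the example runs without dependencies, and "exact" is assumed here to mean case-insensitive substring containment.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def match_name(query: str, name: str, mode: str = "exact",
               threshold: float = 60.0) -> Optional[float]:
    """Return a 0-100 score if `name` matches `query` in `mode`, else None."""
    if mode == "exact":
        # Assumption: "exact" means case-insensitive substring containment.
        return 100.0 if query.lower() in name.lower() else None
    if mode == "regex":
        # The query is treated as a regular expression searched within the name.
        return 100.0 if re.search(query, name) else None
    if mode == "fuzzy":
        # difflib similarity ratio (0.0-1.0) scaled to 0-100; a stand-in for rapidfuzz.
        score = 100.0 * SequenceMatcher(None, query.lower(), name.lower()).ratio()
        return score if score >= threshold else None
    raise ValueError(f"unknown match mode: {mode!r}")
```

In this sketch only the fuzzy mode consults the threshold; exact and regex matches are all-or-nothing.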

Decision

Implement a layered search architecture with clear separation between universal search logic and processor-specific data extraction:

Universal Search Layer (search.py):

  • Centralized search algorithms using rapidfuzz for fuzzy matching

  • Support for exact, regex, and fuzzy matching modes with unified interface

  • Consistent scoring and ranking algorithms across all processors

  • Structured SearchResult objects with match metadata and scoring
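One possible shape for such a layer is sketched below. The SearchResult fields, the filter_by_name signature, the SCORERS table, and the 0-100 score scale are all hypothetical; the actual search.py may differ. difflib is used as a dependency-free stand-in for rapidfuzz.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Any, Callable, Iterable

# One scorer per match mode; each maps (query, name) to a 0-100 score.
# difflib is a stdlib stand-in for rapidfuzz in this sketch.
SCORERS: dict[str, Callable[[str, str], float]] = {
    "exact": lambda q, n: 100.0 if q.lower() in n.lower() else 0.0,
    "regex": lambda q, n: 100.0 if re.search(q, n) else 0.0,
    "fuzzy": lambda q, n: 100.0 * SequenceMatcher(None, q.lower(), n.lower()).ratio(),
}


@dataclass
class SearchResult:
    """Hypothetical result record: the matched object plus match metadata."""
    name: str
    score: float
    match_mode: str
    obj: dict[str, Any]


def filter_by_name(objects: Iterable[dict[str, Any]], query: str,
                   mode: str = "fuzzy", threshold: float = 60.0) -> list[SearchResult]:
    """Score every object's name and return matches ranked by score."""
    scorer = SCORERS[mode]
    # Exact and regex are all-or-nothing; only fuzzy uses the threshold.
    cutoff = threshold if mode == "fuzzy" else 100.0
    results = []
    for obj in objects:
        score = scorer(query, obj["name"])
        if score >= cutoff:
            results.append(SearchResult(obj["name"], score, mode, obj))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(results, key=lambda r: (-r.score, r.name))
```

Keeping the scorers in a single table means a new match mode is one new entry, and every processor that calls filter_by_name picks it up without changes.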

Processor Responsibility Separation:

  • Processors handle format-specific data extraction and filtering

  • Processors apply domain/role/priority filters using format-specific knowledge

  • Universal search layer applies name matching and ranking

  • Clear handoff points between extraction and search phases

Search Flow Architecture:

  1. Functions layer receives user query with search parameters

  2. Processor layer extracts and filters objects using format-specific logic

  3. Universal search layer applies name matching via search.filter_by_name()

  4. Processor layer fetches full documentation content for top candidates

  5. Functions layer formats and returns results with consistent structure
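The five steps of the search flow can be sketched end to end as follows. Everything here is illustrative: DemoProcessor, extract_objects, fetch_content, query_documentation, and the result dictionary layout are hypothetical names, and filter_by_name is a simplified substring-based stand-in for the real search.filter_by_name().

```python
from __future__ import annotations

from typing import Any


class DemoProcessor:
    """Hypothetical processor holding an in-memory object inventory."""

    def __init__(self, inventory: list[dict[str, Any]]) -> None:
        self._inventory = inventory

    def extract_objects(self, domain: str | None = None) -> list[dict[str, Any]]:
        # Step 2: format-specific extraction and filtering (here, by domain).
        return [o for o in self._inventory if domain is None or o["domain"] == domain]

    def fetch_content(self, obj: dict[str, Any]) -> str:
        # Step 4: a real processor would load and parse the documentation page.
        return f"Documentation for {obj['name']}"


def filter_by_name(objects: list[dict[str, Any]], query: str) -> list[dict[str, Any]]:
    # Step 3 stand-in: substring match, shorter names ranked first.
    q = query.lower()
    hits = [o for o in objects if q in o["name"].lower()]
    return sorted(hits, key=lambda o: (len(o["name"]), o["name"]))


def query_documentation(processor: DemoProcessor, query: str,
                        domain: str | None = None,
                        max_results: int = 5) -> list[dict[str, Any]]:
    # Step 1: the functions layer receives the query and search parameters.
    candidates = processor.extract_objects(domain=domain)
    ranked = filter_by_name(candidates, query)
    top = ranked[:max_results]
    # Step 5: results are returned in one consistent structure.
    return [
        {"name": o["name"], "domain": o["domain"], "content": processor.fetch_content(o)}
        for o in top
    ]
```

Fetching content only for the top candidates (step 4 after step 3) keeps the expensive page loading proportional to max_results rather than to the size of the inventory.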

Alternatives

Alternative 1: Processor-Specific Search Implementations

  • Rejected because it leads to code duplication and inconsistent behavior

  • Would require each new processor to reimplement search algorithms

  • Makes it difficult to improve search quality across all formats

  • Results in maintenance overhead when updating search logic

Alternative 2: Search-Specific Processor Wrappers

  • Considered but rejected due to added complexity

  • Would create additional abstraction layer without clear benefits

  • Could lead to unclear responsibility boundaries

  • Doesn’t address the core duplication issue

Alternative 3: Hybrid Search with Format-Specific Extensions

  • Rejected as over-engineered for current requirements

  • Would allow format-specific search enhancements but adds complexity

  • Could be reconsidered if strong format-specific search needs emerge

Alternative 4: External Search Service

  • Rejected as inappropriate for the deployment model

  • Would require additional infrastructure and network dependencies

  • Conflicts with goal of standalone operation

Consequences

Positive Consequences:

  • Consistency: Identical search behavior across all documentation formats

  • Maintainability: Single location for search algorithm improvements

  • Extensibility: New processors get full search capabilities automatically

  • Performance: Optimized search algorithms benefit all formats

  • Quality: Centralized scoring enables sophisticated relevance ranking

Negative Consequences:

  • Limited format-specific optimization: Cannot easily customize search for format characteristics

  • Abstraction overhead: Additional layer between processors and search logic

  • Testing complexity: Must verify search behavior across all processor types

Implementation Impacts:

  • Processors must implement consistent data extraction interfaces

  • Search layer must handle varying object metadata formats

  • Functions layer coordinates between processors and search components

  • Result formatting must accommodate different processor output structures
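As a rough illustration of these impacts, a shared extraction interface could be expressed with typing.Protocol. The Processor protocol, the two example processor classes, and search_names are hypothetical; the only assumption the search side makes is that every extracted object carries a "name" key, while the remaining metadata varies by format.

```python
from __future__ import annotations

from typing import Any, Protocol


class Processor(Protocol):
    """Hypothetical extraction interface every processor would satisfy."""

    def extract_objects(self) -> list[dict[str, Any]]: ...


class SphinxLikeProcessor:
    def __init__(self, inventory: dict[str, str]) -> None:
        self._inventory = inventory  # object name -> role

    def extract_objects(self) -> list[dict[str, Any]]:
        return [{"name": n, "role": r, "source": "sphinx"}
                for n, r in self._inventory.items()]


class MkDocsLikeProcessor:
    def __init__(self, pages: list[str]) -> None:
        self._pages = pages  # page titles

    def extract_objects(self) -> list[dict[str, Any]]:
        return [{"name": p, "source": "mkdocs"} for p in self._pages]


def search_names(processor: Processor, query: str) -> list[str]:
    # Only the "name" key is required; other metadata differs per format.
    q = query.lower()
    return sorted(o["name"] for o in processor.extract_objects()
                  if q in o["name"].lower())
```

Because search_names depends only on the protocol, a new processor that implements extract_objects gets the same matching behavior without touching the search code.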

Migration Benefits:

  • Eliminated duplicate search code from existing processors

  • Unified search interface simplifies adding new match modes

  • Consistent API enables easier testing and validation

  • Improved search quality through dedicated optimization focus

Future Flexibility:

  • Architecture supports adding new search modes without processor changes

  • Enables sophisticated ranking algorithms based on multiple factors

  • Allows for search analytics and optimization without format coupling

  • Provides a foundation for potential machine-learning-enhanced search