# File Discovery and Processing Pipeline Design This document specifies the file discovery and processing pipeline system that connects the file system to the existing linter core framework. The design provides source file discovery with ignore pattern integration and parallel processing. ## Design Philosophy The discovery system provides essential file-to-analysis bridging functionality: **Architecture Goal**: Simple function-based interface that feeds filtered source files to existing `Engine.lint_files()` method **Integration Focus**: Build on validated linter core interfaces without requiring changes to rule execution **Performance Focus**: Parallel processing with configurable worker counts for scalability ## Essential Interface ### Core Discovery Functions The file discovery system provides three essential functions: ``` python from . import __ # Type aliases for interface clarity Location: __.typx.TypeAlias = __.pathlib.Path IgnorePatterns: __.typx.TypeAlias = __.cabc.Sequence[ str ] def discover_files( anchors: __.typx.Annotated[ __.cabc.Sequence[ Location ], __.ddoc.Doc( 'Root directories for source file discovery.' ) ], extensions: __.typx.Annotated[ tuple[ str, ... ], __.ddoc.Doc( 'File extensions to treat as source files.' ) ] = ( '.py', '.pyi' ), ignore_patterns: IgnorePatterns = ( ), ignore_files: __.typx.Annotated[ tuple[ str, ... ], __.ddoc.Doc( 'Ignore file names to search for during traversal.' ) ] = ( '.gitignore', ), ) -> __.typx.Annotated[ tuple[ Location, ... ], __.ddoc.Doc( 'Source files found and filtered for linter processing.' ), ]: ''' Discovers source files from anchor locations with ignore pattern filtering. ''' def lint_discovered_files( locations: __.typx.Annotated[ __.cabc.Sequence[ Location ], __.ddoc.Doc( 'Source file locations to process through vibelinter.' ) ], engine: __.typx.Annotated[ __.engine.Engine, __.ddoc.Doc( 'Configured linter engine for file analysis.' ) ], concurrency: __.typx.Annotated[ int, __.ddoc.Doc( 'Files to process concurrently (1 = sequential).' ) ] = 4, continue_on_errors: __.typx.Annotated[ bool, __.ddoc.Doc( 'Whether to continue processing after individual file failures.' ) ] = True, ) -> __.typx.Annotated[ tuple[ __.reporting.Report, ... ], __.ddoc.Doc( 'Diagnostic reports from successful file processing.' ), ]: ''' Processes files through linter engine with parallel execution and error isolation. ''' def discover_and_lint( anchors: __.cabc.Sequence[ Location ], engine: __.engine.Engine, extensions: tuple[ str, ... ] = ( '.py', '.pyi' ), ignore_patterns: IgnorePatterns = ( ), concurrency: int = 4, ) -> __.typx.Annotated[ tuple[ __.reporting.Report, ... ], __.ddoc.Doc( 'Complete pipeline results from file discovery through linting.' ), ]: ''' Convenience function combining file discovery and linting in single operation. ''' ``` ## Implementation Architecture ### File Detection Strategy Source file detection uses multiple strategies for comprehensive coverage: ``` python from . import __ def detect_source_file( location: Location, extensions: __.cabc.Sequence[ str ] ) -> bool: ''' Detects source files by extension and shebang analysis. ''' def analyze_shebang( location: Location ) -> bool: ''' Analyzes file shebang to detect appropriate interpreter usage. ''' ``` ### Ignore Pattern Integration The filtering system supports multiple ignore file formats with extensible pattern matching: ``` python from . import __ def collect_ignore_patterns( anchor: Location, ignore_files: __.cabc.Sequence[ str ], ) -> tuple[ str, ... ]: ''' Collects ignore patterns from ignore files in location hierarchy. ''' def should_ignore( location: Location, patterns: __.cabc.Sequence[ str ], ) -> bool: ''' Checks if location should be ignored based on patterns using glob-style matching. ''' ``` ### Parallel Processing Framework The processing system provides configurable parallel execution with error isolation: ``` python from . import __ def process_files( locations: __.cabc.Sequence[ Location ], processor: __.cabc.Callable[ [ Location ], __.reporting.Report ], concurrency: int, continue_on_errors: bool, ) -> tuple[ __.reporting.Report, ... ]: ''' Processes files using parallel execution with error handling and isolation. ''' ``` ## Error Handling Design ### Exception Hierarchy Simple exception hierarchy for discovery and processing failures: ``` python from . import __ class FileDiscoverFailure( __.Omnierror ): ''' Raised when file discovery encounters unrecoverable errors during traversal. ''' ``` ## Module Organization ### Minimal Module Structure The discovery framework uses minimal module organization following established patterns: ``` sources/vibelinter/discovery/ ├── __.py # Discovery imports ├── __init__.py # Package entry point ├── functions.py # Core discovery and processing functions └── utilities.py # File detection and filtering utilities ``` ### Import Organization Import structure following established patterns: ``` python # sources/vibelinter/discovery/__.py from ..__ import * # sources/vibelinter/discovery/__init__.py from . import __ from .functions import discover_files, lint_discovered_files, discover_and_lint ``` ## Design Validation ### Framework Integration Verification The design integrates seamlessly with existing architectural components: **Engine Integration:** - Provides [tuple\[Path, \...\]]{.title-ref} that integrates directly with [Engine.lint_files()]{.title-ref} - No changes required to existing linter core interfaces - Clean separation between discovery and analysis concerns **Practices Compliance:** - Wide parameter types ([\_\_.cabc.Sequence]{.title-ref}) for flexible input interfaces - Narrow return types ([tuple]{.title-ref}) for concrete results - Proper [\_\_.typx.Annotated]{.title-ref} patterns with [\_\_.ddoc.Doc]{.title-ref} documentation - Function signatures follow established spacing and bracket conventions - Type aliases for complex reused types ([Location]{.title-ref}, [IgnorePatterns]{.title-ref}) **Configuration Integration:** - Simple parameter-based configuration avoiding complex configuration objects - Extensible ignore pattern system supporting [.gitignore]{.title-ref}, [.hgignore]{.title-ref}, and custom formats - Configurable parallel processing adapting to different use cases **Performance Characteristics:** - Parallel processing with configurable worker counts - Error isolation preventing individual file failures from stopping batch processing - Early filtering during traversal minimizing unnecessary file system operations **Implementation Readiness:** The design provides complete interface specifications that: - Build directly on existing Engine interfaces without modification - Support both interactive CLI usage and programmatic integration - Scale from individual files to large multi-package codebases with parallel processing - Provide robust error handling while maintaining architectural simplicity ## File Discovery vs. Engine Processing Architectural Separation ### Design Decision Rationale The file discovery system maintains deliberate separation from Engine processing to ensure clear architectural boundaries: **File Discovery System Responsibilities:** - Path resolution and expansion from user-provided patterns - .gitignore integration and custom ignore pattern processing - File system traversal with configurable filtering - Error handling for path access and permission issues - Parallel file enumeration for performance optimization **Engine Processing System Responsibilities:** - Python source code parsing and AST construction - Rule execution and violation collection - LibCST metadata provider coordination - Analysis performance optimization and timing - Rule-specific error handling and reporting **Separation Rationale:** - **Single Responsibility Principle**: Discovery determines \"which files to analyze,\" Engine determines \"how to analyze files\" - **Error Handling Isolation**: Discovery handles file system errors (permissions, missing files), Engine handles parsing and analysis errors - **Performance Boundaries**: Discovery optimizes file system operations, Engine optimizes analysis execution - **Testing Isolation**: Discovery tested with file system mocks, Engine tested with code samples - **Scalability**: Discovery can be parallelized independently from Engine analysis patterns **Interface Design:** The separation maintains clean interfaces where Discovery produces [tuple\[Path, \...\]]{.title-ref} that integrates directly with [Engine.lint_files()]{.title-ref} without requiring Engine knowledge of file system concerns or Discovery knowledge of analysis patterns. This architectural separation enables independent optimization and evolution of file system operations versus code analysis execution while maintaining clear integration points. This file discovery design bridges the file system and linter core framework while maintaining architectural simplicity and practices compliance, with extensibility for future source file types including documentation-embedded code snippets.