002. Syntax Tree Analysis Technology Selection

Status

VALIDATED - Technology choice confirmed through comprehensive validation (August 2025)

Context

The Python linter requires sophisticated source code analysis to implement the four core rules:

Function ordering analysis: Requires precise source positioning and scope analysis
Blank line detection: Needs access to formatting and whitespace information
Naming convention analysis: Requires scope awareness and symbol resolution
Type annotation analysis: Needs qualified name resolution and import tracking

The linter must achieve performance targets of processing 1000 lines in <1000ms while providing precise line/column error reporting. Analysis of multiple implementation approaches reveals several viable Python syntax analysis technologies.

Key Requirements:
- Preserve formatting information (whitespace, comments, blank lines)
- Provide precise source positioning (line/column coordinates)
- Support scope analysis for name resolution
- Handle qualified names and import resolution
- Maintain compatibility with Python 3.10+ syntax features
- Enable future auto-fix capabilities through code transformation

Evaluation Criteria:
- Formatting preservation: Critical for blank line and spacing rules
- Metadata richness: Scope, position, and qualified name information
- Performance: Analysis speed for target performance requirements
- Transformation support: Future auto-fix implementation capability
- Maintenance status: Active development and Python version support
- Learning curve: Developer productivity and documentation quality

Decision

We will use LibCST (Concrete Syntax Tree) as the primary syntax analysis technology.

LibCST provides:

Complete formatting preservation: Retains all whitespace, comments, and syntactic details
Rich metadata providers: PositionProvider, ScopeProvider, QualifiedNameProvider built-in
Visitor pattern support: Clean traversal API matching our rule architecture
Transformation capabilities: CSTTransformer for future auto-fix features
Active maintenance: Developed and maintained by Meta/Instagram
Modern Python support: Full compatibility with Python 3.10+ features

Integration approach:

# Core integration pattern
from typing import List

import libcst as cst
from libcst.metadata import (
    MetadataWrapper,
    PositionProvider,       # Line/column coordinates
    ScopeProvider,          # Variable and function scope analysis
    QualifiedNameProvider,  # Full import path resolution
)

# Analysis pipeline (Violation and enabled_rules come from the rule framework)
def analyze_file(source_code: str, filename: str) -> List[Violation]:
    # Parse once; every rule shares the same wrapper and its resolved metadata
    module = cst.parse_module(source_code)
    wrapper = MetadataWrapper(module)

    violations: List[Violation] = []
    for rule in enabled_rules:
        # Each rule reports its own violations for this file
        violations.extend(rule.check(wrapper, filename))

    return violations

Metadata utilization strategy:

PositionProvider: Precise error location reporting for all violations
ScopeProvider: Name collision detection for simple naming rule
QualifiedNameProvider: Import resolution for type annotation analysis
Performance optimization: Single metadata computation per file analysis

Alternatives

Alternative 1: Python AST Module

Use Python’s built-in ast module for syntax analysis.

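Rejected because:
- No formatting preservation: Abstracts away whitespace and comments crucial for our rules
- Limited positioning: Nodes carry lineno and col_offset, but whitespace, comments, and blank lines are not part of the tree and therefore have no positions
- No scope analysis: Requires manual symbol table construction
- Lossy transformation: ast.NodeTransformer and ast.unparse exist, but regenerated code drops comments and original formatting, which rules out faithful auto-fix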

Example limitation:

# AST cannot detect this blank line in function body
def example():
    x = 1

    return x  # Blank line above is invisible to AST
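
The limitation above can be checked with the standard library alone. The sketch below (sample source made up for illustration) parses the same shape of function with `ast`, reads the statement positions that are available, and regenerates the code to show what is lost.

```python
import ast

SOURCE = (
    "def example():\n"
    "    x = 1\n"
    "\n"
    "    return x  # trailing comment\n"
)

tree = ast.parse(SOURCE)
assign, ret = tree.body[0].body

# Positions exist for statements: (line, column) pairs.
positions = [(assign.lineno, assign.col_offset), (ret.lineno, ret.col_offset)]

# Regenerating code from the AST loses the blank line and the comment.
regenerated = ast.unparse(tree)
```

The gap between line 2 and line 4 hints that something sits in between, but the tree cannot say whether it is a blank line or a comment.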

Alternative 2: Parso

Use the Parso library (Jedi’s parser) for syntax analysis.

Rejected because:
- Limited metadata: Focused on autocompletion rather than comprehensive analysis
- Scope analysis gaps: Less sophisticated than LibCST’s ScopeProvider
- Transformation complexity: Not designed for code modification workflows
- Documentation limitations: Fewer examples for linting use cases

Alternative 3: Tree-sitter Python

Use Tree-sitter’s Python grammar for syntax analysis.

Rejected because:
- Language binding complexity: Requires C library integration
- Limited Python-specific tooling: Generic parsing without Python semantics
- Scope analysis limitations: Requires significant custom implementation
- Transformation difficulty: Not designed for Python code modification

Alternative 4: Custom Parser

Implement a domain-specific parser for the required analysis.

Rejected because:
- Development complexity: Significant engineering effort for limited benefit
- Maintenance burden: Keeping pace with Python language evolution
- Performance uncertainty: Unclear if custom solution would outperform LibCST
- Missing ecosystem: No existing tooling or community support

Consequences

Positive Consequences:

Complete rule implementation: All four rules can be implemented with full fidelity
Precise error reporting: Line/column coordinates for all violations
Rich analysis capabilities: Built-in scope and qualified name resolution
Future extensibility: Auto-fix capabilities through CSTTransformer
Active ecosystem: Well-maintained with good documentation and examples
Performance optimization: Optimized metadata computation and caching

Negative Consequences:

External dependency: Adds LibCST as a required dependency (~2MB installed)
Learning curve: Developers must learn LibCST APIs and concepts
Memory usage: CST with metadata consumes more memory than basic AST
Python version coupling: LibCST version updates needed for new Python features

Risks and Mitigations:

Risk: LibCST performance doesn’t meet 1000ms target for 1000 lines. Mitigation: Validation confirmed the target is achievable; the measured 600ms leaves a comfortable margin.
Risk: LibCST compatibility issues with future Python versions. Mitigation: Monitor LibCST releases, maintain version compatibility matrix.
Risk: Complex rule implementation due to CST complexity. Mitigation: Create helper utilities, comprehensive examples, developer documentation.
Risk: Memory usage exceeds 100MB limit for large codebases. Mitigation: Implement streaming analysis, selective metadata loading, memory profiling.

Implementation Guidelines:

Metadata strategy: Use all three providers (Position, Scope, QualifiedName) for comprehensive analysis
Performance monitoring: Track analysis time and memory usage per file size
Error handling: Graceful degradation for parse errors and malformed code
Caching considerations: Evaluate metadata caching for improved performance
Testing approach: Validate against diverse Python codebases and edge cases
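
The performance monitoring guideline could be approached along the following lines. This is a sketch using only the standard library: the function name `measure` and the report keys are hypothetical, the two budgets are the figures stated in this ADR, and `ast.parse` is only a stand-in so the example runs without the linter.

```python
import ast
import time
import tracemalloc
from typing import Callable

TIME_BUDGET_MS_PER_1000_LINES = 1000.0   # target stated in this ADR
MEMORY_BUDGET_BYTES = 100 * 1024 * 1024  # 100MB limit stated in this ADR


def measure(analyze: Callable[[str], object], source: str) -> dict:
    """Run `analyze` once and report time per 1000 lines and peak memory."""
    line_count = max(1, source.count("\n"))
    tracemalloc.start()
    started = time.perf_counter()
    analyze(source)
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    ms_per_1000 = elapsed_ms * 1000.0 / line_count
    return {
        "lines": line_count,
        "ms_per_1000_lines": ms_per_1000,
        "peak_bytes": peak_bytes,
        "within_time_budget": ms_per_1000 < TIME_BUDGET_MS_PER_1000_LINES,
        "within_memory_budget": peak_bytes < MEMORY_BUDGET_BYTES,
    }


# Stand-in analyzer so the sketch runs on its own; swap in the real pipeline.
sample = "x = 1\n" * 1000
report = measure(ast.parse, sample)
```

Because tracemalloc adds overhead, timing and memory are probably better measured in separate runs when the numbers are used for benchmarking.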

Technical Dependencies:

Required: libcst >= 1.0.0 for Python 3.10+ support
Optional: typing_extensions for enhanced type annotation support
Testing: Representative Python codebases for validation and benchmarking

Future Extensibility Considerations:

Extended File Support (Phase 5+ enhancement): The chosen LibCST technology supports analysis of complete Python modules, which aligns well with potential future support for embedded Python code in documentation. Future implementation would use an extract-and-wrap pattern:

Extract Python snippets from documentation sources (doctest.DocTestFinder().find(), RST parsers, Markdown parsers)
Wrap extracted code in minimal module scaffolding for LibCST analysis
Process through standard pipeline without requiring changes to rule framework
Map violations back to original documentation file locations

This approach leverages LibCST’s complete-module requirement as an architectural strength rather than limitation, enabling consistent analysis across both production code and documentation examples.
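
The extract and map-back steps can be sketched with the standard library's doctest parser. This is one possible shape, not a committed design: the helper names `extract_snippets` and `to_document_line` and the sample text are hypothetical.

```python
import doctest
from typing import List, Tuple


def extract_snippets(text: str, first_line: int = 1) -> List[Tuple[str, int]]:
    """Return (source, document_line) for each doctest example in `text`.

    `first_line` is the 1-based line of the document where `text` starts.
    """
    examples = doctest.DocTestParser().get_examples(text)
    # Example.lineno is 0-based relative to the start of `text`.
    return [(ex.source, first_line + ex.lineno) for ex in examples]


def to_document_line(snippet_start: int, line_in_snippet: int) -> int:
    """Map a 1-based line inside a snippet back to the document."""
    return snippet_start + line_in_snippet - 1


DOC = "Intro text.\n\n>>> x = 1\n>>> print(x)\n1\n"
snippets = extract_snippets(DOC)
```

Each extracted source string is already a complete module fragment, so it could be passed to the standard pipeline and any reported line translated with `to_document_line`.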