005. Validation Behavior Configuration

Status

Accepted

Context

The v1.x functional approach provides no control over which validation steps run, so callers pay for validation they do not need and cannot adapt error handling to their use case. Analysis of integration patterns showed that validation requirements vary significantly by context:

Performance-Critical Scenarios: Quick charset detection for decoding workflows should skip expensive printable character analysis.

Security-Sensitive Contexts: Comprehensive validation, including trial decoding and character analysis, is required to prevent processing of malicious content.

Batch Processing Workflows: Automated processing and interactive validation call for different validation thresholds.

Current Limitations:

  • All validation logic hardcoded with no runtime configuration

  • No ability to skip expensive validations for performance-critical paths

  • Fixed printable character thresholds that do not suit every content type

  • Trial decoding always performed regardless of use case requirements

Requirements Analysis:

  • Selective Validation: Control which validation steps execute

  • Configurable Thresholds: Adjust validation parameters for different content types

  • Performance Control: Skip expensive operations when not required

  • Default Behavior: Zero-configuration defaults for common use cases

  • Backward Compatibility: Existing behavior preserved as default

Decision

We will implement a Behaviors Configuration Pattern that provides fine-grained control over validation execution through a structured configuration object.

Evolved Configuration Design:

class BehaviorTristate(enum.Enum):
    Never = enum.auto()
    AsNeeded = enum.auto()
    Always = enum.auto()

class Behaviors(immut.Dataclass):
    # Core detection controls
    charset_detect: BehaviorTristate = BehaviorTristate.AsNeeded
    mimetype_detect: BehaviorTristate = BehaviorTristate.AsNeeded

    # Charset handling sophistication
    charset_promotions: Mapping[str, str] = {'ascii': 'utf-8'}
    charset_trial_codecs: Sequence[str | CodecSpecifiers] = (
        CodecSpecifiers.Inference, CodecSpecifiers.UserDefault)
    charset_trial_decode: BehaviorTristate = BehaviorTristate.AsNeeded

BehaviorTristate Control:

  • Never: Skip behavior entirely for maximum performance

  • AsNeeded: Apply behavior based on detection confidence and context (default)

  • Always: Force behavior regardless of confidence or context
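The three states above can be read as a small decision rule. The following is a minimal sketch of that rule, not the project's implementation: the helper name should_apply, the confidence argument, and the 0.8 cutoff are all assumptions introduced for illustration.

```python
import enum


class BehaviorTristate(enum.Enum):
    Never = enum.auto()
    AsNeeded = enum.auto()
    Always = enum.auto()


def should_apply(
    behavior: BehaviorTristate,
    confidence: float,
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the ADR
) -> bool:
    """Decide whether an optional step runs for a given detection confidence."""
    if behavior is BehaviorTristate.Never:
        return False
    if behavior is BehaviorTristate.Always:
        return True
    # AsNeeded: do the extra work only when the earlier result is uncertain.
    return confidence < threshold
```

Under this reading, Never and Always ignore the confidence entirely, and only AsNeeded consults it.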

Advanced Charset Handling:

  • charset_promotions: Mapping for upgrading detected charsets (e.g., ASCII→UTF-8)

  • charset_trial_codecs: Sequence of codecs to try during trial decoding

  • CodecSpecifiers: Enum for dynamic codec resolution (Inference, OsDefault, UserDefault)

Sophisticated Detection Control:

  • charset_detect: Controls when charset detection from content occurs

  • mimetype_detect: Controls when MIME type detection from content occurs

  • charset_trial_decode: Controls when trial decoding validation occurs

Integration Pattern:

def detect_mimetype_charset(
    content: Content,
    location: Absential[Location] = absent, *,
    behaviors: Absential[Behaviors] = absent,
    # ... other parameters
) -> tuple[Absential[str], Absential[str]]:

Default Behavior Design:

BEHAVIORS_DEFAULT = Behaviors(
    trial_decode='as-needed',
    validate_printable='as-needed',
    printable_threshold=0.0,
    assume_utf8_superset=True,
)

Alternatives

Individual Boolean Parameters

  • Benefits: Simple parameter interface, clear enable/disable semantics

  • Drawbacks: Parameter proliferation, no structured configuration

  • Rejection Reason: Leads to unwieldy function signatures as validation options grow

Global Configuration Object

  • Benefits: One-time configuration affects all function calls

  • Drawbacks: Global state, less flexible per-call control, testing complexity

  • Rejection Reason: Global state conflicts with functional approach

Validation Profile Enums

  • Benefits: Simple selection between predefined validation sets

  • Drawbacks: Limited flexibility, configuration coupling

  • Rejection Reason: Insufficient granularity for diverse use case requirements

Builder Pattern Configuration

  • Benefits: Fluent interface, incremental configuration building

  • Drawbacks: Over-engineering for configuration object, additional complexity

  • Rejection Reason: Functional configuration object simpler and more maintainable
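As an illustration of why a functional configuration object can stand in for a builder, the sketch below derives a variant from the defaults without mutating them. It uses a frozen standard-library dataclass as a stand-in for immut.Dataclass, which is an assumption; only the three tristate fields from the design above are included.

```python
import dataclasses
import enum


class BehaviorTristate(enum.Enum):
    Never = enum.auto()
    AsNeeded = enum.auto()
    Always = enum.auto()


# Stand-in for the immutable Behaviors class; frozen=True approximates
# the immutability that immut.Dataclass is assumed to provide.
@dataclasses.dataclass(frozen=True)
class Behaviors:
    charset_detect: BehaviorTristate = BehaviorTristate.AsNeeded
    mimetype_detect: BehaviorTristate = BehaviorTristate.AsNeeded
    charset_trial_decode: BehaviorTristate = BehaviorTristate.AsNeeded


defaults = Behaviors()

# Derive a stricter profile; the defaults object is left untouched.
strict = dataclasses.replace(
    defaults, charset_trial_decode=BehaviorTristate.Always)
```

A single replace call covers the incremental-configuration use case that a builder would otherwise serve, and each profile remains an ordinary immutable value that can be passed per call.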

Consequences

Positive Consequences

  • Performance Control: Skip expensive validations for performance-critical workflows

  • Use Case Flexibility: Appropriate validation for security, performance, or accuracy requirements

  • Threshold Configurability: Adjust validation parameters for different content types

  • Default Behavior: Zero-configuration operation for common use cases

  • Structured Configuration: Clear configuration object with documented semantics

Negative Consequences

  • Configuration Complexity: Additional parameter and configuration object increase cognitive load

  • Testing Matrix: Behavior combinations create large test space requiring systematic coverage

  • Documentation Overhead: Configuration options require comprehensive documentation and examples

  • Implementation Complexity: Conditional validation logic increases internal implementation complexity
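To give a sense of the size of the test space noted above, the following sketch enumerates every combination of the three tristate fields from the design. The helper name behavior_combinations is hypothetical, and non-tristate options (promotions, trial codecs) are deliberately left out, so the real space is larger.

```python
import enum
import itertools
from collections.abc import Iterator


class BehaviorTristate(enum.Enum):
    Never = enum.auto()
    AsNeeded = enum.auto()
    Always = enum.auto()


# Tristate fields named in the Behaviors design.
TRISTATE_FIELDS = ('charset_detect', 'mimetype_detect', 'charset_trial_decode')


def behavior_combinations() -> Iterator[dict[str, BehaviorTristate]]:
    """Yield one keyword-argument mapping per combination of tristate values."""
    for values in itertools.product(
        BehaviorTristate, repeat=len(TRISTATE_FIELDS)
    ):
        yield dict(zip(TRISTATE_FIELDS, values))


combinations = list(behavior_combinations())
```

Three fields with three states each give 3 ** 3 = 27 combinations, which is small enough to feed directly into a parametrized test.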

Neutral Consequences

  • Migration Strategy: Existing code continues working with default behaviors

  • Future Extensibility: Configuration pattern provides foundation for additional validation options

  • Performance Characteristics: Behavior selection affects performance profiles predictably

Implementation Guidance

Performance-Optimized Configuration:

# Quick charset detection for decoding
fast_behaviors = Behaviors(
    trial_decode='never',
    validate_printable='never',
)

Security-Focused Configuration:

# Comprehensive validation for untrusted content
secure_behaviors = Behaviors(
    trial_decode='always',
    validate_printable='always',
    printable_threshold=0.05,  # Allow 5% non-printable
)

Content-Specific Configuration:

# Relaxed validation for code/data content
code_behaviors = Behaviors(
    printable_threshold=0.15,  # Allow more control characters
    validate_printable='as-needed',
)

Conditional Logic Implementation:

The internal implementation will evaluate the behaviors configuration to decide which validation steps to execute, so each configuration profile pays only for the steps it enables.
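One way to picture this conditional logic is as a planning step that turns a configuration into the list of steps to run. The sketch below is an assumption-laden illustration: plan_steps, the confidence argument, and the 0.8 threshold are hypothetical, and a frozen standard-library dataclass stands in for the real Behaviors class.

```python
import dataclasses
import enum


class BehaviorTristate(enum.Enum):
    Never = enum.auto()
    AsNeeded = enum.auto()
    Always = enum.auto()


# Stand-in for Behaviors, limited to its tristate fields.
@dataclasses.dataclass(frozen=True)
class Behaviors:
    charset_detect: BehaviorTristate = BehaviorTristate.AsNeeded
    mimetype_detect: BehaviorTristate = BehaviorTristate.AsNeeded
    charset_trial_decode: BehaviorTristate = BehaviorTristate.AsNeeded


CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff for AsNeeded


def plan_steps(behaviors: Behaviors, confidence: float) -> list[str]:
    """Return the names of the steps that would execute, in field order.

    `confidence` is the caller's confidence in what is already known
    about the content (an assumption made for this sketch).
    """
    plan: list[str] = []
    for field in dataclasses.fields(behaviors):
        mode = getattr(behaviors, field.name)
        if mode is BehaviorTristate.Always:
            plan.append(field.name)
        elif (
            mode is BehaviorTristate.AsNeeded
            and confidence < CONFIDENCE_THRESHOLD
        ):
            plan.append(field.name)
        # BehaviorTristate.Never: the step is skipped and costs nothing.
    return plan
```

Separating planning from execution keeps the skipped steps out of the hot path and makes the chosen plan easy to assert on in tests.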

Integration with Error Class Provider:

The behaviors configuration works in conjunction with the error class provider pattern to give complete control over validation execution and error handling:

result = detect_mimetype_charset(
    content, location,
    behaviors=secure_behaviors,
    error_class_provider=security_error_mapper,
)

This decision lays the foundation for performance-aware, context-sensitive validation. It removes the rigid validation behavior of the v1.x functional approach while preserving backward compatibility through sensible defaults.