Product Requirements Document

Executive Summary

The detextive library provides consolidated text detection and processing capabilities to replace duplicated MIME type detection, charset detection, and newline processing across multiple Python packages. It serves as a drop-in replacement that standardizes textual content analysis with consistent APIs and improved reliability.

Problem Statement

Multiple Python packages in the project ecosystem contain duplicated implementations of text detection functionality:

  • python-mimeogram: MIME type and charset detection in acquirers.py and parts.py

  • python-librovore: Textual MIME type validation in cacheproxy.py

  • ai-experiments: Charset detection and MIME type validation in utilities.py

This duplication creates maintenance overhead, produces inconsistent behavior, and increases the likelihood of bugs. Each implementation has evolved separately, with different edge case handling and detection heuristics.

Goals and Objectives

Primary Objectives:

  • Consolidate text detection functionality into a single, well-tested library

  • Provide drop-in replacement APIs that minimize migration effort

  • Improve detection accuracy and consistency across all dependent packages

Secondary Objectives:

  • Reduce maintenance overhead by eliminating code duplication

  • Establish standardized text processing patterns for future projects

  • Enable easier testing and validation of text detection logic

Success Metrics:

  • All dependent packages successfully migrate with minimal code changes

  • Detection accuracy matches or exceeds existing implementations

  • Library passes comprehensive test suite covering edge cases

Target Users

Primary Users:

  • Internal Developers: Team members working on mimeogram, librovore, and ai-experiments packages

  • Package Maintainers: Developers responsible for library maintenance and updates

Usage Context:

  • Integration as a dependency in existing Python packages

  • Programmatic text analysis and content processing workflows

  • File and web content processing pipelines

Functional Requirements

REQ-001: MIME Type Detection API (Critical)

As a developer, I want to detect MIME types from byte content so that I can determine appropriate content handling strategies.

Acceptance Criteria:

  • Detect MIME types using content-based analysis (magic bytes)

  • Fall back to file extension-based detection when content detection fails

  • Support both file paths and raw byte content as input

  • Return standardized MIME type strings (e.g., “text/plain”, “application/json”)
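The content-first, extension-second ordering described in REQ-001 can be sketched as follows. This is only an illustration under stated assumptions: the function name detect_mimetype_sketch, its parameters, and the tiny signature table are hypothetical, and the table stands in for a real content-detection library such as puremagic, which this sketch does not call.

```python
import mimetypes
from typing import Optional

# Illustrative signature table only (assumption). A real implementation
# would delegate content-based detection to a dedicated library.
_SIGNATURES = (
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'%PDF-', 'application/pdf'),
    (b'GIF87a', 'image/gif'),
    (b'GIF89a', 'image/gif'),
)


def detect_mimetype_sketch(
    content: bytes, location: Optional[str] = None
) -> Optional[str]:
    """Returns a MIME type from magic bytes, else from the file extension."""
    # Step 1: content-based detection via leading magic bytes.
    for magic, mimetype in _SIGNATURES:
        if content.startswith(magic):
            return mimetype
    # Step 2: fall back to the extension of the (hypothetical) location.
    if location is not None:
        mimetype, _encoding = mimetypes.guess_type(location)
        return mimetype
    # Nothing matched; a real API might raise an exception here instead.
    return None
```

Content evidence wins over the extension in this sketch, so a PDF saved with a misleading ".txt" name is still reported as "application/pdf".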

REQ-002: Charset Detection API (Critical)

As a developer, I want to detect character encoding from byte content so that I can decode text properly without encoding errors.

Acceptance Criteria:

  • Auto-detect character encoding using statistical analysis

  • Prefer UTF-8 when content is pure ASCII and is therefore valid as either ASCII or UTF-8

  • Validate detected encodings by attempting decode operations

  • Return encoding names compatible with Python’s codec system
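One possible shape for the REQ-002 flow (guess, apply the UTF-8 preference, validate by trial decode) is sketched below. The function name, the injected guess callable, and the choice to return None on failure are all assumptions made for illustration; the statistical detector is passed in so that the sketch needs only the standard library.

```python
import codecs
from typing import Callable, Optional

# Hypothetical type for a pluggable statistical detector (assumption).
CharsetGuesser = Callable[[bytes], Optional[str]]


def detect_charset_sketch(
    content: bytes, guess: CharsetGuesser, sample_size: int = 1024
) -> Optional[str]:
    """Returns a Python codec name for the content, or None if undetected."""
    # Only a bounded sample is handed to the statistical detector.
    name = guess(content[:sample_size])
    if name is None:
        return None
    try:
        # Normalize to a name that Python's codec registry recognizes.
        codec_name = codecs.lookup(name).name
    except LookupError:
        return None
    # UTF-8 preference: pure ASCII is also valid UTF-8.
    if codec_name == 'ascii':
        codec_name = 'utf-8'
    # Validate the candidate by attempting a decode of the content.
    try:
        content.decode(codec_name)
    except UnicodeDecodeError:
        return None
    return codec_name
```

With chardet installed, the guess argument could be something like `lambda sample: chardet.detect(sample)['encoding']`. Whether the eventual API returns None or raises an exception on failure is left open here.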

REQ-003: Line Separator Processing (Critical)

As a developer, I want to detect and normalize line separators so that I can process text consistently across different platforms.

Acceptance Criteria:

  • Detect line separator types (CR, LF, CRLF) from byte or text content

  • Normalize line endings to Unix LF format

  • Convert line endings to platform-specific formats when needed

  • Handle mixed line ending scenarios gracefully
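The detection and normalization behavior in REQ-003 might look roughly like the sketch below. The names LineSeparator, detect_separator, normalize_to_lf, and to_platform are hypothetical, and resolving mixed endings by picking the most frequent separator is an assumption, not something the requirements prescribe.

```python
import os
from enum import Enum
from typing import Optional


class LineSeparator(Enum):
    """The three separator types named in the requirement."""
    CR = '\r'
    LF = '\n'
    CRLF = '\r\n'


def detect_separator(text: str) -> Optional[LineSeparator]:
    """Returns the most frequent separator, or None if there is none."""
    crlf = text.count('\r\n')
    # Subtract CRLF pairs so lone CR and lone LF are counted separately.
    counts = {
        LineSeparator.CRLF: crlf,
        LineSeparator.LF: text.count('\n') - crlf,
        LineSeparator.CR: text.count('\r') - crlf,
    }
    # On a tie, max() keeps the first key in insertion order (CRLF, LF, CR).
    best = max(counts, key=counts.get)
    return best if counts[best] else None


def normalize_to_lf(text: str) -> str:
    """Converts CRLF and lone CR to LF; CRLF is replaced first."""
    return text.replace('\r\n', '\n').replace('\r', '\n')


def to_platform(text: str, separator: str = os.linesep) -> str:
    """Normalizes first, then applies the requested separator."""
    return normalize_to_lf(text).replace('\n', separator)
```

Normalizing before converting keeps mixed input from producing doubled separators such as "\r\r\n".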

REQ-004: Textual Content Validation (High)

As a developer, I want to determine if content represents meaningful text so that I can avoid processing binary data as text.

Acceptance Criteria:

  • Classify MIME types as textual or non-textual

  • Support extensible patterns for textual MIME type detection

  • Validate decoded text content using heuristics (control character ratios, printable character ratios)

  • Handle edge cases like empty content and single-character repetition
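A minimal sketch of the two checks in REQ-004 follows. Every name, pattern, and threshold here is an assumption chosen for illustration (including the 10% control-character ceiling, the 80% printable floor, and the decision to reject empty or single-character-repeated content); the actual heuristics would need to be derived from the existing implementations.

```python
import re
import unicodedata
from typing import Iterable, Pattern

# Illustrative default patterns (assumption); callers may extend them.
_TEXTUAL_PATTERNS = (
    re.compile(r'^text/'),
    re.compile(r'^application/(json|xml|javascript)$'),
    re.compile(r'\+(json|xml)$'),
)
_ALLOWED_WHITESPACE = frozenset('\t\n\r')


def is_textual_mimetype(
    mimetype: str, extra_patterns: Iterable[Pattern[str]] = ()
) -> bool:
    """Checks the MIME type against default plus caller-supplied patterns."""
    patterns = (*_TEXTUAL_PATTERNS, *extra_patterns)
    return any(pattern.search(mimetype) for pattern in patterns)


def is_reasonable_text(
    text: str,
    max_control_ratio: float = 0.1,    # hypothetical threshold
    min_printable_ratio: float = 0.8,  # hypothetical threshold
) -> bool:
    """Applies ratio heuristics to decoded text."""
    if not text:
        return False  # assumption: empty content is not meaningful text
    if len(text) >= 8 and len(set(text)) == 1:
        return False  # assumption: long single-character runs are rejected
    controls = sum(
        1 for char in text
        if unicodedata.category(char) == 'Cc'
        and char not in _ALLOWED_WHITESPACE
    )
    printables = sum(
        1 for char in text
        if char.isprintable() or char in _ALLOWED_WHITESPACE
    )
    total = len(text)
    return (
        controls / total <= max_control_ratio
        and printables / total >= min_printable_ratio
    )
```

Passing extra patterns, for example `[re.compile(r'yaml$')]`, is one way the "extensible patterns" criterion could be met without modifying the defaults.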

REQ-005: Drop-in Replacement Interface (High)

As a developer migrating existing code, I want compatible APIs so that I can replace existing functions with minimal code changes.

Acceptance Criteria:

  • Maintain similar function signatures to existing implementations

  • Support the same input/output data types where possible

  • Preserve existing behavior for common use cases

  • Provide clear migration documentation for API differences

Non-Functional Requirements

Performance Requirements:

  • MIME type detection should complete within 100ms for files up to 1MB

  • Charset detection should analyze a sufficient content sample (default 1KB) for accuracy

  • Memory usage should remain proportional to sample size, not full file size
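The memory requirement above can be met by reading only a bounded prefix of the input rather than the whole file. The helper names and the DEFAULT_SAMPLE_SIZE constant in this sketch are hypothetical; the 1024-byte default mirrors the 1KB figure stated above.

```python
from typing import BinaryIO

DEFAULT_SAMPLE_SIZE = 1024  # mirrors the 1KB default sample (assumption)


def read_sample(
    stream: BinaryIO, sample_size: int = DEFAULT_SAMPLE_SIZE
) -> bytes:
    """Reads at most sample_size bytes from an open binary stream."""
    return stream.read(sample_size)


def read_sample_from_file(
    path: str, sample_size: int = DEFAULT_SAMPLE_SIZE
) -> bytes:
    """Opens the file in binary mode and reads only the leading sample."""
    with open(path, 'rb') as stream:
        return read_sample(stream, sample_size)
```

Because only the prefix is ever loaded, the memory held for detection stays bounded by the sample size regardless of how large the file is.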

Reliability Requirements:

  • Library should handle malformed or unusual content without crashing

  • Error conditions should be clearly communicated through appropriate exceptions

  • Detection accuracy should be >= 95% for common text formats

Compatibility Requirements:

  • Support Python 3.8+ (matching existing package requirements)

  • Compatible with existing dependency versions in target packages

  • Platform-independent operation (Windows, macOS, Linux)

Constraints and Assumptions

Technical Constraints:

  • Must integrate with existing package dependency management

  • Limited to detection libraries already used in the ecosystem (chardet, puremagic)

  • Cannot introduce breaking changes to existing public APIs during migration

Dependencies:

  • Migration requires coordination across multiple package maintainers

  • Success depends on comprehensive test coverage of existing behavior

  • Requires validation against real-world content from existing use cases

Assumptions:

  • Existing packages can accept a new library dependency

  • Current detection logic represents desired behavior (not bugs to be fixed)

  • UTF-8 bias aligns with project content expectations

Out of Scope

  • Content conversion or transformation beyond line ending normalization

  • Support for legacy or exotic character encodings beyond what chardet provides

  • MIME type validation or correction (library reports detected types as-is)

  • Performance optimization for very large files (> 100MB)

  • Integration with external content detection services or APIs