001. Python API Design Specification

Overview

This document specifies the Python API design for the detextive library's initial feature set, which faithfully reproduces existing text detection capabilities from the mimeogram, cache proxy, and ai-experiments packages.

The design prioritizes behavioral fidelity and minimal migration effort while following established project practices for interface contracts, module organization, and naming conventions.

Public Interface Specification

Core Detection Functions

Character Encoding Detection

def detect_charset( content: bytes ) -> __.typx.Optional[ str ]:
    ''' Detects character encoding with UTF-8 preference and validation.

        Returns None if no reliable encoding can be determined.
    '''
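
A brief usage sketch, assuming the function is re-exported from the package root as specified under Module Organization Design:

import detextive

# Valid UTF-8 byte sequences are expected to yield 'utf-8' under the
# UTF-8 preference rule; content with no reliable encoding yields None.
charset = detextive.detect_charset( 'café au lait'.encode( 'utf-8' ) )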

MIME Type Detection

def detect_mimetype(
    content: bytes,
    location: __.cabc.Sequence[ str ] | __.Path | str
) -> __.typx.Optional[ str ]:
    ''' Detects MIME type using content analysis and extension fallback.

        Returns standardized MIME type strings or None if detection fails.
    '''

Combined Detection with Parameter Overrides

def detect_mimetype_and_charset(
    content: bytes,
    location: __.cabc.Sequence[ str ] | __.Path | str, *,
    mimetype: __.Absential[ str ] = __.absent,
    charset: __.Absential[ str ] = __.absent,
) -> tuple[ str, __.typx.Optional[ str ] ]:
    ''' Detects MIME type and charset with optional parameter overrides.

        Returns tuple of (mimetype, charset). MIME type defaults to
        'text/plain' if charset detected but MIME type unknown, or
        'application/octet-stream' if neither detected.
    '''
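
A usage sketch, assuming package-root re-exports; the expected behavior follows from the default rules stated above:

from pathlib import Path
import detextive

# Both components detected from content plus location context.
mimetype, charset = detextive.detect_mimetype_and_charset(
    b'{ "name": "detextive" }', Path( 'example.json' ) )
# An explicit keyword override bypasses detection for that component only.
mimetype, charset = detextive.detect_mimetype_and_charset(
    b'{ "name": "detextive" }', 'example.json', mimetype = 'application/json' )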

Textual Content Validation

def is_textual_mimetype( mimetype: str ) -> bool:
    ''' Validates if MIME type represents textual content.

        Consolidates textual MIME type patterns from all source
        implementations. Supports text/* prefix, specific application
        types (JSON, XML, JavaScript, etc.), and textual suffixes
        (+xml, +json, +yaml, +toml).

        Returns True for MIME types representing textual content.
    '''

def is_textual_content( content: bytes ) -> bool:
    ''' Determines if byte content represents textual data.

        Returns True for content that can be reliably processed as text.
    '''
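
Usage sketches for both validators, assuming package-root re-exports; the expected results follow from the pattern rules above:

import detextive

detextive.is_textual_mimetype( 'text/markdown' )        # True: text/* prefix
detextive.is_textual_mimetype( 'application/ld+json' )  # True: listed type, +json suffix
detextive.is_textual_mimetype( 'image/png' )            # False: binary image type
detextive.is_textual_content( b'plain ASCII text\n' )   # True: decodes reliably as text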

Line Separator Processing

LineSeparators Enum

class LineSeparators( __.enum.Enum ):
    ''' Line separators for cross-platform text processing. '''

    CR = '\r'     # Classic MacOS (0xD)
    CRLF = '\r\n' # DOS/Windows (0xD 0xA)
    LF = '\n'     # Unix/Linux (0xA)

    @classmethod
    def detect_bytes(
        selfclass,
        content: __.cabc.Sequence[ int ] | bytes,
        limit: int = 1024
    ) -> __.typx.Optional[ 'LineSeparators' ]:
        ''' Detects line separator from byte content sample.

            Returns detected LineSeparators enum member or None.
        '''

    @classmethod
    def normalize_universal( selfclass, content: str ) -> str:
        ''' Normalizes all line separators to Unix LF format. '''

    def normalize( self, content: str ) -> str:
        ''' Normalizes specific line separator to Unix LF format. '''

    def nativize( self, content: str ) -> str:
        ''' Converts Unix LF to this line separator. '''
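
A usage sketch of the enum round-trip, assuming LineSeparators is re-exported from the package root:

from detextive import LineSeparators

separator = LineSeparators.detect_bytes( b'first\r\nsecond\r\n' )  # LineSeparators.CRLF
unixified = separator.normalize( 'first\r\nsecond\r\n' )           # 'first\nsecond\n'
anyform = LineSeparators.normalize_universal( 'a\rb\r\nc\n' )      # 'a\nb\nc\n'
windowsed = LineSeparators.CRLF.nativize( unixified )              # 'first\r\nsecond\r\n'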

Interface Contract Principles

Wide Parameters, Narrow Returns

Parameter Design:
- Accept abstract base classes for maximum flexibility
- Support multiple input formats (bytes, Path, str, Sequence[str])
- Use Union types for naturally variable inputs

Return Design:
- Return concrete, immutable types (str, tuple, enum members)
- Prefer specific types over generic containers
- Use None for explicit “not detected” semantics

Examples:

# Wide parameters: accept any sequence-like or path-like input
location: __.cabc.Sequence[ str ] | __.Path | str
content: __.cabc.Sequence[ int ] | bytes

# Narrow returns: specific immutable types
-> __.typx.Optional[ str ]                        # Explicit None for "not detected"
-> tuple[ str, __.typx.Optional[ str ] ]          # Immutable tuple with concrete types
-> __.typx.Optional[ LineSeparators ]             # Specific enum member

Type Annotation Patterns

Function Signatures:

# Use Annotated for documented parameter types
Content: __.typx.TypeAlias = __.typx.Annotated[
    bytes,
    __.ddoc.Doc( "Raw byte content for analysis." )
]

Location: __.typx.TypeAlias = __.typx.Annotated[
    __.typx.Union[ str, __.Path, __.cabc.Sequence[ str ] ],
    __.ddoc.Doc( "File path, URL, or path components for context." )
]

# Comprehensive annotations with Absential pattern
def detect_mimetype_and_charset(
    content: Content,
    location: Location, *,
    mimetype: __.Absential[ str ] = __.absent,
    charset: __.Absential[ str ] = __.absent,
) -> tuple[ str, __.typx.Optional[ str ] ]:

Absential Pattern Usage:
- Distinguish “not provided” (absent) from “explicitly None”
- Enable three-state parameters: absent | None | value
- Preserve complex parameter handling from mimeogram (see the sketch below)
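
A minimal sketch of three-state handling inside the combined detector, assuming an is_absent predicate is re-exported through the __ subpackage alongside Absential and absent; _decide_mimetype is a hypothetical helper, not part of the public API:

def _decide_mimetype(
    mimetype: __.Absential[ str ],
    detected: __.typx.Optional[ str ],
    charset: __.typx.Optional[ str ],
) -> str:
    ''' Chooses MIME type from override, detection result, or defaults. '''
    # An explicit caller override wins, even if detection would disagree.
    if not __.is_absent( mimetype ): return mimetype
    if detected is not None: return detected
    # Fallback defaults per the combined detection contract.
    if charset is not None: return 'text/plain'
    return 'application/octet-stream'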

Module Organization Design

Package Structure

sources/detextive/
├── __/
│   ├── __init__.py          # Re-exports: cabc, typx, enum, Absential
│   ├── imports.py           # chardet, puremagic, mimetypes
│   └── nomina.py            # Project-specific constants
├── __init__.py              # Public API re-exports from implementation modules
├── py.typed                 # Type checking marker
├── detection.py             # Core detection function implementations
├── exceptions.py            # Package exception hierarchy
└── lineseparators.py        # LineSeparators enum and utilities

Module Responsibilities:

`__init__.py` (Main Module):
- Re-exports public API from implementation modules
- Module organization: imports → re-exports

`detection.py`:
- Core detection function implementations: detect_charset, detect_mimetype, detect_mimetype_and_charset
- Textual content validation: is_textual_mimetype, is_textual_content
- Private heuristic functions: _is_probable_textual_content (used internally by validation logic)
- Consolidates detection logic from all source implementations

`lineseparators.py`:
- LineSeparators enum class with all methods
- Direct migration preserving existing byte-level detection logic
- Cross-platform line ending handling utilities

`exceptions.py`:
- Package exception hierarchy: Omniexception → Omnierror → specific exceptions
- Detection-specific exceptions following nomenclature patterns

Additional Dependencies:

The implementation will require the external libraries chardet and puremagic, the standard-library mimetypes module, and dynadoc for parameter documentation annotations.
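
A sketch of the corresponding re-export modules; the provider packages assumed here (absence for Absential, typing_extensions for typx, dynadoc aliased as ddoc) are illustrative, not prescriptive:

# sources/detextive/__/imports.py (sketch)
import chardet
import mimetypes
import puremagic

# sources/detextive/__/__init__.py (sketch)
import collections.abc as cabc
import enum
from pathlib import Path

import dynadoc as ddoc
import typing_extensions as typx
from absence import Absential, absent, is_absent

from .imports import chardet, mimetypes, puremagic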

Private Constants Organization:

# Textual MIME type patterns (consolidated from all sources)
_TEXTUAL_MIME_TYPES = frozenset( (
    'application/json',
    'application/xml',
    'application/javascript',
    'application/ecmascript',
    'application/graphql',          # From ai-experiments
    'application/ld+json',          # From cache proxy
    'application/x-httpd-php',      # From ai-experiments
    'application/x-latex',          # From ai-experiments
    'application/x-perl',           # From mimeogram
    'application/x-python',         # From mimeogram
    'application/x-ruby',           # From mimeogram
    'application/x-shell',          # From mimeogram
    'application/x-tex',            # From ai-experiments
    'application/x-yaml',           # From cache proxy
    'application/yaml',             # From cache proxy
    'image/svg+xml',
) )

_TEXTUAL_SUFFIXES = ( '+xml', '+json', '+yaml', '+toml' )
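
One possible shape for the validator consulting these constants (the exact check order is an assumption, not the actual implementation):

def is_textual_mimetype( mimetype: str ) -> bool:
    ''' Validates if MIME type represents textual content. '''
    # The text/* prefix covers the common cases without table lookups.
    if mimetype.startswith( 'text/' ): return True
    # Explicit application types consolidated from the source packages.
    if mimetype in _TEXTUAL_MIME_TYPES: return True
    # Structured-syntax suffixes mark otherwise opaque types as textual.
    return mimetype.endswith( _TEXTUAL_SUFFIXES )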

Exception Hierarchy Design

Following Omniexception → Omnierror Pattern

class Omniexception( __.immut.Object, BaseException ):
    ''' Base for all exceptions raised by detextive package. '''

class Omnierror( Omniexception, Exception ):
    ''' Base for error exceptions raised by detextive package. '''

# Specific exceptions following nomenclature patterns
class CharsetDetectFailure( Omnierror, RuntimeError ):
    ''' Raised when character encoding detection fails. '''

class ContentDecodeFailure( Omnierror, UnicodeError ):
    ''' Raised when content cannot be decoded with detected charset. '''

class TextualMimetypeInvalidity( Omnierror, ValueError ):
    ''' Raised when MIME type is invalid for textual content processing. '''
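
A usage sketch; the specific conditions under which each exception is raised are left to the implementation, so the handler below is illustrative only:

import detextive

content, location = b'{ "ok": true }', 'data.json'
try:
    mimetype, charset = detextive.detect_mimetype_and_charset( content, location )
except detextive.Omnierror:
    # One handler covers every package error condition without also
    # intercepting unrelated exceptions from other libraries.
    mimetype, charset = 'application/octet-stream', None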

Implementation Considerations

Behavioral Fidelity Requirements

UTF-8 Bias Logic (see the sketch below):
- Prefer UTF-8 for ASCII-compatible content
- Validate detected charsets through trial decoding
- Return 'utf-8' when content detected as a non-UTF-8 charset nonetheless decodes successfully as UTF-8
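
A sketch of the bias logic under these requirements, using only documented chardet calls; not the actual implementation:

import chardet

def _detect_charset_sketch( content: bytes ) -> str | None:
    ''' Illustrative only: UTF-8 preference with trial-decode validation. '''
    # Accept UTF-8 outright whenever the bytes decode cleanly.
    try:
        content.decode( 'utf-8' )
        return 'utf-8'
    except UnicodeDecodeError: pass
    # Otherwise fall back to statistical detection, then validate the
    # guess with a trial decode before trusting it.
    guess = chardet.detect( content ).get( 'encoding' )
    if guess is None: return None
    try: content.decode( guess )
    except ( UnicodeDecodeError, LookupError ): return None
    return guess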

MIME Type Fallback Chain (see the sketch below):
- Primary: puremagic content-based detection
- Fallback: mimetypes extension-based detection
- Default: 'text/plain' if charset detected, 'application/octet-stream' otherwise
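
A sketch of the fallback chain; puremagic.from_string and mimetypes.guess_type are existing calls, but the surrounding structure is an assumption:

import mimetypes
import puremagic

def _detect_mimetype_sketch( content: bytes, location: str ) -> str | None:
    ''' Illustrative only: content sniffing first, extension second. '''
    # Primary: puremagic matches magic numbers inside the byte content.
    try: return puremagic.from_string( content, mime = True )
    except ( puremagic.PureError, ValueError ): pass  # no match or content too small
    # Fallback: mimetypes maps the location's extension to a MIME type.
    mimetype, _ = mimetypes.guess_type( location )
    return mimetype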

Parameter Validation:
- Preserve complex logic from detect_mimetype_and_charset
- Apply textual MIME type validation with trial decoding
- Handle mixed parameter states using Absential pattern

Performance Characteristics:
- Sample-based line separator detection (default 1KB limit) for performance on large files (see the sketch below)
- Lazy evaluation of detection operations
- Minimal abstraction to preserve existing performance
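
A sketch of sample-limited separator detection, returning raw separator strings rather than enum members for brevity; the detection order is an assumption:

def _detect_separator_sketch( content: bytes, limit: int = 1024 ) -> str | None:
    ''' Illustrative only: inspects at most `limit` bytes of the content. '''
    sample = content[ : limit ]
    # CRLF must be checked before bare CR, since CRLF contains CR.
    if b'\r\n' in sample: return '\r\n'
    if b'\r' in sample: return '\r'
    if b'\n' in sample: return '\n'
    return None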