Advanced Configuration

This section demonstrates advanced usage: custom behaviors, confidence thresholds, HTTP Content-Type parsing, location-based inference, text validation profiles, and error handling.

Custom Behaviors

Confidence Thresholds

Control detection confidence requirements through custom behaviors:

>>> import detextive
>>> from detextive import Behaviors

Create a custom behavior configuration with confidence-related parameters:

>>> strict_behaviors = Behaviors(
...     bytes_quantity_confidence_divisor = 512,
...     trial_decode_confidence = 0.9 )
>>> content = b'Hello, world!' * 50

Use custom behaviors for detection:

>>> result = detextive.detect_charset_confidence(
...     content,
...     behaviors = strict_behaviors )
>>> result.confidence > 0.8
True
>>> result.charset
'utf-8'
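
To build intuition for how a byte-quantity divisor could influence confidence, the sketch below scales confidence with content length and caps it at 1.0. The formula and the function name ``scaled_confidence`` are assumptions made for illustration only; the library's actual scoring may differ.

```python
def scaled_confidence( content: bytes, divisor: int ) -> float:
    # Hypothetical scoring rule (an assumption, not the library's code):
    # more bytes means more evidence, capped at full confidence.
    if divisor <= 0:
        raise ValueError( "divisor must be positive" )
    return min( 1.0, len( content ) / divisor )

content = b'Hello, world!' * 50             # 650 bytes
print( scaled_confidence( content, 512 ) )  # 1.0
print( scaled_confidence( b'Hello', 512 ) ) # 0.009765625
```

Under this assumed rule, a larger divisor demands more bytes before a detection can reach full confidence.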

Trial Decode Configuration

Configure how trial decoding validates detected charsets:

>>> from detextive import BehaviorTristate

Setting trial_decode to BehaviorTristate.Always performs a trial decode on every detection. The bytes_quantity_confidence_divisor parameter affects confidence scoring:

>>> validation_behaviors = Behaviors(
...     trial_decode = BehaviorTristate.Always,
...     bytes_quantity_confidence_divisor = 256 )
>>> content = b'Content to validate through decoding'

Detect charset with validation through trial decoding:

>>> charset = detextive.detect_charset(
...     content,
...     behaviors = validation_behaviors )
>>> charset
'utf-8'
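
The idea behind trial decoding can be sketched with the standard library alone: attempt to decode the bytes with each candidate charset and keep the first one that succeeds. The helper name ``first_decodable_charset`` is hypothetical, and this is not the library's implementation.

```python
from typing import Optional, Sequence

def first_decodable_charset(
    content: bytes, candidates: Sequence[ str ]
) -> Optional[ str ]:
    # Return the first candidate that decodes the content without error,
    # or None if every candidate fails (or names an unknown codec).
    for charset in candidates:
        try:
            content.decode( charset )
        except ( UnicodeDecodeError, LookupError ):
            continue
        return charset
    return None

print( first_decodable_charset( b'caf\xc3\xa9', ( 'ascii', 'utf-8' ) ) )
```

Single-byte codecs such as latin-1 accept any byte sequence, so a successful trial decode alone does not prove that a charset is correct. This is one reason to combine trial decoding with confidence scoring.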

HTTP Content-Type Parsing

Content-Type Header Processing

Parse HTTP Content-Type headers to extract MIME type and charset:

>>> content_type = "application/json; charset=utf-8"
>>> mimetype, charset = detextive.parse_http_content_type( content_type )
>>> mimetype
'application/json'
>>> charset
'utf-8'

Content-Type headers without a charset parameter yield an absence sentinel for the charset:

>>> mimetype, charset = detextive.parse_http_content_type( "application/json" )
>>> mimetype
'application/json'
>>> type( charset ).__name__
'AbsentSingleton'
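
For comparison, a rough standard-library equivalent of this parsing is sketched below using ``email.message.Message``. The helper name ``split_content_type`` is hypothetical. Unlike the function shown above, this sketch reports a missing charset as ``None`` rather than as an absence sentinel.

```python
from email.message import Message

def split_content_type( header: str ):
    # Stdlib-only approximation: let the email package parse the
    # header value into a MIME type and an optional charset parameter.
    message = Message( )
    message[ 'Content-Type' ] = header
    return message.get_content_type( ), message.get_content_charset( )

print( split_content_type( "application/json; charset=utf-8" ) )
print( split_content_type( "application/json" ) )
```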

Integration with Detection

Use parsed Content-Type information to guide detection:

>>> content = b'{"message": "Hello"}'
>>> http_header = "application/json; charset=utf-8"

Let HTTP header inform detection:

>>> mimetype, charset = detextive.infer_mimetype_charset(
...     content,
...     http_content_type = http_header )
>>> mimetype
'application/json'
>>> charset
'utf-8'

Location-Based Inference

Enhanced Context Awareness

Provide rich location context to improve detection accuracy. Paths are primarily used as a fallback for MIME type detection (via file extension) and for richer exception reporting:

>>> from pathlib import Path
>>> content = b'{"key": "value", "other": "data"}'

Use Path objects for precise location context:

>>> location = Path( 'document.json' )
>>> mimetype = detextive.detect_mimetype( content, location = location )
>>> mimetype in ('application/json', 'text/plain')  # text/plain on Windows with python-magic-bin
True
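
The extension-based fallback described above can be approximated with the standard ``mimetypes`` module. This is a sketch of the general technique, not the library's code, and ``mimetype_from_extension`` is a hypothetical helper name.

```python
import mimetypes
from pathlib import Path

def mimetype_from_extension( location ):
    # Guess a MIME type purely from the file extension; the content
    # itself is never inspected. Returns None when nothing matches.
    mimetype, _encoding = mimetypes.guess_type( str( location ) )
    return mimetype

print( mimetype_from_extension( Path( 'document.json' ) ) )
print( mimetype_from_extension( Path( 'README' ) ) )
```

Because the guess depends only on the name, it is best treated as a hint that content-based detection can confirm or override.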

Default Value Handling

Specify fallback values when detection confidence is insufficient:

ambiguous_content = b'some text'

mimetype, charset = detextive.infer_mimetype_charset(
    ambiguous_content,
    mimetype_supplement = 'text/plain',
    charset_supplement = 'utf-8' )

print( f"Result (with defaults): {mimetype}, {charset}" )
# Output: Result (with defaults): text/plain, utf-8

Text Validation Profiles

Validation Profile Selection

Choose validation strictness based on your use case:

>>> text = "Sample text with ASCII characters"
>>> text_with_unicode = "Unicode: \u2606"

Profiles differ in strictness. The samples below are valid under each profile shown:

>>> detextive.is_valid_text( text, profile = detextive.PROFILE_TEXTUAL )
True
>>> detextive.is_valid_text( text, profile = detextive.PROFILE_TERMINAL_SAFE )
True
>>> detextive.is_valid_text( text_with_unicode, profile = detextive.PROFILE_TEXTUAL )
True
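
To illustrate what a stricter or looser profile might mean in practice, the sketch below rejects control characters except for a configurable allow-list. The rule and the name ``has_only_printables`` are assumptions for illustration and do not reproduce the definitions of the library's profiles.

```python
import unicodedata

def has_only_printables( text: str, allowed_controls: str = '\t\n\r' ) -> bool:
    # Hypothetical check: every character must either be explicitly
    # allowed or fall outside the Unicode control category ('Cc').
    return all(
        character in allowed_controls
        or unicodedata.category( character ) != 'Cc'
        for character in text )

print( has_only_printables( "Unicode: \u2606" ) )
print( has_only_printables( "Text with\x00null bytes" ) )
print( has_only_printables( "line\nbreak", allowed_controls = '' ) )
```

Shrinking the allow-list makes the check stricter, which is one plausible way for a terminal-oriented profile to differ from a general textual one.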

Profile-Aware Decoding

Apply validation profiles during high-level decoding:

>>> content = b'Text for terminal display'
>>> text = detextive.decode(
...     content,
...     profile = detextive.PROFILE_TERMINAL_SAFE )
>>> text
'Text for terminal display'

Validation failures raise TextInvalidity:

>>> import detextive.exceptions
>>> problematic = b'Text with\x00null bytes'
>>> try:
...     detextive.decode( problematic, profile = detextive.PROFILE_TERMINAL_SAFE )
... except detextive.exceptions.TextInvalidity as exception:
...     print( "Text validation failed" )
Text validation failed

Error Handling

Exception Hierarchy

Handle specific error conditions with appropriate exception types:

import detextive
from detextive.exceptions import (
    CharsetDetectFailure,
    TextInvalidity,
    ContentDecodeFailure )

Attempt high-level processing with comprehensive error handling:

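# Placeholder bytes for illustration; substitute your own input.
malformed_content = b'\xff\xfe\x00\x01\x80'
try: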
    text = detextive.decode( malformed_content, location = 'data.txt' )
except CharsetDetectFailure as exception:
    print( f"Charset detection failed: {exception}" )
except TextInvalidity as exception:
    print( f"Text validation failed: {exception}" )
except ContentDecodeFailure as exception:
    print( f"Decoding failed: {exception}" )
except detextive.exceptions.Omnierror as exception:
    print( f"General detextive error: {exception}" )
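
The ordering of the except clauses above matters if, as the final clause suggests, the specific exceptions derive from a common base: Python runs the first matching clause, so the base class must come last. The sketch below demonstrates this with locally defined stand-in classes (``BaseFailure`` and ``DetectFailure`` are hypothetical and are not the library's classes).

```python
class BaseFailure( Exception ):
    # Stand-in for a package-wide base exception (sketch only).
    pass

class DetectFailure( BaseFailure ):
    # Stand-in for a more specific failure (sketch only).
    pass

def classify( error: Exception ) -> str:
    # The specific clause is listed first; placing the base clause
    # first would shadow it, since the first matching clause wins.
    try:
        raise error
    except DetectFailure:
        return 'specific'
    except BaseFailure:
        return 'general'

print( classify( DetectFailure( "no charset" ) ) )  # specific
print( classify( BaseFailure( "other" ) ) )         # general
```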

Integration Patterns

Complete Processing Pipeline

Combine multiple detection steps in a robust processing pipeline:

import detextive
from detextive import Behaviors, BehaviorTristate

def process_document( content, location = None, http_content_type = None ):
    ''' Processes document with comprehensive detection and validation. '''
    behaviors = Behaviors(
        charset_confidence_minimum = 75,
        trial_decode = BehaviorTristate.AsNeeded )
    try:
        mimetype, charset = detextive.infer_mimetype_charset(
            content,
            behaviors = behaviors,
            location = location,
            http_content_type = http_content_type )
        if not detextive.is_textual_mimetype( mimetype ):
            return None, f"Non-textual content: {mimetype}"
        text = detextive.decode(
            content,
            behaviors = behaviors,
            profile = detextive.PROFILE_TEXTUAL,
            location = location,
            http_content_type = http_content_type )
        return text, None
    except detextive.exceptions.Omnierror as exception:
        return None, f"Processing failed: {exception}"

Example usage:

content = b'{"message": "Hello, world!"}'
text, error = process_document( content, location = 'data.json' )
if error is None:  # an empty document is still a successful result
    print( f"Processed text: {text}" )
else:
    print( f"Processing error: {error}" )