Advanced Configuration
This section demonstrates advanced usage including custom behaviors, confidence thresholds, HTTP Content-Type parsing, and comprehensive error handling.
Custom Behaviors
Confidence Thresholds
Control detection confidence requirements through custom behaviors:
>>> import detextive
>>> from detextive import Behaviors
Create a custom behavior configuration with confidence-related parameters:
>>> strict_behaviors = Behaviors(
...     bytes_quantity_confidence_divisor = 512,
...     trial_decode_confidence = 0.9 )
>>> content = b'Hello, world!' * 50
Use custom behaviors for detection:
>>> result = detextive.detect_charset_confidence(
...     content,
...     behaviors = strict_behaviors )
>>> result.confidence > 0.8
True
>>> result.charset
'utf-8'
Trial Decode Configuration
Configure how trial decoding validates detected charsets:
>>> from detextive import BehaviorTristate
Always perform trial decodes for validation. The bytes_quantity_confidence_divisor parameter affects confidence scoring for detection:
>>> validation_behaviors = Behaviors(
...     trial_decode = BehaviorTristate.Always,
...     bytes_quantity_confidence_divisor = 256 )
>>> content = b'Content to validate through decoding'
Detect charset with validation through trial decoding:
>>> charset = detextive.detect_charset(
...     content,
...     behaviors = validation_behaviors )
>>> charset
'utf-8'
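Trial decoding can be pictured as attempting the decode and treating any failure as a rejection of the candidate charset. The following is a minimal standard-library sketch of that idea, not detextive's implementation; the function name `survives_trial_decode` is hypothetical.

```python
def survives_trial_decode( content: bytes, charset: str ) -> bool:
    ''' Reports whether content decodes cleanly with candidate charset. '''
    try: content.decode( charset )
    except ( UnicodeDecodeError, LookupError ):
        # UnicodeDecodeError: the bytes are invalid for this charset.
        # LookupError: Python has no codec registered under this name.
        return False
    return True


print( survives_trial_decode( b'Content to validate through decoding', 'utf-8' ) )
print( survives_trial_decode( b'\xff\xfe invalid', 'utf-8' ) )
```

The first call prints True and the second prints False, because 0xFF can never appear in well-formed UTF-8.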
HTTP Content-Type Parsing
Content-Type Header Processing
Parse HTTP Content-Type headers to extract MIME type and charset:
>>> content_type = "application/json; charset=utf-8"
>>> mimetype, charset = detextive.parse_http_content_type( content_type )
>>> mimetype
'application/json'
>>> charset
'utf-8'
When the header carries no charset parameter, the charset is returned as an absent sentinel rather than a string:
>>> mimetype, charset = detextive.parse_http_content_type( "application/json" )
>>> mimetype
'application/json'
>>> type( charset ).__name__
'AbsentSingleton'
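For comparison, the same split can be approximated with the standard library alone. This is only a sketch of what such parsing involves, not how detextive implements it; the helper name `split_content_type` is hypothetical, and it reports a missing charset as None instead of an absent sentinel.

```python
from email.message import Message


def split_content_type( header: str ):
    ''' Splits Content-Type header into MIME type and optional charset. '''
    message = Message( )
    message[ 'Content-Type' ] = header
    # get_content_type normalizes the MIME type to lowercase.
    mimetype = message.get_content_type( )
    # get_param returns None when the parameter is missing.
    charset = message.get_param( 'charset' )
    return mimetype, charset


print( split_content_type( "application/json; charset=utf-8" ) )
print( split_content_type( "application/json" ) )
```

A sentinel such as the one detextive returns lets callers distinguish "no charset supplied" from other values without overloading None.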
Integration with Detection
Use parsed Content-Type information to guide detection:
>>> content = b'{"message": "Hello"}'
>>> http_header = "application/json; charset=utf-8"
Let the HTTP header inform detection:
>>> mimetype, charset = detextive.infer_mimetype_charset(
...     content,
...     http_content_type = http_header )
>>> mimetype
'application/json'
>>> charset
'utf-8'
Location-Based Inference
Enhanced Context Awareness
Provide rich location context to improve detection accuracy. Paths are primarily used as a fallback for MIME type detection (via file extension) and for richer exception reporting:
>>> from pathlib import Path
>>> content = b'{"key": "value", "other": "data"}'
Use Path objects for precise location context:
>>> location = Path( 'document.json' )
>>> mimetype = detextive.detect_mimetype( content, location = location )
>>> mimetype in ('application/json', 'text/plain') # text/plain on Windows with python-magic-bin
True
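The extension-based fallback mentioned above resembles what the standard library's `mimetypes` module does. The sketch below illustrates that general technique under the assumption that only the file name matters; it is not detextive's code, and the helper name and the `application/octet-stream` default are hypothetical choices.

```python
import mimetypes
from pathlib import Path


def guess_mimetype_from_location(
    location, default = 'application/octet-stream'
):
    ''' Guesses MIME type from file extension, with a default fallback. '''
    # Only the final path component is needed for an extension lookup.
    mimetype, _encoding = mimetypes.guess_type( Path( location ).name )
    return mimetype if mimetype is not None else default


print( guess_mimetype_from_location( Path( 'document.json' ) ) )
print( guess_mimetype_from_location( 'README' ) )
```

An extension lookup never inspects the content, which is why it is better suited as a fallback than as the primary signal.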
Default Value Handling
Specify fallback values when detection confidence is insufficient:
ambiguous_content = b'some text'
mimetype, charset = detextive.infer_mimetype_charset(
    ambiguous_content,
    mimetype_supplement = 'text/plain',
    charset_supplement = 'utf-8' )
print( f"Result (with defaults): {mimetype}, {charset}" )
# Output: Result (with defaults): text/plain, utf-8
Text Validation Profiles
Validation Profile Selection
Choose validation strictness based on your use case:
>>> text = "Sample text with ASCII characters"
>>> text_with_unicode = "Unicode: \u2606"
Different validation profiles have varying strictness levels:
>>> detextive.is_valid_text( text, profile = detextive.PROFILE_TEXTUAL )
True
>>> detextive.is_valid_text( text, profile = detextive.PROFILE_TERMINAL_SAFE )
True
>>> detextive.is_valid_text( text_with_unicode, profile = detextive.PROFILE_TEXTUAL )
True
Profile-Aware Decoding
Apply validation profiles during high-level decoding:
>>> content = b'Text for terminal display'
>>> text = detextive.decode(
...     content,
...     profile = detextive.PROFILE_TERMINAL_SAFE )
>>> text
'Text for terminal display'
Validation failures raise TextInvalidity:
>>> import detextive.exceptions
>>> problematic = b'Text with\x00null bytes'
>>> try:
...     detextive.decode( problematic, profile = detextive.PROFILE_TERMINAL_SAFE )
... except detextive.exceptions.TextInvalidity as exception:
...     print( "Text validation failed" )
Text validation failed
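The exact criteria behind each profile are not spelled out here. As a rough illustration of the kind of check a terminal-oriented profile might apply, the sketch below flags control characters other than common whitespace. The rule, the allowed set, and the name `find_disallowed_characters` are all assumptions for illustration, not detextive's definitions.

```python
import unicodedata

# Assumption: tab, newline, and carriage return are acceptable controls.
_ALLOWED_CONTROLS = frozenset( '\t\n\r' )


def find_disallowed_characters( text: str ) -> list:
    ''' Lists ( index, character ) pairs for unexpected control characters. '''
    return [
        ( index, character )
        for index, character in enumerate( text )
        # 'Cc' is the Unicode general category for control characters.
        if unicodedata.category( character ) == 'Cc'
        and character not in _ALLOWED_CONTROLS ]


print( find_disallowed_characters( "Unicode: \u2606" ) )
print( find_disallowed_characters( "Text with\x00null bytes" ) )
```

Returning positions rather than a bare boolean makes it easier to report where the offending characters occur.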
Error Handling
Exception Hierarchy
Handle specific error conditions with appropriate exception types:
import detextive
from detextive.exceptions import (
    CharsetDetectFailure,
    TextInvalidity,
    ContentDecodeFailure )
Attempt high-level processing with comprehensive error handling:
# Hypothetical sample bytes; which handler fires depends on the input.
malformed_content = b'\x80\x81\xfe\xff'
try:
    text = detextive.decode( malformed_content, location = 'data.txt' )
except CharsetDetectFailure as exception:
    print( f"Charset detection failed: {exception}" )
except TextInvalidity as exception:
    print( f"Text validation failed: {exception}" )
except ContentDecodeFailure as exception:
    print( f"Decoding failed: {exception}" )
except detextive.exceptions.Omnierror as exception:
    # Catch-all for any other detextive error; keep this clause last.
    print( f"General detextive error: {exception}" )
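The ordering of the handlers matters: Python runs the first `except` clause that matches, so a general handler placed first would shadow the specific ones. The sketch below demonstrates this with stand-in classes. It assumes, as the handler order above suggests, that the specific exceptions derive from a common base; the class names here are hypothetical and are not detextive's.

```python
class GeneralFailure( Exception ):
    ''' Stand-in for a package-wide base error. '''


class DecodeFailure( GeneralFailure ):
    ''' Stand-in for a more specific error derived from the base. '''


def classify( exception: Exception ) -> str:
    ''' Reports which handler catches the given exception. '''
    try: raise exception
    except DecodeFailure: return 'specific'
    except GeneralFailure: return 'general'
    except Exception: return 'unrelated'


print( classify( DecodeFailure( 'bad bytes' ) ) )
print( classify( GeneralFailure( 'something else' ) ) )
print( classify( ValueError( 'not ours' ) ) )
```

If the `GeneralFailure` clause came before the `DecodeFailure` clause, the first call would report 'general' instead of 'specific'.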
Integration Patterns
Complete Processing Pipeline
Combine multiple detection steps in a robust processing pipeline:
import detextive
from detextive import Behaviors, BehaviorTristate
def process_document( content, location = None, http_content_type = None ):
    ''' Processes document with comprehensive detection and validation. '''
    behaviors = Behaviors(
        charset_confidence_minimum = 75,
        trial_decode = BehaviorTristate.AsNeeded )
    try:
        mimetype, charset = detextive.infer_mimetype_charset(
            content,
            behaviors = behaviors,
            location = location,
            http_content_type = http_content_type )
        if not detextive.is_textual_mimetype( mimetype ):
            return None, f"Non-textual content: {mimetype}"
        text = detextive.decode(
            content,
            behaviors = behaviors,
            profile = detextive.PROFILE_TEXTUAL,
            location = location,
            http_content_type = http_content_type )
        return text, None
    except detextive.exceptions.Omnierror as exception:
        return None, f"Processing failed: {exception}"
Example usage:
content = b'{"message": "Hello, world!"}'
text, error = process_document( content, location = 'data.json' )
# Test the error slot, so that an empty document is not reported as a failure.
if error is None:
    print( f"Processed text: {text}" )
else:
    print( f"Processing error: {error}" )
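When many documents go through a pipeline with this `( text, error )` return contract, it can help to separate successes from failures in one pass. The sketch below assumes only that contract: `partition_results` and `fake_processor` are hypothetical names, and the stand-in processor uses plain UTF-8 decoding so that the example runs without detextive.

```python
def partition_results( documents, processor ):
    ''' Splits named documents into decoded texts and error messages. '''
    texts, errors = { }, { }
    for name, content in documents.items( ):
        text, error = processor( content, location = name )
        # Check the error slot: an empty string is still a valid result.
        if error is None: texts[ name ] = text
        else: errors[ name ] = error
    return texts, errors


def fake_processor( content, location = None ):
    ''' Hypothetical stand-in that mimics the ( text, error ) contract. '''
    try: return content.decode( 'utf-8' ), None
    except UnicodeDecodeError as exception:
        return None, f"Processing failed: {exception}"


texts, errors = partition_results(
    { 'empty.txt': b'', 'data.json': b'{"a": 1}', 'blob.bin': b'\xff' },
    fake_processor )
print( sorted( texts ) )
print( sorted( errors ) )
```

Checking `error is None` rather than the truthiness of the text keeps an empty document from being counted as a failure.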