Basic Usage
This section demonstrates core text detection capabilities. Examples progress from simple detection to combined inference and high-level text processing.
Character Encoding Detection
Basic Encoding Detection
Detect character encoding from byte content:
>>> import detextive
>>> content = b'Hello, world!'
>>> charset = detextive.detect_charset( content )
>>> charset
'utf-8'
UTF-8 content containing non-ASCII characters is also recognized:
>>> content = b'Caf\xc3\xa9 \xe2\x98\x85'
>>> charset = detextive.detect_charset( content )
>>> charset
'utf-8'
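Legacy single-byte encodings can be detected given sufficient content. The reported charset may be a close relative of the one used to encode: below, ISO-8859-9 is reported for ISO-8859-1 input, and the two encodings differ in only six code points, none of which occur in this sample: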
>>> content = 'Café Restaurant Menu\nEntrées: Soupe, Salade'.encode( 'iso-8859-1' )
>>> charset = detextive.detect_charset( content )
>>> charset
'iso8859-9'
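Why the detector can report ISO-8859-9 for ISO-8859-1 input can be checked with the standard library alone. The following is a minimal sketch that uses only Python's built-in codecs and makes no detextive calls. The variable names are illustrative.

```python
# Standard-library illustration only; nothing here calls detextive.
text = 'Café Restaurant Menu\nEntrées: Soupe, Salade'

as_latin1 = text.encode( 'iso-8859-1' )
as_latin5 = text.encode( 'iso-8859-9' )

# The sample uses only characters that both encodings map to the same bytes.
same_bytes = as_latin1 == as_latin5

# Byte values that decode to different characters in the two encodings.
differing = [
    value for value in range( 256 )
    if bytes( [ value ] ).decode( 'iso-8859-1' )
    != bytes( [ value ] ).decode( 'iso-8859-9' )
]
```

Because the byte sequences are identical, decoding this sample with either charset label yields the same text. The reported name is best read as "a compatible charset" rather than proof of the original encoding.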
MIME Type Detection
Content-Based Detection
Detect MIME types from file content using magic bytes:
>>> import detextive
>>> json_content = b'{"name": "example", "value": 42}'
>>> mimetype = detextive.detect_mimetype( json_content )
>>> mimetype in ('application/json', 'text/plain') # text/plain on Windows with python-magic-bin
True
Location-aware detection combines content analysis with file extension:
# For plain text without magic bytes, location helps determine MIME type
text_content = b'Plain text content'
try:
    mimetype = detextive.detect_mimetype( text_content, location = 'document.txt' )
    print( f"Text file MIME type: {mimetype}" )
except detextive.exceptions.MimetypeDetectFailure:
    print( "Could not detect MIME type - need more distinctive content" )
# Note: Plain text without magic bytes may require charset detection
Binary formats with distinctive magic bytes are identified from their headers:
>>> pdf_header = b'%PDF-1.4'
>>> mimetype = detextive.detect_mimetype( pdf_header )
>>> mimetype
'application/pdf'
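The general idea behind the examples above, checking leading bytes first and falling back to the file extension, can be sketched with the standard library. This is an illustration of the technique under stated assumptions, not detextive's implementation: `sniff_mimetype` and `SIGNATURES` are hypothetical names, and the signature table is deliberately tiny.

```python
import mimetypes

# Hypothetical signature table for illustration; real detectors know many more.
SIGNATURES = (
    ( b'%PDF-', 'application/pdf' ),
    ( b'\x89PNG\r\n\x1a\n', 'image/png' ),
    ( b'\xff\xd8\xff', 'image/jpeg' ),
)

def sniff_mimetype( content, location = None ):
    ''' Returns a MIME type from leading bytes, else from the file extension. '''
    for signature, mimetype in SIGNATURES:
        if content.startswith( signature ):
            return mimetype
    if location is not None:
        # guess_type consults only the name; it never reads the content.
        guess, _ = mimetypes.guess_type( location )
        return guess
    return None
```

With this sketch, `sniff_mimetype( b'%PDF-1.4' )` matches on content alone, while plain text yields a result only when a location such as `'document.txt'` is supplied.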
Combined Inference
MIME Type and Charset Together
For best accuracy, infer the MIME type and charset together in a single call:
>>> import detextive
>>> content = b'{"message": "Hello"}'
>>> mimetype, charset = detextive.infer_mimetype_charset( content, location = 'data.json' )
>>> mimetype
'application/json'
>>> charset
'utf-8'
Plain text files with location context:
>>> content = b'Sample document content'
>>> mimetype, charset = detextive.infer_mimetype_charset( content, location = 'readme.txt' )
>>> mimetype
'text/plain'
>>> charset
'utf-8'
Confidence-Based Detection
Access confidence scores for detection decisions using the confidence API:
>>> import detextive
>>> content = b'{"name": "example", "data": "test"}'
>>> mimetype_result, charset_result = detextive.infer_mimetype_charset_confidence( content, location = 'config.json' )
>>> mimetype_result.mimetype
'application/json'
>>> mimetype_result.confidence > 0.8
True
>>> charset_result.charset
'utf-8'
>>> charset_result.confidence > 0.8
True
Confidence scores let callers decide whether to trust a result or fall back to a default, for example with plain text that has no magic bytes:
>>> text_content = b'Plain text without magic bytes'
>>> mimetype_result, charset_result = detextive.infer_mimetype_charset_confidence( text_content, location = 'notes.txt' )
>>> mimetype_result.mimetype
'text/plain'
>>> mimetype_result.confidence > 0.7
True
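One way to act on a confidence score is a simple threshold with a fallback. The sketch below is self-contained: `MimetypeResult` is a hypothetical stand-in that mirrors only the `mimetype` and `confidence` attributes shown above. The threshold, the default value, and the sample scores are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass( frozen = True )
class MimetypeResult:
    ''' Hypothetical stand-in with the two attributes used in the examples. '''
    mimetype: str
    confidence: float

def choose_mimetype(
    result, threshold = 0.8, default = 'application/octet-stream'
):
    ''' Returns the detected MIME type only if confidence meets the threshold. '''
    if result.confidence >= threshold:
        return result.mimetype
    return default

# Illustrative scores, not values produced by any real detector.
confident = choose_mimetype( MimetypeResult( 'application/json', 0.95 ) )
doubtful = choose_mimetype( MimetypeResult( 'text/plain', 0.4 ) )
```

A real result object returned by the confidence API could presumably be passed to the same helper, since it only reads the two attributes.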
High-Level Decoding
Automatic Text Decoding
The decode function handles complete bytes-to-text processing in a single call:
>>> import detextive
>>> content = b'Hello, world!'
>>> text = detextive.decode( content )
>>> text
'Hello, world!'
UTF-8 content is properly decoded:
>>> content = b'Caf\xc3\xa9 \xe2\x98\x85'
>>> text = detextive.decode( content )
>>> text
'Café ★'
Location context can be supplied to inform decoding decisions:
>>> content = b'Sample content for analysis'
>>> text = detextive.decode( content, location = 'document.txt' )
>>> text
'Sample content for analysis'
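For comparison, a much cruder bytes-to-text strategy is to try a fixed list of charsets in order. The sketch below shows that baseline using only built-in codecs. It is not how detextive's `decode` works, and `decode_with_fallback` is a hypothetical helper.

```python
def decode_with_fallback( content, charsets = ( 'utf-8', 'iso-8859-1' ) ):
    ''' Tries each charset in order and returns the first successful decode. '''
    for charset in charsets:
        try:
            return content.decode( charset )
        except UnicodeDecodeError:
            continue
    raise ValueError( 'No candidate charset could decode the content.' )

utf8_text = decode_with_fallback( b'Caf\xc3\xa9 \xe2\x98\x85' )
latin1_text = decode_with_fallback( b'Caf\xe9' )
```

Because ISO-8859-1 assigns a character to every byte value, it never raises and so acts as a catch-all. That makes the baseline prone to silently producing wrong text, which is the gap that detection-based decoding aims to close.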
Content Validation
MIME Type Classification
Check if MIME types represent textual content:
>>> import detextive
>>> detextive.is_textual_mimetype( 'text/plain' )
True
>>> detextive.is_textual_mimetype( 'application/json' )
True
>>> detextive.is_textual_mimetype( 'image/jpeg' )
False
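A classification like this is commonly built from a few rules: the `text/` top-level type, a short list of textual `application/` types, and structured-syntax suffixes such as `+json` and `+xml`. The sketch below illustrates that approach under those assumptions. The rule set and the name `looks_textual` are hypothetical and may differ from what `is_textual_mimetype` actually checks.

```python
# Hypothetical rule set for illustration; the library's own list may differ.
TEXTUAL_APPLICATION_TYPES = frozenset( (
    'application/json', 'application/xml', 'application/javascript',
) )
TEXTUAL_SUFFIXES = ( '+json', '+xml' )

def looks_textual( mimetype ):
    ''' Returns True if the MIME type plausibly denotes textual content. '''
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = mimetype.split( ';', 1 )[ 0 ].strip( ).lower( )
    if essence.startswith( 'text/' ):
        return True
    if essence in TEXTUAL_APPLICATION_TYPES:
        return True
    return essence.endswith( TEXTUAL_SUFFIXES )
```

Under these rules `image/svg+xml` counts as textual even though its top-level type is `image`, which is one reason suffix checks are worth including.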
Text Quality Validation
Validate that decoded text meets quality standards:
>>> import detextive
>>> text = "Hello, world!"
>>> detextive.is_valid_text( text )
True
Text with control characters fails validation:
>>> text_with_controls = "Hello\x00\x01world"
>>> detextive.is_valid_text( text_with_controls )
False
Validation results for several kinds of input, including whitespace-only and empty strings:
>>> detextive.is_valid_text( "Hello, world!" )
True
>>> detextive.is_valid_text( "Hello\x00\x01world" )
False
>>> detextive.is_valid_text( " \n\t " )
True
>>> detextive.is_valid_text( "" )
True
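The results above are consistent with a rule that rejects control characters other than common whitespace. A minimal sketch of such a rule, using the standard `unicodedata` module, follows. It is an assumption about the kind of check involved, not the library's actual criteria, and `looks_like_valid_text` is a hypothetical name.

```python
import unicodedata

# Control characters that ordinary text is allowed to contain.
ALLOWED_CONTROLS = frozenset( '\t\n\r' )

def looks_like_valid_text( text ):
    ''' Returns False if a control character other than tab, LF, or CR occurs. '''
    return not any(
        unicodedata.category( character ) == 'Cc'
        and character not in ALLOWED_CONTROLS
        for character in text
    )
```

An empty string passes trivially because there is no character to reject, matching the behavior shown for `is_valid_text( "" )`.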