Basic Usage

This section demonstrates the core detection capabilities of detextive. The examples progress from charset and MIME type detection to combined inference, high-level decoding, and content validation.

Character Encoding Detection

Basic Encoding Detection

Detect character encoding from byte content:

>>> import detextive
>>> content = b'Hello, world!'
>>> charset = detextive.detect_charset( content )
>>> charset
'utf-8'

UTF-8 content with special characters:

>>> content = b'Caf\xc3\xa9 \xe2\x98\x85'
>>> charset = detextive.detect_charset( content )
>>> charset
'utf-8'

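Non-UTF-8 encodings can be detected given sufficient content. Detection is heuristic, so the result may be a closely related charset rather than the exact one used to encode: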

>>> content = 'Café Restaurant Menu\nEntrées: Soupe, Salade'.encode( 'iso-8859-1' )
>>> charset = detextive.detect_charset( content )
>>> charset
'iso8859-9'
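
The sample was encoded as ISO-8859-1 but reported as ISO-8859-9. The two charsets differ in only six byte values, and none of them occur in the sample, so either answer decodes it correctly. The following sketch checks this with the standard library only; it does not call detextive, and the variable names are illustrative:

```python
# Compare ISO-8859-1 and ISO-8859-9 using only the standard library.
sample = 'Café Restaurant Menu\nEntrées: Soupe, Salade'
encoded = sample.encode( 'iso-8859-1' )

# Both codecs decode the sample to the same text.
decoded_latin1 = encoded.decode( 'iso-8859-1' )
decoded_latin5 = encoded.decode( 'iso8859-9' )
same_text = decoded_latin1 == decoded_latin5 == sample

# Byte values that map to different characters in the two charsets.
differing = [
    value for value in range( 256 )
    if bytes( [ value ] ).decode( 'iso-8859-1' )
    != bytes( [ value ] ).decode( 'iso8859-9' )
]

# The sample contains none of the differing bytes, so it is ambiguous.
sample_is_ambiguous = not any( byte in differing for byte in encoded )
```

When the exact charset matters, longer samples or samples containing the distinguishing characters give a detector more to work with.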

MIME Type Detection

Content-Based Detection

Detect MIME types from the content itself, using magic bytes or other recognizable structure:

>>> import detextive
>>> json_content = b'{"name": "example", "value": 42}'
>>> mimetype = detextive.detect_mimetype( json_content )
>>> mimetype in ('application/json', 'text/plain')  # text/plain on Windows with python-magic-bin
True

Location-aware detection combines content analysis with file extension:

# For plain text without magic bytes, location helps determine MIME type
text_content = b'Plain text content'
try:
    mimetype = detextive.detect_mimetype( text_content, location = 'document.txt' )
    print( f"Text file MIME type: {mimetype}" )
except detextive.exceptions.MimetypeDetectFailure:
    print( "Could not detect MIME type - need more distinctive content" )
# Note: Plain text without magic bytes may require charset detection
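
As a rough illustration of what a location hint can contribute, the standard library's mimetypes module maps a file extension to a MIME type. This is a standard-library sketch, not detextive's implementation, and guess_from_location is a hypothetical helper introduced only for this example:

```python
import mimetypes

def guess_from_location( location ):
    ''' Guesses a MIME type from the file extension alone.

        Hypothetical helper; not part of detextive.
    '''
    mimetype, _encoding = mimetypes.guess_type( location )
    return mimetype  # None when the extension is not recognized

txt_guess = guess_from_location( 'document.txt' )
unknown_guess = guess_from_location( 'document.unknown-extension' )
```

An extension-only guess ignores the content entirely, which is why combining it with content analysis is more robust than either signal alone.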

Binary formats are identified by their magic bytes:

>>> pdf_header = b'%PDF-1.4'
>>> mimetype = detextive.detect_mimetype( pdf_header )
>>> mimetype
'application/pdf'
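
Magic-byte detection amounts to matching well-known byte signatures at the start of the content. The sketch below shows the idea with a tiny, hypothetical signature table; the names MAGIC_SIGNATURES and sniff_magic are assumptions for illustration and are not part of detextive, which relies on far more complete detection:

```python
# Hypothetical signature table; real detectors know many more formats.
MAGIC_SIGNATURES = (
    ( b'%PDF-', 'application/pdf' ),
    ( b'\x89PNG\r\n\x1a\n', 'image/png' ),
    ( b'GIF87a', 'image/gif' ),
    ( b'GIF89a', 'image/gif' ),
)

def sniff_magic( content ):
    ''' Returns the MIME type for a known leading signature, else None. '''
    for signature, mimetype in MAGIC_SIGNATURES:
        if content.startswith( signature ):
            return mimetype
    return None

pdf_guess = sniff_magic( b'%PDF-1.4' )
text_guess = sniff_magic( b'Plain text content' )  # no signature matches
```

Plain text has no such signature, which is why the earlier plain-text example benefits from a location hint.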

Combined Inference

MIME Type and Charset Together

For best accuracy, detect both MIME type and charset simultaneously:

>>> import detextive
>>> content = b'{"message": "Hello"}'
>>> mimetype, charset = detextive.infer_mimetype_charset( content, location = 'data.json' )
>>> mimetype
'application/json'
>>> charset
'utf-8'

Plain text files with location context:

>>> content = b'Sample document content'
>>> mimetype, charset = detextive.infer_mimetype_charset( content, location = 'readme.txt' )
>>> mimetype
'text/plain'
>>> charset
'utf-8'

Confidence-Based Detection

Access confidence scores for detection decisions using the confidence API:

>>> import detextive
>>> content = b'{"name": "example", "data": "test"}'
>>> mimetype_result, charset_result = detextive.infer_mimetype_charset_confidence( content, location = 'config.json' )
>>> mimetype_result.mimetype
'application/json'
>>> mimetype_result.confidence > 0.8
True
>>> charset_result.charset
'utf-8'
>>> charset_result.confidence > 0.8
True

The confidence API is useful for quality assessment and decision making:

>>> text_content = b'Plain text without magic bytes'
>>> mimetype_result, charset_result = detextive.infer_mimetype_charset_confidence( text_content, location = 'notes.txt' )
>>> mimetype_result.mimetype
'text/plain'
>>> mimetype_result.confidence > 0.7
True
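
One way to act on a confidence score is to accept a detection only above a threshold and fall back to a safe default otherwise. The sketch below assumes only the mimetype and confidence attributes shown above; MimetypeResult is a hypothetical stand-in so the example runs without detextive, and both the 0.8 threshold and the fallback value are arbitrary choices:

```python
from typing import NamedTuple

class MimetypeResult( NamedTuple ):
    ''' Hypothetical stand-in exposing the two attributes used above. '''
    mimetype: str
    confidence: float

def choose_mimetype(
    result, threshold = 0.8, fallback = 'application/octet-stream'
):
    ''' Accepts the detected MIME type only when confidence is high enough. '''
    if result.confidence >= threshold:
        return result.mimetype
    return fallback

confident = choose_mimetype( MimetypeResult( 'application/json', 0.95 ) )
doubtful = choose_mimetype( MimetypeResult( 'text/plain', 0.4 ) )
```

The same function should work with any result object that exposes those two attributes.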

High-Level Decoding

Automatic Text Decoding

The decode function provides complete bytes-to-text processing:

>>> import detextive
>>> content = b'Hello, world!'
>>> text = detextive.decode( content )
>>> text
'Hello, world!'

UTF-8 content is properly decoded:

>>> content = b'Caf\xc3\xa9 \xe2\x98\x85'
>>> text = detextive.decode( content )
>>> text
'Café ★'

Location context improves decoding decisions:

>>> content = b'Sample content for analysis'
>>> text = detextive.decode( content, location = 'document.txt' )
>>> text
'Sample content for analysis'
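
For comparison, a much simpler bytes-to-text strategy tries a fixed list of candidate charsets in order. This is a deliberately naive standard-library sketch, not how detextive decodes; decode_with_candidates and its candidate list are hypothetical:

```python
def decode_with_candidates( content, candidates = ( 'utf-8', 'iso-8859-1' ) ):
    ''' Decodes with the first candidate charset that succeeds.

        Hypothetical helper; returns the text and the charset used.
    '''
    for charset in candidates:
        try:
            return content.decode( charset ), charset
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError( 'No candidate charset could decode the content.' )

text, charset = decode_with_candidates( b'Caf\xc3\xa9 \xe2\x98\x85' )
legacy_text, legacy_charset = decode_with_candidates( b'Caf\xe9' )
```

Because ISO-8859-1 accepts every byte sequence, a fixed list like this can silently produce wrong text, which is the gap that content-based charset detection is meant to close.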

Content Validation

MIME Type Classification

Check if MIME types represent textual content:

>>> import detextive
>>> detextive.is_textual_mimetype( 'text/plain' )
True
>>> detextive.is_textual_mimetype( 'application/json' )
True
>>> detextive.is_textual_mimetype( 'image/jpeg' )
False
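
A classification like this can be approximated by treating text/* types, a few known application types, and structured-syntax suffixes as textual. The sketch below is an assumption about what such a rule might look like, not detextive's actual rule set; looks_textual and both constants are hypothetical:

```python
# Hypothetical allow-lists; a real classifier would cover more types.
TEXTUAL_APPLICATION_TYPES = frozenset( (
    'application/json',
    'application/xml',
    'application/javascript',
) )
TEXTUAL_SUFFIXES = ( '+json', '+xml' )

def looks_textual( mimetype ):
    ''' Heuristically decides whether a MIME type denotes text. '''
    # Drop parameters such as '; charset=utf-8' and normalize case.
    essence = mimetype.split( ';', 1 )[ 0 ].strip( ).lower( )
    if essence.startswith( 'text/' ):
        return True
    if essence in TEXTUAL_APPLICATION_TYPES:
        return True
    return essence.endswith( TEXTUAL_SUFFIXES )
```

For the three MIME types shown above, this sketch gives the same answers as the documented results.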

Text Quality Validation

Validate that decoded text meets quality standards:

>>> import detextive
>>> text = "Hello, world!"
>>> detextive.is_valid_text( text )
True

Text with control characters fails validation:

>>> text_with_controls = "Hello\x00\x01world"
>>> detextive.is_valid_text( text_with_controls )
False

Whitespace-only and empty strings pass validation:

>>> detextive.is_valid_text( "   \n\t  " )
True
>>> detextive.is_valid_text( "" )
True
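
The results above are consistent with a check that rejects control characters other than common whitespace. The sketch below implements only that one rule with the standard library; it is an assumption for illustration, and detextive's validation may weigh other factors. The names ALLOWED_CONTROLS and has_no_stray_controls are hypothetical:

```python
import unicodedata

# Control characters that routinely appear in ordinary text.
ALLOWED_CONTROLS = frozenset( '\t\n\r' )

def has_no_stray_controls( text ):
    ''' Returns False if text contains a control character other than
        tab, newline, or carriage return.
    '''
    return not any(
        unicodedata.category( character ) == 'Cc'
        and character not in ALLOWED_CONTROLS
        for character in text )

plain_ok = has_no_stray_controls( "Hello, world!" )
controls_ok = has_no_stray_controls( "Hello\x00\x01world" )
whitespace_ok = has_no_stray_controls( "   \n\t  " )
empty_ok = has_no_stray_controls( "" )
```

An empty string passes because it contains no characters to reject.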