Text Processing Examples

This section demonstrates practical usage of core text processing capabilities. Examples progress from basic usage to more advanced scenarios including error handling and edge cases.

Character Encoding Detection

Basic Encoding Detection

Detect character encoding from byte content:

>>> import detextive
>>> content = b'Hello, world!'
>>> encoding = detextive.detect_charset( content )
>>> print( encoding )
utf-8

UTF-8 content is correctly identified:

>>> content = b'Caf\xc3\xa9 \xe2\x98\x85'
>>> encoding = detextive.detect_charset( content )
>>> print( encoding )
utf-8

Empty content returns None:

>>> content = b''
>>> encoding = detextive.detect_charset( content )
>>> print( encoding )
None
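
For intuition, trial decoding is one simple way to approximate charset detection. The sketch below uses only the standard library; it is not how detextive works internally, and `guess_charset` and its candidate list are hypothetical names chosen for illustration:

```python
def guess_charset(content: bytes, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate codec that decodes the content.

    Hypothetical helper: trial decoding only, no statistical analysis.
    Empty content yields None, mirroring the example above.
    """
    if not content:
        return None
    for name in candidates:
        try:
            content.decode(name)
        except UnicodeDecodeError:
            continue  # Not valid in this codec; try the next candidate.
        return name
    return None


print(guess_charset(b'Caf\xc3\xa9 \xe2\x98\x85'))  # valid UTF-8
print(guess_charset(b'Caf\xe9'))                   # invalid UTF-8, valid cp1252
print(guess_charset(b''))                          # empty content
```

Candidate order matters in this approach: permissive single-byte codecs accept almost any input, so stricter codecs such as UTF-8 should be tried first.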

MIME Type Detection

Content-Based Detection

Detect MIME types using magic numbers and file extensions:

>>> import detextive
>>>
>>> content = b'{"name": "example", "value": 42}'
>>> mimetype = detextive.detect_mimetype( content, 'data.json' )
>>> print( mimetype )
application/json

JPEG image detection using magic numbers:

>>> content = b'\xff\xd8\xff\xe0\x00\x10JFIF'
>>> mimetype = detextive.detect_mimetype( content, 'photo.jpg' )
>>> print( mimetype )
image/jpeg

Extension Fallback

When magic number detection fails, extension-based detection is used:

>>> content = b'some content without magic numbers'
>>> mimetype = detextive.detect_mimetype( content, 'document.pdf' )
>>> print( mimetype )
application/pdf

Path objects work as location parameters:

>>> from pathlib import Path
>>> location = Path( 'document.txt' )
>>> content = b'Plain text content for demonstration'
>>> mimetype = detextive.detect_mimetype( content, location )
>>> print( mimetype )
text/plain
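
The "magic numbers first, extension second" strategy can be sketched with the standard library alone. This is an illustration under stated assumptions, not detextive's implementation: `guess_mimetype` and `MAGIC_SIGNATURES` are hypothetical names, and the signature table is deliberately tiny:

```python
import mimetypes
from pathlib import Path

# A few well-known leading byte signatures; real detectors know many more.
MAGIC_SIGNATURES = (
    (b'\xff\xd8\xff', 'image/jpeg'),
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'%PDF-', 'application/pdf'),
)


def guess_mimetype(content: bytes, location):
    """Check leading bytes first, then fall back to the file extension."""
    for signature, mimetype in MAGIC_SIGNATURES:
        if content.startswith(signature):
            return mimetype
    # str() lets the same code path accept both str and Path locations.
    mimetype, _ = mimetypes.guess_type(str(location))
    return mimetype


print(guess_mimetype(b'\xff\xd8\xff\xe0\x00\x10JFIF', 'photo.jpg'))
print(guess_mimetype(b'some content without magic numbers', 'document.pdf'))
print(guess_mimetype(b'Plain text content', Path('document.txt')))
```

Checking content before the extension means a misnamed file is still classified by what it actually contains.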

Combined Detection

Detecting Both MIME Type and Charset

Get both MIME type and character encoding in one call:

>>> content = b'<html><body>Hello World</body></html>'
>>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'page.html' )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: text/html, Charset: utf-8

When the location gives no hint about the MIME type, charset detection alone still yields a result:

>>> content = b'Just some plain text content'
>>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'unknown' )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: text/plain, Charset: utf-8

Content with unknown extension but detectable charset defaults to text/plain:

>>> content = b'readable text content without clear file type'
>>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'unknown_file' )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: text/plain, Charset: utf-8

Override Parameters

Pass explicit keyword arguments to override detected values, such as the charset:

>>> content = b'<?xml version="1.0"?><root>data</root>'
>>> mimetype, charset = detextive.detect_mimetype_and_charset(
...     content, 'data.xml', charset = 'iso-8859-1'
... )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: application/xml, Charset: iso-8859-1
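
The interplay of detection, the text/plain fallback, and overrides can be sketched as follows. This is a simplified standard-library stand-in, not the library's code: `guess_both` is a hypothetical name, the only charset it tries is UTF-8, and the `mimetype` override parameter is an assumption added for symmetry:

```python
import mimetypes


def guess_both(content: bytes, location: str, *, mimetype=None, charset=None):
    """Return (mimetype, charset), honoring explicit overrides.

    Hypothetical sketch: only UTF-8 is attempted for charset detection.
    """
    if mimetype is None:
        mimetype, _ = mimetypes.guess_type(location)
    if charset is None and content:
        try:
            content.decode('utf-8')
        except UnicodeDecodeError:
            charset = None
        else:
            charset = 'utf-8'
    # Decodable content with no recognizable type is treated as plain text.
    if mimetype is None and charset is not None:
        mimetype = 'text/plain'
    return mimetype, charset


print(guess_both(b'<html><body>Hello World</body></html>', 'page.html'))
print(guess_both(b'Just some plain text content', 'unknown'))
print(guess_both(b'<root>data</root>', 'data.xml', charset='iso-8859-1'))
```

An explicit override skips detection for that value entirely, which is useful when out-of-band information (for example, an HTTP header) is more trustworthy than a guess.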

Content Validation

MIME Type Validation

Check if MIME types represent textual content:

>>> import detextive
>>>
>>> print( detextive.is_textual_mimetype( 'text/plain' ) )
True
>>> print( detextive.is_textual_mimetype( 'text/html' ) )
True

Application types with textual content:

>>> print( detextive.is_textual_mimetype( 'application/json' ) )
True
>>> print( detextive.is_textual_mimetype( 'application/xml' ) )
True
>>> print( detextive.is_textual_mimetype( 'application/javascript' ) )
True

Structured syntax suffixes such as +json and +xml are recognized:

>>> print( detextive.is_textual_mimetype( 'application/vnd.api+json' ) )
True
>>> print( detextive.is_textual_mimetype( 'application/custom+xml' ) )
True

Non-textual types return False:

>>> print( detextive.is_textual_mimetype( 'image/jpeg' ) )
False
>>> print( detextive.is_textual_mimetype( 'video/mp4' ) )
False
>>> print( detextive.is_textual_mimetype( 'application/octet-stream' ) )
False

Edge Cases

Empty and malformed MIME types:

>>> print( detextive.is_textual_mimetype( '' ) )
False
>>> print( detextive.is_textual_mimetype( 'invalid' ) )
False
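
The rules illustrated above (the text/ family, a few known application types, and structured syntax suffixes) can be approximated in a few lines. The sketch below is an assumption about one reasonable rule set, not detextive's actual tables; `looks_textual` and both constants are hypothetical names:

```python
# Hypothetical rule tables; a real implementation would likely know more types.
TEXTUAL_APPLICATION_TYPES = frozenset((
    'application/json',
    'application/xml',
    'application/javascript',
))
TEXTUAL_SUFFIXES = ('+json', '+xml')


def looks_textual(mimetype: str) -> bool:
    """Return True if the MIME type plausibly denotes textual content."""
    if '/' not in mimetype:
        return False  # Empty or malformed values such as 'invalid'.
    if mimetype.startswith('text/'):
        return True
    if mimetype in TEXTUAL_APPLICATION_TYPES:
        return True
    # str.endswith accepts a tuple of candidate suffixes.
    return mimetype.endswith(TEXTUAL_SUFFIXES)


for candidate in ('text/html', 'application/vnd.api+json', 'image/jpeg', ''):
    print(repr(candidate), looks_textual(candidate))
```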

Text Reasonableness Testing

Validate that byte content represents textual data:

>>> import detextive
>>>
>>> content = b'This is readable text with proper formatting.'
>>> print( detextive.is_textual_content( content ) )
True

Content with acceptable whitespace:

>>> content = b'Line 1\n\tIndented line\nLast line'
>>> print( detextive.is_textual_content( content ) )
True

Rejecting Non-Textual Content

Empty content is rejected:

>>> print( detextive.is_textual_content( b'' ) )
False

Non-textual content is rejected:

>>> content = b'\x00\x01\x02\x03\x04\x05'
>>> print( detextive.is_textual_content( content ) )
False
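
One common heuristic for this kind of check is to decode the bytes and measure the share of printable characters. The sketch below illustrates that idea only; `looks_like_text`, the UTF-8 assumption, and the 0.95 threshold are all hypothetical and may differ from what detextive does:

```python
def looks_like_text(content: bytes, threshold: float = 0.95) -> bool:
    """Heuristic: decodable as UTF-8 and mostly printable characters."""
    if not content:
        return False  # Empty content is rejected, as in the example above.
    try:
        text = content.decode('utf-8')
    except UnicodeDecodeError:
        return False
    # Newlines, carriage returns, and tabs count as acceptable whitespace.
    acceptable = sum(1 for ch in text if ch.isprintable() or ch in '\n\r\t')
    return acceptable / len(text) >= threshold


print(looks_like_text(b'Line 1\n\tIndented line\nLast line'))
print(looks_like_text(b'\x00\x01\x02\x03\x04\x05'))
print(looks_like_text(b''))
```

Note that control bytes such as `\x00` decode without error as UTF-8, so the decode step alone is not sufficient; the printable ratio is what rejects them.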

Line Separator Detection

Detecting Line Endings

Detect line separators from byte content:

>>> import detextive
>>>
>>> content = b'line1\nline2\nline3'
>>> separator = detextive.LineSeparators.detect_bytes( content )
>>> print( separator )
LineSeparators.LF

Windows line endings:

>>> content = b'line1\r\nline2\r\nline3'
>>> separator = detextive.LineSeparators.detect_bytes( content )
>>> print( separator )
LineSeparators.CRLF

No line separators found:

>>> content = b'just one line'
>>> separator = detextive.LineSeparators.detect_bytes( content )
>>> print( separator )
None
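
Detection of this kind can be implemented by scanning for the first line-ending byte. The following is a minimal sketch under that assumption; the `Separator` enum and `detect_separator` function are hypothetical stand-ins, not the library's `LineSeparators` class:

```python
import enum


class Separator(enum.Enum):
    """Hypothetical stand-in enumeration of line separators."""
    CR = '\r'
    CRLF = '\r\n'
    LF = '\n'


def detect_separator(content: bytes):
    """Classify the first line ending found, or return None if there is none."""
    for index, byte in enumerate(content):
        if byte == 0x0A:  # Bare line feed.
            return Separator.LF
        if byte == 0x0D:  # Carriage return: check for a following line feed.
            if content[index + 1:index + 2] == b'\n':
                return Separator.CRLF
            return Separator.CR
    return None


print(detect_separator(b'line1\nline2\nline3'))
print(detect_separator(b'line1\r\nline2\r\nline3'))
print(detect_separator(b'just one line'))
```

Deciding on the first separator keeps the scan cheap; content with mixed line endings would need a counting approach instead.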

Line Ending Normalization

Universal Normalization

Convert all line endings to Unix format:

>>> import detextive
>>> content = 'Line 1\r\nLine 2\rLine 3\nLine 4'
>>> normalized = detextive.LineSeparators.normalize_universal( content )
>>> print( repr( normalized ) )
'Line 1\nLine 2\nLine 3\nLine 4'

Specific Line Ending Conversion

Normalize one specific line ending (here CRLF) to Unix format:

>>> content = 'First line\r\nSecond line'
>>> result = detextive.LineSeparators.CRLF.normalize( content )
>>> print( repr( result ) )
'First line\nSecond line'

Convert Unix endings to a specific separator (here CRLF):

>>> content = 'First line\nSecond line'
>>> result = detextive.LineSeparators.CRLF.nativize( content )
>>> print( repr( result ) )
'First line\r\nSecond line'
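
Both directions of conversion reduce to ordered string replacement. The sketch below shows the idea with plain functions; `to_lf` and `from_lf` are hypothetical names, and the library's methods may handle edge cases differently:

```python
def to_lf(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF.

    CRLF must be replaced first; otherwise each CRLF would become two LFs.
    """
    return text.replace('\r\n', '\n').replace('\r', '\n')


def from_lf(text: str, separator: str) -> str:
    """Convert LF line endings to the given separator."""
    return text.replace('\n', separator)


print(repr(to_lf('Line 1\r\nLine 2\rLine 3\nLine 4')))
print(repr(from_lf('First line\nSecond line', '\r\n')))
```

Applying `from_lf` to text that already contains CRLF would double the carriage returns, so normalizing with `to_lf` first is the safer order.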

Error Handling

Exception Scenarios

Exception classes are available for handling error conditions:

>>> import detextive
>>> from detextive import exceptions
>>>
>>> print( hasattr( exceptions, 'TextualMimetypeInvalidity' ) )
True

The exception hierarchy follows the standard pattern of a package-wide base exception with a derived base error:

>>> print( issubclass( exceptions.TextualMimetypeInvalidity, exceptions.Omnierror ) )
True
>>> print( issubclass( exceptions.Omnierror, exceptions.Omniexception ) )
True
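
A hierarchy of this shape lets callers catch every package error with a single except clause. The sketch below demonstrates the pattern with local stand-in classes; `PackageException`, `PackageError`, `MimetypeInvalidity`, `require_textual`, and `describe` are all hypothetical, and which detextive functions raise which exceptions is not shown here:

```python
class PackageException(Exception):
    """Stand-in for a package-wide base exception."""


class PackageError(PackageException):
    """Stand-in for a package-wide base error."""


class MimetypeInvalidity(PackageError):
    """Stand-in for an error about a non-textual MIME type."""

    def __init__(self, mimetype: str, location: str):
        super().__init__(
            f"MIME type {mimetype!r} is not textual for {location!r}.")


def require_textual(mimetype: str, location: str) -> str:
    """Hypothetical check that raises for non-textual MIME types."""
    if not mimetype.startswith('text/'):
        raise MimetypeInvalidity(mimetype, location)
    return mimetype


def describe(mimetype: str, location: str) -> str:
    try:
        return require_textual(mimetype, location)
    except PackageError as error:
        # Catching the base error also covers every derived error class.
        return f'rejected: {error}'


print(describe('text/plain', 'notes.txt'))
print(describe('image/jpeg', 'photo.jpg'))
```

Catching the base error class keeps calling code stable when a package later adds more specific exception types.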