Text Processing Examples¶
This section demonstrates practical usage of core text processing capabilities. Examples progress from basic usage to more advanced scenarios including error handling and edge cases.
Character Encoding Detection¶
Basic Encoding Detection¶
Detect character encoding from byte content:
>>> import detextive
>>> content = b'Hello, world!'
>>> encoding = detextive.detect_charset( content )
>>> print( encoding )
utf-8
UTF-8 content is correctly identified:
>>> content = b'Caf\xc3\xa9 \xe2\x98\x85'
>>> encoding = detextive.detect_charset( content )
>>> print( encoding )
utf-8
Empty content returns None
:
>>> content = b''
>>> encoding = detextive.detect_charset( content )
>>> print( encoding )
None
MIME Type Detection¶
Content-Based Detection¶
Detect MIME types using magic numbers and file extensions:
>>> import detextive
>>> from pathlib import Path
>>>
>>> content = b'{"name": "example", "value": 42}'
>>> mimetype = detextive.detect_mimetype( content, 'data.json' )
>>> print( mimetype )
application/json
JPEG image detection using magic numbers:
>>> content = b'\xff\xd8\xff\xe0\x00\x10JFIF'
>>> mimetype = detextive.detect_mimetype( content, 'photo.jpg' )
>>> print( mimetype )
image/jpeg
Extension Fallback¶
When magic number detection fails, extension-based detection is used:
>>> content = b'some content without magic numbers'
>>> mimetype = detextive.detect_mimetype( content, 'document.pdf' )
>>> print( mimetype )
application/pdf
Path objects work as location parameters:
>>> from pathlib import Path
>>> location = Path( 'document.txt' )
>>> content = b'Plain text content for demonstration'
>>> mimetype = detextive.detect_mimetype( content, location )
>>> print( mimetype )
text/plain
Combined Detection¶
Detecting Both MIME Type and Charset¶
Get both MIME type and character encoding in one call:
>>> content = b'<html><body>Hello World</body></html>'
>>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'page.html' )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: text/html, Charset: utf-8
For content with only charset detection:
>>> content = b'Just some plain text content'
>>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'unknown' )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: text/plain, Charset: utf-8
Content with unknown extension but detectable charset defaults to text/plain:
>>> content = b'readable text content without clear file type'
>>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'unknown_file' )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: text/plain, Charset: utf-8
Override Parameters¶
Override detected values using parameter overrides:
>>> content = b'<?xml version="1.0"?><root>data</root>'
>>> mimetype, charset = detextive.detect_mimetype_and_charset(
... content, 'data.xml', charset = 'iso-8859-1'
... )
>>> print( f'MIME: {mimetype}, Charset: {charset}' )
MIME: application/xml, Charset: iso-8859-1
Content Validation¶
MIME Type Validation¶
Check if MIME types represent textual content:
>>> import detextive
>>>
>>> print( detextive.is_textual_mimetype( 'text/plain' ) )
True
>>> print( detextive.is_textual_mimetype( 'text/html' ) )
True
Application types with textual content:
>>> print( detextive.is_textual_mimetype( 'application/json' ) )
True
>>> print( detextive.is_textual_mimetype( 'application/xml' ) )
True
>>> print( detextive.is_textual_mimetype( 'application/javascript' ) )
True
Textual suffixes are recognized:
>>> print( detextive.is_textual_mimetype( 'application/vnd.api+json' ) )
True
>>> print( detextive.is_textual_mimetype( 'application/custom+xml' ) )
True
Non-textual types return False
:
>>> print( detextive.is_textual_mimetype( 'image/jpeg' ) )
False
>>> print( detextive.is_textual_mimetype( 'video/mp4' ) )
False
>>> print( detextive.is_textual_mimetype( 'application/octet-stream' ) )
False
Edge Cases¶
Empty and malformed MIME types:
>>> print( detextive.is_textual_mimetype( '' ) )
False
>>> print( detextive.is_textual_mimetype( 'invalid' ) )
False
Text Reasonableness Testing¶
Validate that byte content represents textual data:
>>> import detextive
>>>
>>> content = b'This is readable text with proper formatting.'
>>> print( detextive.is_textual_content( content ) )
True
Content with acceptable whitespace:
>>> content = b'Line 1\n\tIndented line\nLast line'
>>> print( detextive.is_textual_content( content ) )
True
Rejecting Non-Textual Content¶
Empty content is rejected:
>>> print( detextive.is_textual_content( b'' ) )
False
Non-textual content is rejected:
>>> content = b'\x00\x01\x02\x03\x04\x05'
>>> print( detextive.is_textual_content( content ) )
False
Line Separator Detection¶
Detecting Line Endings¶
Detect line separators from byte content:
>>> import detextive
>>>
>>> content = b'line1\nline2\nline3'
>>> separator = detextive.LineSeparators.detect_bytes( content )
>>> print( separator )
LineSeparators.LF
Windows line endings:
>>> content = b'line1\r\nline2\r\nline3'
>>> separator = detextive.LineSeparators.detect_bytes( content )
>>> print( separator )
LineSeparators.CRLF
No line separators found:
>>> content = b'just one line'
>>> separator = detextive.LineSeparators.detect_bytes( content )
>>> print( separator )
None
Line Ending Normalization¶
Universal Normalization¶
Convert all line endings to Unix format:
>>> import detextive
>>> content = 'Line 1\r\nLine 2\rLine 3\nLine 4'
>>> normalized = detextive.LineSeparators.normalize_universal( content )
>>> print( repr( normalized ) )
'Line 1\nLine 2\nLine 3\nLine 4'
Specific Line Ending Conversion¶
Convert specific line endings:
>>> content = 'First line\r\nSecond line'
>>> result = detextive.LineSeparators.CRLF.normalize( content )
>>> print( repr( result ) )
'First line\nSecond line'
Convert Unix endings to platform-specific:
>>> content = 'First line\nSecond line'
>>> result = detextive.LineSeparators.CRLF.nativize( content )
>>> print( repr( result ) )
'First line\r\nSecond line'
Error Handling¶
Exception Scenarios¶
The exception hierarchy follows standard patterns. Exception classes are available for handling error conditions:
>>> import detextive
>>> from detextive import exceptions
>>>
>>> print( hasattr( exceptions, 'TextualMimetypeInvalidity' ) )
True
The exception hierarchy follows standard patterns:
>>> print( issubclass( exceptions.TextualMimetypeInvalidity, exceptions.Omnierror ) )
True
>>> print( issubclass( exceptions.Omnierror, exceptions.Omniexception ) )
True