.. vim: set fileencoding=utf-8:
.. -*- coding: utf-8 -*-
.. +--------------------------------------------------------------------------+
   |                                                                          |
   | Licensed under the Apache License, Version 2.0 (the "License");          |
   | you may not use this file except in compliance with the License.         |
   | You may obtain a copy of the License at                                  |
   |                                                                          |
   |     http://www.apache.org/licenses/LICENSE-2.0                           |
   |                                                                          |
   | Unless required by applicable law or agreed to in writing, software      |
   | distributed under the License is distributed on an "AS IS" BASIS,        |
   | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
   | See the License for the specific language governing permissions and      |
   | limitations under the License.                                           |
   |                                                                          |
   +--------------------------------------------------------------------------+


*******************************************************************************
Text Processing Examples
*******************************************************************************

This section demonstrates practical usage of the core text processing
capabilities. Examples progress from basic usage to more advanced scenarios,
including error handling and edge cases.


Character Encoding Detection
===============================================================================

Basic Encoding Detection
-------------------------------------------------------------------------------

Detect the character encoding of byte content:

.. doctest:: Detection

    >>> import detextive
    >>> content = b'Hello, world!'
    >>> encoding = detextive.detect_charset( content )
    >>> print( encoding )
    utf-8

Multi-byte UTF-8 content is correctly identified:

.. doctest:: Detection

    >>> content = b'Caf\xc3\xa9 \xe2\x98\x85'
    >>> encoding = detextive.detect_charset( content )
    >>> print( encoding )
    utf-8

Empty content returns ``None``:

.. doctest:: Detection

    >>> content = b''
    >>> encoding = detextive.detect_charset( content )
    >>> print( encoding )
    None


MIME Type Detection
===============================================================================

Content-Based Detection
-------------------------------------------------------------------------------

Detect MIME types using magic numbers and file extensions:

.. doctest:: Detection

    >>> import detextive
    >>>
    >>> content = b'{"name": "example", "value": 42}'
    >>> mimetype = detextive.detect_mimetype( content, 'data.json' )
    >>> print( mimetype )
    application/json

JPEG image detection using magic numbers:

.. doctest:: Detection

    >>> content = b'\xff\xd8\xff\xe0\x00\x10JFIF'
    >>> mimetype = detextive.detect_mimetype( content, 'photo.jpg' )
    >>> print( mimetype )
    image/jpeg

Extension Fallback
-------------------------------------------------------------------------------

When magic number detection fails, extension-based detection is used:

.. doctest:: Detection

    >>> content = b'some content without magic numbers'
    >>> mimetype = detextive.detect_mimetype( content, 'document.pdf' )
    >>> print( mimetype )
    application/pdf

``Path`` objects are also accepted as the location argument:

.. doctest:: Detection

    >>> from pathlib import Path
    >>> location = Path( 'document.txt' )
    >>> content = b'Plain text content for demonstration'
    >>> mimetype = detextive.detect_mimetype( content, location )
    >>> print( mimetype )
    text/plain


Combined Detection
===============================================================================

Detecting Both MIME Type and Charset
-------------------------------------------------------------------------------

Get both the MIME type and the character encoding in one call:

.. doctest:: Detection

    >>> content = b'Hello World'
    >>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'page.html' )
    >>> print( f'MIME: {mimetype}, Charset: {charset}' )
    MIME: text/html, Charset: utf-8

When the location has no recognizable extension, the charset is still detected
from the content:

.. doctest:: Detection

    >>> content = b'Just some plain text content'
    >>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'unknown' )
    >>> print( f'MIME: {mimetype}, Charset: {charset}' )
    MIME: text/plain, Charset: utf-8

Content with an unknown extension but a detectable charset defaults to
``text/plain``:

.. doctest:: Detection

    >>> content = b'readable text content without clear file type'
    >>> mimetype, charset = detextive.detect_mimetype_and_charset( content, 'unknown_file' )
    >>> print( f'MIME: {mimetype}, Charset: {charset}' )
    MIME: text/plain, Charset: utf-8

Override Parameters
-------------------------------------------------------------------------------

Supply a value explicitly to override detection:

.. doctest:: Detection

    >>> content = b'data'
    >>> mimetype, charset = detextive.detect_mimetype_and_charset(
    ...     content, 'data.xml', charset = 'iso-8859-1'
    ... )
    >>> print( f'MIME: {mimetype}, Charset: {charset}' )
    MIME: application/xml, Charset: iso-8859-1


Content Validation
===============================================================================

MIME Type Validation
-------------------------------------------------------------------------------

Check whether MIME types represent textual content:

.. doctest:: Validation

    >>> import detextive
    >>>
    >>> print( detextive.is_textual_mimetype( 'text/plain' ) )
    True
    >>> print( detextive.is_textual_mimetype( 'text/html' ) )
    True

Application types with textual content:

.. doctest:: Validation

    >>> print( detextive.is_textual_mimetype( 'application/json' ) )
    True
    >>> print( detextive.is_textual_mimetype( 'application/xml' ) )
    True
    >>> print( detextive.is_textual_mimetype( 'application/javascript' ) )
    True

Structured syntax suffixes such as ``+json`` and ``+xml`` are recognized as
textual:

.. doctest:: Validation

    >>> print( detextive.is_textual_mimetype( 'application/vnd.api+json' ) )
    True
    >>> print( detextive.is_textual_mimetype( 'application/custom+xml' ) )
    True

Non-textual types return ``False``:

.. doctest:: Validation

    >>> print( detextive.is_textual_mimetype( 'image/jpeg' ) )
    False
    >>> print( detextive.is_textual_mimetype( 'video/mp4' ) )
    False
    >>> print( detextive.is_textual_mimetype( 'application/octet-stream' ) )
    False

Edge Cases
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Empty and malformed MIME types return ``False``:

.. doctest:: Validation

    >>> print( detextive.is_textual_mimetype( '' ) )
    False
    >>> print( detextive.is_textual_mimetype( 'invalid' ) )
    False

Text Reasonableness Testing
-------------------------------------------------------------------------------

Validate that byte content represents textual data:

.. doctest:: Validation

    >>> import detextive
    >>>
    >>> content = b'This is readable text with proper formatting.'
    >>> print( detextive.is_textual_content( content ) )
    True

Content with acceptable whitespace, such as newlines and tabs:

.. doctest:: Validation

    >>> content = b'Line 1\n\tIndented line\nLast line'
    >>> print( detextive.is_textual_content( content ) )
    True

Rejecting Non-Textual Content
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Empty content is rejected:

.. doctest:: Validation

    >>> print( detextive.is_textual_content( b'' ) )
    False

Non-textual content is rejected:

.. doctest:: Validation

    >>> content = b'\x00\x01\x02\x03\x04\x05'
    >>> print( detextive.is_textual_content( content ) )
    False


Line Separator Detection
===============================================================================

Detecting Line Endings
-------------------------------------------------------------------------------

Detect line separators from byte content:

.. doctest:: Detection

    >>> import detextive
    >>>
    >>> content = b'line1\nline2\nline3'
    >>> separator = detextive.LineSeparators.detect_bytes( content )
    >>> print( separator )
    LineSeparators.LF

Windows line endings:

.. doctest:: Detection

    >>> content = b'line1\r\nline2\r\nline3'
    >>> separator = detextive.LineSeparators.detect_bytes( content )
    >>> print( separator )
    LineSeparators.CRLF

Content without any line separator returns ``None``:

.. doctest:: Detection

    >>> content = b'just one line'
    >>> separator = detextive.LineSeparators.detect_bytes( content )
    >>> print( separator )
    None


Line Ending Normalization
===============================================================================

Universal Normalization
-------------------------------------------------------------------------------

Convert all line endings to Unix format:

.. doctest:: Conversion

    >>> import detextive
    >>> content = 'Line 1\r\nLine 2\rLine 3\nLine 4'
    >>> normalized = detextive.LineSeparators.normalize_universal( content )
    >>> print( repr( normalized ) )
    'Line 1\nLine 2\nLine 3\nLine 4'

Specific Line Ending Conversion
-------------------------------------------------------------------------------

Convert one specific line ending to Unix format:

.. doctest:: Conversion

    >>> content = 'First line\r\nSecond line'
    >>> result = detextive.LineSeparators.CRLF.normalize( content )
    >>> print( repr( result ) )
    'First line\nSecond line'

Convert Unix line endings to a specific platform's format:

.. doctest:: Conversion

    >>> content = 'First line\nSecond line'
    >>> result = detextive.LineSeparators.CRLF.nativize( content )
    >>> print( repr( result ) )
    'First line\r\nSecond line'


Error Handling
===============================================================================

Exception Scenarios
-------------------------------------------------------------------------------

Exception classes are available for handling error conditions:

.. doctest:: Detection

    >>> import detextive
    >>> from detextive import exceptions
    >>>
    >>> print( hasattr( exceptions, 'TextualMimetypeInvalidity' ) )
    True

The exception hierarchy follows standard patterns:

.. doctest:: Detection

    >>> print( issubclass( exceptions.TextualMimetypeInvalidity, exceptions.Omnierror ) )
    True
    >>> print( issubclass( exceptions.Omnierror, exceptions.Omniexception ) )
    True
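
A practical benefit of such a hierarchy is that callers can catch a single
base class instead of listing every specific error. The following is a minimal
sketch of that pattern, not the library's implementation. The three classes
are local stand-ins that only mirror the names and subclass relationships
checked above (deriving the root from ``Exception`` is an assumption), and
``require_textual`` and ``describe`` are hypothetical helpers invented for
illustration. Which ``detextive`` functions raise these exceptions is not
covered here.

.. code-block:: python

    # Local stand-ins so that the sketch runs without the library installed.
    # The real classes live in ``detextive.exceptions``.
    class Omniexception( Exception ):
        ''' Stand-in root of the package exception hierarchy. '''

    class Omnierror( Omniexception ):
        ''' Stand-in base for error conditions. '''

    class TextualMimetypeInvalidity( Omnierror ):
        ''' Stand-in for a MIME type which is not textual. '''


    def require_textual( mimetype ):
        ''' Hypothetical helper: accepts only ``text/*`` MIME types. '''
        if not mimetype.startswith( 'text/' ):
            raise TextualMimetypeInvalidity( mimetype )
        return mimetype


    def describe( mimetype ):
        ''' Catches the shared base class rather than each subclass. '''
        try:
            return f"accepted {require_textual( mimetype )}"
        except Omnierror as exc:
            return f"rejected {exc} ({type( exc ).__name__})"


    print( describe( 'text/plain' ) )
    print( describe( 'image/png' ) )

Because the handler names only ``Omnierror``, it would keep working if further
error subclasses were added beneath it, while ``type( exc ).__name__`` still
reports the specific class that was raised.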