detextive
🕵️ A Python library which provides consolidated text detection capabilities for reliable content analysis. Offers MIME type detection, character set detection, and line separator processing.
Key Features ⭐
- 🔍 MIME Type Detection
Intelligent content-based detection using magic bytes with file extension fallback for comprehensive format identification.
- 📝 Character Encoding Detection
Statistical analysis with UTF-8 optimization and validation through decode operations for reliable text processing.
- 📄 Line Separator Processing
Cross-platform line ending detection and normalization supporting CR, LF, and CRLF formats with mixed-content handling.
- ✅ Textual Content Validation
Smart classification of MIME types and content reasonableness assessment using control character and printability heuristics.
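The content-first strategy described for MIME type detection (magic bytes, then file extension) can be sketched with the standard library alone. This is not detextive's implementation: the `sketch_detect_mimetype` helper and its small signature table are hypothetical and exist only to illustrate the idea.

```python
import mimetypes

# Illustrative signature table (an assumption for this sketch); a real
# detector recognizes far more formats than these.
_MAGIC_SIGNATURES = (
    ( b'\x89PNG\r\n\x1a\n', 'image/png' ),
    ( b'\xff\xd8\xff', 'image/jpeg' ),
    ( b'%PDF-', 'application/pdf' ),
    ( b'GIF87a', 'image/gif' ),
    ( b'GIF89a', 'image/gif' ),
)

def sketch_detect_mimetype( content, location = None ):
    ''' Hypothetical helper: checks magic bytes, then the file extension. '''
    # Leading bytes are stronger evidence than a file name, so check first.
    for signature, mimetype in _MAGIC_SIGNATURES:
        if content.startswith( signature ): return mimetype
    # Fall back to the extension when the content has no known signature.
    if location is not None:
        guessed, _ = mimetypes.guess_type( location )
        if guessed is not None: return guessed
    return 'application/octet-stream'
```

In this sketch a PDF saved under a misleading name such as `report.txt` is still reported as `application/pdf`, because the content check runs before the extension fallback.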
Installation 📦
Method: Install Python Package
Install via the uv pip command:
uv pip install detextive
Or, install via pip:
pip install detextive
Examples 💡
Basic Usage
MIME Type and Charset Detection:
Load your content as bytes:
import detextive
with open( 'document.txt', 'rb' ) as file:
    content = file.read( )
You can detect MIME type and charset individually:
mimetype = detextive.detect_mimetype( content, location = 'document.txt' )
charset = detextive.detect_charset( content )
Or use combined inference for better accuracy:
mimetype, charset = detextive.infer_mimetype_charset(
    content, location = 'document.txt' )
print( "Detected: {mimetype} with {charset} encoding".format(
    mimetype = mimetype, charset = charset ) )
Line Separator Processing:
Detect line separators in mixed content:
import detextive
content = 'Line 1\r\nLine 2\rLine 3\n'
separator = detextive.LineSeparators.detect_bytes( content.encode( ) )
Normalize line separators to Python standard:
normalized = detextive.LineSeparators.normalize_universal( content )
Convert to platform-specific line separators:
native = detextive.LineSeparators.CRLF.nativize( normalized )
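For readers who want to see what separator detection and normalization involve, here is a minimal standard-library sketch. The `sketch_detect_separator` and `sketch_normalize` helpers are hypothetical and are not part of detextive; in particular, reporting the first separator found is an assumption of this sketch, and the library may resolve mixed content differently.

```python
def sketch_detect_separator( content ):
    ''' Hypothetical helper: returns the first line separator in bytes. '''
    for index, byte in enumerate( content ):
        if byte == 0x0A: return '\n'
        if byte == 0x0D:
            # A CR immediately followed by LF is one CRLF separator.
            if content[ index + 1 : index + 2 ] == b'\n': return '\r\n'
            return '\r'
    return None  # No separator present.

def sketch_normalize( text ):
    ''' Hypothetical helper: converts CRLF and lone CR to LF. '''
    # CRLF must be replaced first; otherwise each CRLF becomes two LFs.
    return text.replace( '\r\n', '\n' ).replace( '\r', '\n' )
```

The order of the two replacements is the one detail that is easy to get wrong: replacing lone CR first would turn every CRLF into a blank line.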
Content Classification:
Check if MIME types represent textual content:
import detextive
detextive.is_textual_mimetype( 'application/json' ) # True
detextive.is_textual_mimetype( 'image/jpeg' ) # False
Validate that decoded text content is reasonable:
text = "Hello world!"
detextive.is_valid_text( text ) # True
Binary data that might decode as text but isn’t valid fails validation:
binary_as_text = "Config file\x00\x00\x00data"
detextive.is_valid_text( binary_as_text ) # False
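A control-character and printability heuristic of the kind mentioned under Key Features might look roughly like the sketch below. The `sketch_is_reasonable_text` helper, its whitespace allowance, and its 0.85 threshold are all assumptions made for illustration; the actual rules detextive applies may differ.

```python
import unicodedata

def sketch_is_reasonable_text( text, printable_ratio_min = 0.85 ):
    ''' Hypothetical heuristic: rejects control characters, checks ratio. '''
    if not text: return False  # Arbitrary choice for this sketch.
    allowed_controls = '\t\n\r'
    for character in text:
        # Category 'Cc' covers C0/C1 control characters such as NUL.
        if (    unicodedata.category( character ) == 'Cc'
            and character not in allowed_controls
        ): return False
    printable = sum(
        1 for character in text
        if character.isprintable( ) or character in allowed_controls )
    return printable / len( text ) >= printable_ratio_min
```

Under these assumptions, the NUL bytes in the `binary_as_text` example are enough to reject the string, while ordinary text with tabs and newlines passes.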
High-Level Decoding:
For complete bytes-to-text processing with automatic charset detection and validation:
import detextive
with open( 'document.txt', 'rb' ) as file:
    content = file.read( )
text = detextive.decode( content, location = 'document.txt' )
print( f"Decoded text: {text}" )
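The idea of validating a charset through a decode operation can be illustrated with a simple trial-decode loop. This is a sketch of the general technique, not of how `detextive.decode` works internally; the `sketch_decode` helper and its candidate list are hypothetical.

```python
def sketch_decode( content, candidates = ( 'utf-8', 'cp1252' ) ):
    ''' Hypothetical helper: returns ( text, charset ) via trial decodes. '''
    for charset in candidates:
        # A strict decode that succeeds is evidence the charset is plausible.
        try: return content.decode( charset ), charset
        except UnicodeDecodeError: continue
    # Latin-1 maps every byte to a code point, so this fallback cannot fail.
    return content.decode( 'latin-1' ), 'latin-1'
```

UTF-8 is tried first because its strict structure makes accidental success on other encodings unlikely, whereas single-byte encodings accept almost any input and so carry less evidential weight.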
Contribution 🤝
Contributions to this project are welcome, provided they follow the project's code of conduct.
Please file bug reports and feature requests in the issue tracker or submit pull requests to improve the source code or documentation.
For development guidance and standards, please see the development guide.