.. vim: set fileencoding=utf-8:
.. -*- coding: utf-8 -*-
.. +--------------------------------------------------------------------------+
   |                                                                          |
   | Licensed under the Apache License, Version 2.0 (the "License");          |
   | you may not use this file except in compliance with the License.         |
   | You may obtain a copy of the License at                                  |
   |                                                                          |
   |     http://www.apache.org/licenses/LICENSE-2.0                           |
   |                                                                          |
   | Unless required by applicable law or agreed to in writing, software      |
   | distributed under the License is distributed on an "AS IS" BASIS,        |
   | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
   | See the License for the specific language governing permissions and      |
   | limitations under the License.                                           |
   |                                                                          |
   +--------------------------------------------------------------------------+


*******************************************************************************
Test Content Patterns Specification
*******************************************************************************


Overview
===============================================================================

This document specifies a centralized test content patterns module providing
curated byte sequences for comprehensive testing without filesystem
dependencies. The patterns support systematic testing of charset detection,
MIME type detection, validation, and cross-platform compatibility scenarios.


Module Structure
===============================================================================

Location: ``tests/test_000_detextive/patterns.py``

The patterns module provides categorized byte sequences with known expected
outcomes for deterministic testing across all detection components.


Charset Detection Patterns
-------------------------------------------------------------------------------

**UTF-8 Samples**::

    UTF8_BASIC = b'Hello, world!'
    UTF8_WITH_BOM = b'\xef\xbb\xbfHello, world!'
    UTF8_EMOJI = b'Hello \xf0\x9f\x91\x8b world!'
    UTF8_MULTIBYTE = b'Caf\xc3\xa9 na\xc3\xafve r\xc3\xa9sum\xc3\xa9'
    UTF8_ACCENTED = b'\xc3\xa9\xc3\xa8\xc3\xa0\xc3\xa7'

**ASCII-Compatible Samples**::

    ASCII_BASIC = b'Simple ASCII text without special characters'
    ASCII_PRINTABLE = b'!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
    ASCII_WHITESPACE = b'Line 1\n\tIndented line\r\nWindows line'

**Latin-1 Samples**::

    LATIN1_BASIC = b'Caf\xe9 na\xefve r\xe9sum\xe9'  # ISO-8859-1
    LATIN1_EXTENDED = b'\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf'

**Windows-1252 Samples**::

    CP1252_QUOTES = b'\x93smart quotes\x94 and \x96dashes\x97'
    CP1252_CURRENCY = b'Price: \x80 12.99'  # Euro symbol

**Ambiguous Content**::

    AMBIGUOUS_ASCII = b'This could be any ASCII-compatible charset'
    AMBIGUOUS_LATIN = b'\xe9\xe8\xe0'  # Could be Latin-1 or CP1252

**Malformed Content**::

    INVALID_UTF8 = b'\xff\xfe\xfd'  # Invalid UTF-8 sequences
    TRUNCATED_UTF8 = b'Valid start \xc3'  # Incomplete multibyte
    MIXED_ENCODING = b'ASCII \xc3\xa9 then \xe9'  # Mixed UTF-8/Latin-1


MIME Type Detection Patterns
-------------------------------------------------------------------------------

**Text Content**::

    TEXT_PLAIN = b'This is plain text content for testing purposes.'
    TEXT_HTML = b'<html><head><title>Test</title></head><body>Content</body></html>'
    TEXT_CSS = b'body { margin: 0; padding: 0; background: #fff; }'
    TEXT_JAVASCRIPT = b'function test() { return "hello world"; }'
    TEXT_XML = b'<?xml version="1.0"?><root><item>value</item></root>'

**JSON Content**::

    JSON_SIMPLE = b'{"key": "value", "number": 42, "array": [1, 2, 3]}'
    # Raw bytes literal: the JSON \uXXXX escapes must reach the parser intact
    JSON_UNICODE = rb'{"message": "\u00c9\u00e9\u00e8\u00e0", "emoji": "\ud83d\udc4b"}'
    JSON_NESTED = b'{"outer": {"inner": {"deep": "value"}}, "list": [{"item": 1}]}'

**Binary Content with Magic Bytes**::

    # Image formats
    JPEG_HEADER = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00'
    PNG_HEADER = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01'
    GIF_HEADER = b'GIF89a\x01\x00\x01\x00\x00\x00\x00'

    # Archive and document formats
    ZIP_HEADER = b'PK\x03\x04\x14\x00\x00\x00\x08\x00'
    PDF_HEADER = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'

    # Executable formats
    PE_HEADER = b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff'
    ELF_HEADER = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'

**Cross-Platform Considerations**::

    # Content that python-magic vs python-magic-bin detect differently
    JSON_AMBIGUOUS = b'{"data": "value"}'  # May be application/json or text/plain
    XML_SIMPLE = b'<root>content</root>'  # May vary by platform


Line Separator Patterns
-------------------------------------------------------------------------------

**Platform-Specific Line Endings**::

    UNIX_LINES = b'line1\nline2\nline3\n'
    WINDOWS_LINES = b'line1\r\nline2\r\nline3\r\n'
    MAC_CLASSIC_LINES = b'line1\rline2\rline3\r'

**Mixed Line Endings**::

    MIXED_UNIX_WINDOWS = b'line1\nline2\r\nline3\n'
    MIXED_ALL_TYPES = b'line1\nline2\r\nline3\rline4\n'
    CONSECUTIVE_SEPARATORS = b'line1\n\nline2\r\n\r\nline3'

**Edge Cases**::

    NO_LINE_ENDINGS = b'single line without any separators'
    ONLY_SEPARATORS = b'\n\r\n\r'
    CR_NOT_CRLF = b'line1\rX\rline2'  # CR followed by non-LF


Content Length Patterns
-------------------------------------------------------------------------------

**Confidence Testing**::

    EMPTY_CONTENT = b''
    MINIMAL_CONTENT = b'a'
    SHORT_CONTENT = b'Short content for low confidence testing'
    MEDIUM_CONTENT = b'A' * 512  # Half of default confidence divisor
    LONG_CONTENT = b'A' * 1024  # Full confidence threshold
    VERY_LONG_CONTENT = b'A' * 2048  # Above confidence threshold

**Repeated Patterns**::

    REPEATED_CHAR = b'a' * 100
    REPEATED_SEQUENCE = b'abc' * 100
    REPEATED_UTF8 = b'\xc3\xa9' * 100  # Repeated é


Validation Patterns
-------------------------------------------------------------------------------

**Textual Content**::

    REASONABLE_TEXT = b'This is reasonable text with proper punctuation.'
    WHITESPACE_HEAVY = b' \t\n\r \t\n\r '
    CONTROL_CHARS = b'\x01\x02\x03\x04\x05'
    MIXED_REASONABLE = b'Normal text \x09 with some \x0a control chars'

**Non-Textual Content**::

    BINARY_DATA = bytes(range(256))  # All possible byte values
    NULL_HEAVY = b'\x00' * 50
    HIGH_BYTES = bytes(range(128, 256))


Error Condition Patterns
-------------------------------------------------------------------------------

**Detection Failure Scenarios**::

    UNDETECTABLE_CHARSET = b'\x80\x81\x82\x83'  # Ambiguous bytes
    UNDETECTABLE_MIMETYPE = b'UNKN\x00\x01\x02\x03'  # No clear magic
    CONFLICTING_INDICATORS = b'{\x80\x81\x82\x83}'  # JSON-like but invalid UTF-8

**Exception Trigger Patterns**::

    DECODE_FAILURE_UTF8 = b'Valid start \xff\xfe then invalid'
    # Fails as UTF-8 and most other charsets; Latin-1 decodes any byte sequence
    DECODE_FAILURE_LATIN1 = b'\xff\xfe\xfd'


Location Context Patterns
-------------------------------------------------------------------------------

**File Extension Hints**::

    EXTENSIONS = {
        'text': ['.txt', '.log', '.md', '.rst'],
        'code': ['.py', '.js', '.css', '.html', '.xml'],
        'data': ['.json', '.csv', '.yaml', '.toml'],
        'binary': ['.jpg', '.png', '.pdf', '.zip', '.exe'],
        'ambiguous': ['.bin', '.dat', '.tmp', ''],
    }

**URL Context Patterns**::

    URLS = [
        'http://example.com/document.txt',
        'https://api.example.com/data.json',
        'file:///path/to/local/file.py',
        '/absolute/path/file.log',
        'relative/path/file.md',
    ]


Windows Compatibility Patterns
-------------------------------------------------------------------------------

**Python-Magic vs Python-Magic-Bin Differences**::

    # Content that detects differently on Windows vs Unix
    JSON_PLATFORM_VARIANT = b'{"test": "data"}'
    # Expected: application/json (Unix) vs text/plain (Windows)

    XML_PLATFORM_VARIANT = b'<?xml version="1.0"?><root>data</root>'
    # Expected: application/xml (Unix) vs text/xml (Windows)

**Cygwin-Specific Considerations**::

    LARGE_CONTENT = b'A' * 10000  # Test buffer handling
    UNICODE_HEAVY = 'Test with unicode: ' + '🌟' * 100
    UNICODE_HEAVY_BYTES = UNICODE_HEAVY.encode('utf-8')


Pattern Metadata
===============================================================================

Each pattern includes metadata for expected outcomes::

    PATTERN_METADATA = {
        'UTF8_BASIC': {
            'expected_charset': 'utf-8',
            'expected_mimetype': 'text/plain',
            'confidence_minimum': 0.8,
            'is_textual': True,
            'line_separator': None,
        },
        'JPEG_HEADER': {
            'expected_charset': None,
            'expected_mimetype': 'image/jpeg',
            'confidence_minimum': 0.9,
            'is_textual': False,
            'line_separator': None,
        },
        # ... Additional metadata for all patterns
    }


Usage Guidelines
===============================================================================

**Test Pattern Selection**::

    # Import patterns in test modules
    from .patterns import UTF8_BASIC, JPEG_HEADER, PATTERN_METADATA

    # detect_charset is provided by the package under test (import omitted)

    # Use with expected outcomes
    def test_charset_detection():
        result = detect_charset(UTF8_BASIC)
        expected = PATTERN_METADATA['UTF8_BASIC']['expected_charset']
        assert result == expected

**Cross-Platform Testing**::

    from .patterns import JSON_PLATFORM_VARIANT

    # Use platform variants for Windows compatibility
    def test_json_detection_cross_platform():
        result = detect_mimetype(JSON_PLATFORM_VARIANT)
        # Accept either Unix or Windows detection
        assert result in ['application/json', 'text/plain']

**Property-Based Testing Integration**::

    from hypothesis import given, strategies as st

    from .patterns import ASCII_BASIC, LATIN1_BASIC, UTF8_BASIC

    # Combine with hypothesis for edge case generation
    @given(content=st.sampled_from([UTF8_BASIC, LATIN1_BASIC, ASCII_BASIC]))
    def test_charset_detection_deterministic(content):
        result1 = detect_charset(content)
        result2 = detect_charset(content)
        assert result1 == result2


Implementation Notes
===============================================================================

- Content patterns are defined as module-level ``bytes`` constants; location
  context patterns (``EXTENSIONS``, ``URLS``) and ``UNICODE_HEAVY`` are
  collections of strings or a string
- The ``PATTERN_METADATA`` dictionary provides expected outcomes for assertions
- Patterns cover both positive cases (successful detection) and negative cases
  (detection failures)
- Cross-platform variants account for python-magic vs python-magic-bin
  differences
- Content length patterns enable confidence scoring validation
- Location patterns support context-aware detection testing
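
Because the patterns are plain byte constants, their documented properties can
be checked without any detector at all. The following is a minimal sketch of
such a self-check, assuming only the standard library; the helper name
``decodes_as`` is hypothetical and not part of the specified module, and the
constants are copied from the charset sections of this document::

    def decodes_as(content: bytes, charset: str) -> bool:
        """Return True if content decodes cleanly under charset."""
        try:
            content.decode(charset)
        except UnicodeDecodeError:
            return False
        return True

    # Constants copied from the charset pattern sections.
    UTF8_EMOJI = b'Hello \xf0\x9f\x91\x8b world!'
    LATIN1_BASIC = b'Caf\xe9 na\xefve r\xe9sum\xe9'
    CP1252_QUOTES = b'\x93smart quotes\x94 and \x96dashes\x97'
    TRUNCATED_UTF8 = b'Valid start \xc3'
    MIXED_ENCODING = b'ASCII \xc3\xa9 then \xe9'

    # Positive cases: content is valid in the charset its name claims.
    assert decodes_as(UTF8_EMOJI, 'utf-8')
    assert decodes_as(LATIN1_BASIC, 'latin-1')
    assert CP1252_QUOTES.decode('cp1252') == (
        '\u201csmart quotes\u201d and \u2013dashes\u2014')

    # Negative cases: malformed or mismatched content must fail as UTF-8.
    assert not decodes_as(LATIN1_BASIC, 'utf-8')
    assert not decodes_as(TRUNCATED_UTF8, 'utf-8')
    assert not decodes_as(MIXED_ENCODING, 'utf-8')

A check of this kind would catch a mistyped escape in a pattern before it shows
up as a confusing detector failure, although it says nothing about what any
particular detection backend will report.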