Test Content Patterns Specification

Overview

This document specifies a centralized test content patterns module: a curated set of byte sequences that lets tests run without touching the filesystem. The patterns support systematic testing of charset detection, MIME type detection, content validation, and cross-platform compatibility.

Module Structure

Location: tests/test_000_detextive/patterns.py

The patterns module provides categorized byte sequences with known expected outcomes for deterministic testing across all detection components.

Charset Detection Patterns

UTF-8 Samples:

UTF8_BASIC = b'Hello, world!'
UTF8_WITH_BOM = b'\xef\xbb\xbfHello, world!'
UTF8_EMOJI = b'Hello \xf0\x9f\x91\x8b world!'
UTF8_MULTIBYTE = b'Caf\xc3\xa9 na\xc3\xafve r\xc3\xa9sum\xc3\xa9'
UTF8_ACCENTED = b'\xc3\xa9\xc3\xa8\xc3\xa0\xc3\xa7'

ASCII-Compatible Samples:

ASCII_BASIC = b'Simple ASCII text without special characters'
ASCII_PRINTABLE = b'!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
ASCII_WHITESPACE = b'Line 1\n\tIndented line\r\nWindows line'

Latin-1 Samples:

LATIN1_BASIC = b'Caf\xe9 na\xefve r\xe9sum\xe9'  # ISO-8859-1
LATIN1_EXTENDED = b'\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf'

Windows-1252 Samples:

CP1252_QUOTES = b'\x93smart quotes\x94 and \x96dashes\x97'
CP1252_CURRENCY = b'Price: \x80 12.99'  # Euro symbol

Ambiguous Content:

AMBIGUOUS_ASCII = b'This could be any ASCII-compatible charset'
AMBIGUOUS_LATIN = b'\xe9\xe8\xe0'  # Could be Latin-1 or CP1252

Malformed Content:

INVALID_UTF8 = b'\xff\xfe\xfd'  # Invalid UTF-8 sequences
TRUNCATED_UTF8 = b'Valid start \xc3'  # Incomplete multibyte
MIXED_ENCODING = b'ASCII \xc3\xa9 then \xe9'  # Mixed UTF-8/Latin-1
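To make the intent of the charset samples concrete, the sketch below checks which standard codecs accept a few of them. It uses only the Python standard library; the helper names decodes_as and accepted_charsets are hypothetical and not part of the specified module, and a real detector would weigh more evidence than strict decodability.

```python
# Hypothetical helpers: report which codecs strictly decode a sample.
SAMPLES = {
    'UTF8_MULTIBYTE': b'Caf\xc3\xa9 na\xc3\xafve r\xc3\xa9sum\xc3\xa9',
    'LATIN1_BASIC': b'Caf\xe9 na\xefve r\xe9sum\xe9',
    'AMBIGUOUS_LATIN': b'\xe9\xe8\xe0',
    'INVALID_UTF8': b'\xff\xfe\xfd',
    'TRUNCATED_UTF8': b'Valid start \xc3',
    'MIXED_ENCODING': b'ASCII \xc3\xa9 then \xe9',
}


def decodes_as(content: bytes, charset: str) -> bool:
    """Return True if content decodes under charset without errors."""
    try:
        content.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


def accepted_charsets(content, candidates=('ascii', 'utf-8', 'cp1252', 'latin-1')):
    """List the candidate charsets that strictly decode the content."""
    return [name for name in candidates if decodes_as(content, name)]


if __name__ == '__main__':
    for name, sample in SAMPLES.items():
        print(name, accepted_charsets(sample))
```

Under this check, AMBIGUOUS_LATIN is rejected by UTF-8 but accepted by both cp1252 and latin-1, so decoding alone cannot resolve it. The malformed samples are all rejected by UTF-8 while still decoding under the single-byte codecs.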

MIME Type Detection Patterns

Text Content:

TEXT_PLAIN = b'This is plain text content for testing purposes.'
TEXT_HTML = b'<html><head><title>Test</title></head><body>Content</body></html>'
TEXT_CSS = b'body { margin: 0; padding: 0; background: #fff; }'
TEXT_JAVASCRIPT = b'function test() { return "hello world"; }'
TEXT_XML = b'<?xml version="1.0"?><root><element>value</element></root>'

JSON Content:

JSON_SIMPLE = b'{"key": "value", "number": 42, "array": [1, 2, 3]}'
JSON_UNICODE = rb'{"message": "\u00c9\u00e9\u00e8\u00e0", "emoji": "\ud83d\udc4b"}'  # Raw literal: bytes literals do not process \u, so the JSON escapes are kept verbatim
JSON_NESTED = b'{"outer": {"inner": {"deep": "value"}}, "list": [{"item": 1}]}'

Binary Content with Magic Bytes:

# Image formats
JPEG_HEADER = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00'
PNG_HEADER = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01'
GIF_HEADER = b'GIF89a\x01\x00\x01\x00\x00\x00\x00'

# Archive and document formats
ZIP_HEADER = b'PK\x03\x04\x14\x00\x00\x00\x08\x00'
PDF_HEADER = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'

# Executable formats
PE_HEADER = b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff'
ELF_HEADER = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
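The headers above are recognizable by their leading signature bytes. As a rough illustration of what a magic-byte check looks for, the sketch below matches prefixes against a small table. The function sniff_format and the short format labels are hypothetical; they stand in for whatever MIME types the detection backend actually reports, which this specification does not fix for every format.

```python
from typing import Optional

# Leading signature bytes mapped to an illustrative format label.
SIGNATURES = (
    (b'\xff\xd8\xff', 'jpeg'),
    (b'\x89PNG\r\n\x1a\n', 'png'),
    (b'GIF87a', 'gif'),
    (b'GIF89a', 'gif'),
    (b'PK\x03\x04', 'zip'),
    (b'%PDF-', 'pdf'),
    (b'MZ', 'pe'),
    (b'\x7fELF', 'elf'),
)


def sniff_format(content: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for signature, label in SIGNATURES:
        if content.startswith(signature):
            return label
    return None


if __name__ == '__main__':
    png = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01'
    print(sniff_format(png))
    print(sniff_format(b'UNKN\x00\x01\x02\x03'))
```

A prefix table like this is far less thorough than libmagic, but it shows why each header constant only needs its first few bytes to be realistic.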

Cross-Platform Considerations:

# Content that python-magic vs python-magic-bin detect differently
JSON_AMBIGUOUS = b'{"data": "value"}'  # May be application/json or text/plain
XML_SIMPLE = b'<note><body>content</body></note>'  # May vary by platform

Line Separator Patterns

Platform-Specific Line Endings:

UNIX_LINES = b'line1\nline2\nline3\n'
WINDOWS_LINES = b'line1\r\nline2\r\nline3\r\n'
MAC_CLASSIC_LINES = b'line1\rline2\rline3\r'

Mixed Line Endings:

MIXED_UNIX_WINDOWS = b'line1\nline2\r\nline3\n'
MIXED_ALL_TYPES = b'line1\nline2\r\nline3\rline4\n'
CONSECUTIVE_SEPARATORS = b'line1\n\nline2\r\n\r\nline3'

Edge Cases:

NO_LINE_ENDINGS = b'single line without any separators'
ONLY_SEPARATORS = b'\n\r\n\r'
CR_NOT_CRLF = b'line1\rX\rline2'  # CR followed by non-LF
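The mixed and edge-case samples only make sense if CRLF is counted as one separator rather than a CR plus an LF. The following sketch, using a hypothetical helper count_line_separators, shows one way to tally the three separator kinds so that expected counts for these patterns can be written down.

```python
import re
from collections import Counter

# CRLF is listed first so it is never split into CR + LF.
_SEPARATOR = re.compile(rb'\r\n|\r|\n')
_NAMES = {b'\r\n': 'CRLF', b'\r': 'CR', b'\n': 'LF'}


def count_line_separators(content: bytes) -> Counter:
    """Count CRLF, CR, and LF occurrences in content."""
    return Counter(_NAMES[m.group()] for m in _SEPARATOR.finditer(content))


if __name__ == '__main__':
    print(count_line_separators(b'line1\nline2\r\nline3\rline4\n'))
    print(count_line_separators(b'\n\r\n\r'))
```

With this tally, MIXED_ALL_TYPES contains two LF, one CRLF, and one CR, while ONLY_SEPARATORS contains exactly one of each.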

Content Length Patterns

Confidence Testing:

EMPTY_CONTENT = b''
MINIMAL_CONTENT = b'a'
SHORT_CONTENT = b'Short content for low confidence testing'
MEDIUM_CONTENT = b'A' * 512  # Half of default confidence divisor
LONG_CONTENT = b'A' * 1024   # Full confidence threshold
VERY_LONG_CONTENT = b'A' * 2048  # Above confidence threshold
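The comments above imply a default confidence divisor of 1024 bytes. Assuming confidence scales linearly with content length and is capped at 1.0 (an assumption; the actual scoring formula is not given in this specification), the lengths map to confidence values as sketched below. The names DEFAULT_DIVISOR and length_confidence are hypothetical.

```python
DEFAULT_DIVISOR = 1024  # Assumed from the "half of default divisor" comment


def length_confidence(content: bytes, divisor: int = DEFAULT_DIVISOR) -> float:
    """Linear length-based confidence, capped at 1.0 (illustrative only)."""
    return min(len(content) / divisor, 1.0)


if __name__ == '__main__':
    for size in (0, 1, 512, 1024, 2048):
        print(size, length_confidence(b'A' * size))
```

Under that assumption, EMPTY_CONTENT scores 0.0, MEDIUM_CONTENT scores 0.5, and both LONG_CONTENT and VERY_LONG_CONTENT score 1.0, which is what makes the four sizes useful boundary cases.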

Repeated Patterns:

REPEATED_CHAR = b'a' * 100
REPEATED_SEQUENCE = b'abc' * 100
REPEATED_UTF8 = b'\xc3\xa9' * 100  # Repeated é

Validation Patterns

Textual Content:

REASONABLE_TEXT = b'This is reasonable text with proper punctuation.'
WHITESPACE_HEAVY = b'   \t\n\r   \t\n\r   '
CONTROL_CHARS = b'\x01\x02\x03\x04\x05'
MIXED_REASONABLE = b'Normal text \x09 with some \x0a control chars'

Non-Textual Content:

BINARY_DATA = bytes(range(256))  # All possible byte values
NULL_HEAVY = b'\x00' * 50
HIGH_BYTES = bytes(range(128, 256))
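One way to see how the validation samples differ is to measure the share of printable characters after decoding. The sketch below is a hypothetical heuristic, not the validation rule used by the package: printable_ratio is an invented name, UTF-8 is an assumed default charset, and treating tab, LF, and CR as acceptable is a design choice made here for illustration.

```python
from typing import Optional


def printable_ratio(content: bytes, charset: str = 'utf-8') -> Optional[float]:
    """Share of printable or common-whitespace characters.

    Returns None when the content is empty or does not decode.
    """
    try:
        text = content.decode(charset)
    except UnicodeDecodeError:
        return None
    if not text:
        return None
    acceptable = sum(1 for ch in text if ch.isprintable() or ch in '\t\n\r')
    return acceptable / len(text)


if __name__ == '__main__':
    print(printable_ratio(b'This is reasonable text with proper punctuation.'))
    print(printable_ratio(b'\x01\x02\x03\x04\x05'))
    print(printable_ratio(bytes(range(128, 256))))
```

With this heuristic, REASONABLE_TEXT and MIXED_REASONABLE score 1.0, CONTROL_CHARS and NULL_HEAVY score 0.0, and BINARY_DATA and HIGH_BYTES fail to decode as UTF-8 at all.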

Error Condition Patterns

Detection Failure Scenarios:

UNDETECTABLE_CHARSET = b'\x80\x81\x82\x83'  # Ambiguous bytes
UNDETECTABLE_MIMETYPE = b'UNKN\x00\x01\x02\x03'  # No clear magic
CONFLICTING_INDICATORS = b'{\x80\x81\x82\x83}'  # JSON-like but invalid UTF-8

Exception Trigger Patterns:

DECODE_FAILURE_UTF8 = b'Valid start \xff\xfe then invalid'
DECODE_FAILURE_LATIN1 = b'\xff\xfe\xfd'  # Fails as UTF-8 and ASCII; Latin-1 itself decodes any byte sequence

Location Context Patterns

File Extension Hints:

EXTENSIONS = {
    'text': ['.txt', '.log', '.md', '.rst'],
    'code': ['.py', '.js', '.css', '.html', '.xml'],
    'data': ['.json', '.csv', '.yaml', '.toml'],
    'binary': ['.jpg', '.png', '.pdf', '.zip', '.exe'],
    'ambiguous': ['.bin', '.dat', '.tmp', ''],
}

URL Context Patterns:

URLS = [
    'http://example.com/document.txt',
    'https://api.example.com/data.json',
    'file:///path/to/local/file.py',
    '/absolute/path/file.log',
    'relative/path/file.md',
]
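The location patterns are meant to be combined with the extension hints. As a sketch of how a test might derive a hint category from a URL or path, the code below uses the standard library's urlsplit and PurePosixPath. The helper names location_extension and location_category are hypothetical, and the EXTENSIONS mapping is copied from above so the example is self-contained.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit

EXTENSIONS = {
    'text': ['.txt', '.log', '.md', '.rst'],
    'code': ['.py', '.js', '.css', '.html', '.xml'],
    'data': ['.json', '.csv', '.yaml', '.toml'],
    'binary': ['.jpg', '.png', '.pdf', '.zip', '.exe'],
    'ambiguous': ['.bin', '.dat', '.tmp', ''],
}


def location_extension(location: str) -> str:
    """Return the lowercased extension of a URL or path ('' if none)."""
    path = urlsplit(location).path
    return PurePosixPath(path).suffix.lower()


def location_category(location: str) -> Optional[str]:
    """Return the EXTENSIONS category for a location, or None if unlisted."""
    extension = location_extension(location)
    for category, extensions in EXTENSIONS.items():
        if extension in extensions:
            return category
    return None


if __name__ == '__main__':
    for url in ('https://api.example.com/data.json', 'relative/path/file.md'):
        print(url, location_extension(url), location_category(url))
```

Because the empty string is listed under 'ambiguous', a location with no extension falls into that category rather than returning None.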

Windows Compatibility Patterns

Python-Magic vs Python-Magic-Bin Differences:

# Content that detects differently on Windows vs Unix
JSON_PLATFORM_VARIANT = b'{"test": "data"}'
# Expected: application/json (Unix) vs text/plain (Windows)

XML_PLATFORM_VARIANT = b'<test>data</test>'
# Expected: application/xml (Unix) vs text/xml (Windows)

Cygwin-Specific Considerations:

LARGE_CONTENT = b'A' * 10000  # Test buffer handling
UNICODE_HEAVY = 'Test with unicode: ' + '🌟' * 100
UNICODE_HEAVY_BYTES = UNICODE_HEAVY.encode('utf-8')

Pattern Metadata

Each pattern includes metadata for expected outcomes:

PATTERN_METADATA = {
    'UTF8_BASIC': {
        'expected_charset': 'utf-8',
        'expected_mimetype': 'text/plain',
        'confidence_minimum': 0.8,
        'is_textual': True,
        'line_separator': None,
    },
    'JPEG_HEADER': {
        'expected_charset': None,
        'expected_mimetype': 'image/jpeg',
        'confidence_minimum': 0.9,
        'is_textual': False,
        'line_separator': None,
    },
    # ... Additional metadata for all patterns
}
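Since the metadata is expected to cover all patterns, a small audit can catch constants that were added without an entry. The sketch below is one possible approach; audit_metadata and REQUIRED_KEYS are hypothetical names, and the required keys are taken from the two entries shown above. In a real test the constants mapping could come from vars() of the patterns module.

```python
REQUIRED_KEYS = frozenset({
    'expected_charset',
    'expected_mimetype',
    'confidence_minimum',
    'is_textual',
    'line_separator',
})


def audit_metadata(constants: dict, metadata: dict) -> dict:
    """Map each bytes constant name to a problem description, if any."""
    problems = {}
    for name, value in constants.items():
        # Only upper-case bytes constants are treated as patterns.
        if not name.isupper() or not isinstance(value, bytes):
            continue
        entry = metadata.get(name)
        if entry is None:
            problems[name] = 'missing entry'
            continue
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems[name] = 'missing keys: ' + ', '.join(sorted(missing))
    return problems


if __name__ == '__main__':
    constants = {'UTF8_BASIC': b'Hello, world!', 'PNG_HEADER': b'\x89PNG'}
    metadata = {'UTF8_BASIC': dict.fromkeys(REQUIRED_KEYS)}
    print(audit_metadata(constants, metadata))
```

An empty result means every bytes constant has a complete metadata entry.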

Usage Guidelines

Test Pattern Selection:

# Import patterns in test modules
from detextive import detect_charset  # Function under test; import path assumed
from .patterns import UTF8_BASIC, JPEG_HEADER, PATTERN_METADATA

# Use with expected outcomes
def test_charset_detection():
    result = detect_charset(UTF8_BASIC)
    expected = PATTERN_METADATA['UTF8_BASIC']['expected_charset']
    assert result == expected

Cross-Platform Testing:

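# Use platform variants for Windows compatibility
from detextive import detect_mimetype  # Function under test; import path assumed
from .patterns import JSON_PLATFORM_VARIANT

def test_json_detection_cross_platform():
    result = detect_mimetype(JSON_PLATFORM_VARIANT)
    # Accept either Unix or Windows detection
    assert result in ('application/json', 'text/plain')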

Property-Based Testing Integration:

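# Combine with hypothesis for edge case generation
from hypothesis import given, strategies as st
from detextive import detect_charset  # Function under test; import path assumed
from .patterns import ASCII_BASIC, LATIN1_BASIC, UTF8_BASIC

@given(content=st.sampled_from([UTF8_BASIC, LATIN1_BASIC, ASCII_BASIC]))
def test_charset_detection_deterministic(content):
    result1 = detect_charset(content)
    result2 = detect_charset(content)
    assert result1 == result2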

Implementation Notes

  • Content patterns are defined as module-level bytes constants; the location context values (EXTENSIONS, URLS) and UNICODE_HEAVY are str-based

  • Metadata dictionary provides expected outcomes for assertions

  • Patterns cover both positive cases (successful detection) and negative cases (detection failures)

  • Cross-platform variants account for python-magic vs python-magic-bin differences

  • Content length patterns enable confidence scoring validation

  • Location patterns support context-aware detection testing