API
Package detextive
Detects textual content.
Module detextive.charsets
Management of bytes array decoding via trial character sets.
- detextive.charsets.attempt_decodes(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), inference=absence.absent, supplement=absence.absent, location=absence.absent, validator=absence.absent)¶
Attempts to decode content with various character sets.
Will try character sets in the order specified by the trial codecs listed on the behaviors object.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
inference (str | absence.objects.AbsentSingleton)
supplement (str | absence.objects.AbsentSingleton)
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton)
validator (collections.abc.Callable[ [ str, detextive.core.CharsetResult ], None ] | absence.objects.AbsentSingleton)
- Return type:
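As a rough illustration of trying codecs in a fixed order, the sketch below uses only the standard library. The name first_successful_decode and the codec tuple are hypothetical; this is not the implementation of attempt_decodes and it omits inference, supplements, and validation.

```python
from typing import Tuple


def first_successful_decode(
    content: bytes, codecs_to_try: Tuple[str, ...]
) -> Tuple[str, str]:
    """Returns (text, codec) for the first codec which decodes the content."""
    for codec in codecs_to_try:
        try:
            return content.decode(codec, errors='strict'), codec
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or an unknown codec name: try the next one.
            continue
    raise ValueError('No trial codec could decode the content.')


# Latin-1 bytes are not valid UTF-8 here, so the second codec is used.
text, codec = first_successful_decode(
    'café'.encode('latin-1'), ('utf-8', 'latin-1'))
```

Ordering matters in such a scheme: permissive single-byte codecs accept almost any input, so they are usually best placed after stricter ones.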
- detextive.charsets.discover_os_charset_default()¶
Discovers default character set encoding from operating system.
- Return type:
- detextive.charsets.normalize_charset(charset, bom_cognizant=False)¶
Normalizes character set encoding names.
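One common way to normalize encoding names in Python is to ask the codec registry for the canonical name. The sketch below shows that technique only; canonical_codec_name is a hypothetical helper and may differ from what normalize_charset actually returns.

```python
import codecs


def canonical_codec_name(charset: str) -> str:
    """Maps an encoding alias to the name Python registers for it."""
    return codecs.lookup(charset).name


names = [
    canonical_codec_name(alias)
    for alias in ('UTF8', 'Latin-1', 'us-ascii')]
```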
- detextive.charsets.normalize_charset_for_content(content, charset)¶
Normalizes charset reporting based on byte-order mark provenance.
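To make the role of a byte-order mark concrete, the following sketch checks content for the standard marks exposed by the codecs module. The helper charset_from_bom and the charset names it reports are assumptions for illustration, not the reporting rules of normalize_charset_for_content.

```python
import codecs
from typing import Optional

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)


def charset_from_bom(content: bytes) -> Optional[str]:
    """Returns a charset implied by a leading BOM, or None without one."""
    for mark, charset in _BOMS:
        if content.startswith(mark):
            return charset
    return None
```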
- detextive.charsets.trial_decode_as_confident(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), inference=absence.absent, confidence=0.0, supplement=absence.absent, location=absence.absent)¶
Performs trial decode of content.
Considers desired trial decode behavior and detection confidence.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
inference (str | absence.objects.AbsentSingleton)
confidence (float)
supplement (str | absence.objects.AbsentSingleton)
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton)
- Return type:
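The interplay between a tristate setting and a confidence threshold can be sketched as follows. Tristate is a hypothetical stand-in for BehaviorTristate with illustrative member values, and the rule shown (skip the trial decode once confidence reaches the threshold) is an assumption drawn from the description of trial_decode_confidence.

```python
from enum import Enum, auto


class Tristate(Enum):
    """Hypothetical stand-in; member values are illustrative only."""
    Never = auto()
    AsNeeded = auto()
    Always = auto()


def should_trial_decode(
    mode: Tristate, confidence: float, threshold: float = 0.8
) -> bool:
    """Decides whether a trial decode is warranted."""
    if mode is Tristate.Always:
        return True
    if mode is Tristate.Never:
        return False
    # AsNeeded: only decode when detection was not confident enough.
    return confidence < threshold
```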
Module detextive.core
Core types and behaviors.
- type detextive.core.BehaviorsArgument = detextive.core.Behaviors¶
- class detextive.core.BehaviorTristate(value)
Bases: Enum
When to apply behavior.
- Variables:
Never (detextive.core.BehaviorTristate)
AsNeeded (detextive.core.BehaviorTristate)
Always (detextive.core.BehaviorTristate)
- class detextive.core.Behaviors(*, bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=DetectFailureActions.Default, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=DetectFailureActions.Default, on_decode_error='strict', remove_bom=True, text_validate=BehaviorTristate.AsNeeded, text_validate_confidence=0.8, trial_codecs=(CodecSpecifiers.UserSupplement, 'utf-8', CodecSpecifiers.FromInference, CodecSpecifiers.OsDefault, CodecSpecifiers.PythonDefault), trial_decode=BehaviorTristate.AsNeeded, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False)¶
Bases: DataclassObject
How functions behave.
- Variables:
bytes_quantity_confidence_divisor (int) – Minimum number of bytes for full detection confidence.
charset_detect (bool) – Whether to detect charset from content.
charset_detectors_order (collections.abc.Sequence[ str ]) – Order in which charset detectors should be applied.
charset_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on charset detection failure.
mimetype_detect (bool) – Whether to detect MIME type from content.
mimetype_detectors_order (collections.abc.Sequence[ str ]) – Order in which MIME type detectors should be applied.
mimetype_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on MIME type detection failure.
on_decode_error (str) – Response to charset decoding errors. Standard values are ‘ignore’, ‘replace’, and ‘strict’. Can also be any other name which has been registered via the ‘register_error’ function in the Python standard library ‘codecs’ module.
remove_bom (bool) – Remove byte-ordering mark?
text_validate (detextive.core.BehaviorTristate) – When to validate text.
text_validate_confidence (float) – Minimum confidence to skip text validation.
trial_codecs (collections.abc.Sequence[ str | detextive.core.CodecSpecifiers ]) – Sequence of codec names or specifiers.
trial_decode (detextive.core.BehaviorTristate) – When to perform trial decode of content with charset.
trial_decode_confidence (float) – Minimum confidence to skip trial decode.
utf_16_32_requires_byte_order (bool) – Require explicit byte order for BOM-less generic UTF-16/32?
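The values accepted by on_decode_error are the error handler names of the standard library codecs machinery. The sketch below shows how the standard handlers differ and how a custom one is registered; the handler name 'question-mark' and the function replace_with_question_mark are hypothetical, and the example calls bytes.decode directly rather than any detextive function.

```python
import codecs


def replace_with_question_mark(error):
    """Hypothetical handler: substitutes '?' for each undecodable run."""
    if isinstance(error, UnicodeDecodeError):
        # Return the replacement and the position at which to resume.
        return '?', error.end
    raise error


codecs.register_error('question-mark', replace_with_question_mark)

raw = b'caf\xe9'  # Latin-1 bytes which are not valid UTF-8.
ignored = raw.decode('utf-8', errors='ignore')
replaced = raw.decode('utf-8', errors='replace')
custom = raw.decode('utf-8', errors='question-mark')
```

With 'strict', the same decode raises UnicodeDecodeError instead of returning text.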
- class detextive.core.CharsetResult(*, charset, confidence)
Bases: DataclassObject
Character set encoding with detection confidence.
- class detextive.core.CodecSpecifiers(value)
Bases: Enum
Specifiers for dynamic codecs.
- Variables:
FromInference (detextive.core.CodecSpecifiers)
OsDefault (detextive.core.CodecSpecifiers)
PythonDefault (detextive.core.CodecSpecifiers)
UserSupplement (detextive.core.CodecSpecifiers)
- class detextive.core.DetectFailureActions(value)
Bases: Enum
Possible responses to detection failure.
- Variables:
Default (detextive.core.DetectFailureActions)
- class detextive.core.MimetypeResult(*, mimetype, confidence)
Bases: DataclassObject
MIME type with detection confidence.
- detextive.core.confidence_from_bytes_quantity(content, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False))¶
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
- Return type:
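The description of bytes_quantity_confidence_divisor suggests that confidence grows with sample size until the divisor is reached. A minimal sketch of that idea, assuming linear scaling capped at 1.0 (the actual formula is not stated here, and confidence_from_size is a hypothetical name):

```python
def confidence_from_size(size: int, divisor: int = 1024) -> float:
    """Scales confidence linearly with byte count, capped at 1.0."""
    return min(1.0, size / divisor)


quarter = confidence_from_size(256)   # A quarter of the divisor.
full = confidence_from_size(4096)     # Beyond the divisor: capped.
```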
Module detextive.decoders
Conversion of bytes arrays to Unicode text.
- class detextive.decoders.DecodeInformResult(*, text, charset, mimetype, linesep)
Bases: DataclassObject
Decoded text with supplemental inference metadata.
- Variables:
text (str) – Decoded text content.
charset (detextive.core.CharsetResult) – Charset used for decoding.
mimetype (detextive.core.MimetypeResult) – Inferred MIME type metadata.
linesep (detextive.lineseparators.LineSeparators | None) – Detected line separator from content sample.
- detextive.decoders.decode(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), profile=Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192), http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent)¶
Decodes bytes array to Unicode text.
Uses trial decoding and validation; does not provide default-return semantics. The charset_supplement parameter is a trial hint and not a fallback return value.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
profile (detextive.validation.Profile) – Text validation profile for content analysis.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
- Return type:
- detextive.decoders.decode_inform(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), profile=Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192), mimetype_default='text/plain', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent)¶
Decodes bytes and returns supplemental inference metadata.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
profile (detextive.validation.Profile) – Text validation profile for content analysis.
mimetype_default (str) – Fallback MIME type returned on inference/detection failure.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
- Return type:
Module detextive.detectors
Core detection function implementations.
- type detextive.detectors.CharsetDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]¶
- type detextive.detectors.MimetypeDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]¶
- detextive.detectors.charset_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'chardet': <function _detect_via_chardet at 0x7f20c9c31120>, 'charset-normalizer': <function _detect_via_charset_normalizer at 0x7f20c9d6a560>} )¶
- detextive.detectors.mimetype_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'magic': <function _detect_via_magic at 0x7f20c9da3400>, 'puremagic': <function _detect_via_puremagic at 0x7f20c9c81750>} )¶
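The type aliases above allow a detector to return NotImplemented in place of a result. One plausible reading, sketched below with toy detectors, is that detectors are tried in a configured order and a NotImplemented return means "no answer, try the next one". The registry contents, the function names, and the fallback value are all hypothetical; results are plain (charset, confidence) tuples rather than CharsetResult objects.

```python
def detect_ascii(content: bytes):
    """Toy detector: claims ASCII only when every byte is below 0x80."""
    if all(byte < 0x80 for byte in content):
        return ('ascii', 1.0)
    return NotImplemented


def detect_utf8(content: bytes):
    """Toy detector: claims UTF-8 when a strict decode succeeds."""
    try:
        content.decode('utf-8')
    except UnicodeDecodeError:
        return NotImplemented
    return ('utf-8', 0.9)


registry = {'ascii': detect_ascii, 'utf-8': detect_utf8}


def detect(content: bytes, order, default=('latin-1', 0.0)):
    """Returns the first usable result from detectors in the given order."""
    for name in order:
        detector = registry.get(name)
        if detector is None:
            continue  # Unknown detector name: skip it.
        result = detector(content)
        if result is not NotImplemented:
            return result
    return default
```

Keeping detectors in a name-keyed registry lets the order be expressed as a simple sequence of strings, as charset_detectors_order and mimetype_detectors_order do.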
- detextive.detectors.detect_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)¶
Detects character set.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Fallback character set returned on inference/detection failure.
supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
str | None
- detextive.detectors.detect_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)¶
Detects character set candidates with confidence scores.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Fallback character set returned on inference/detection failure.
supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
- detextive.detectors.detect_mimetype(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='application/octet-stream', charset=absence.absent, location=absence.absent)¶
Detects most probable MIME type.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Fallback MIME type returned on inference/detection failure.
charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
- detextive.detectors.detect_mimetype_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='application/octet-stream', charset=absence.absent, location=absence.absent)¶
Detects MIME type candidates with confidence scores.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Fallback MIME type returned on inference/detection failure.
charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
Module detextive.exceptions
Family of exceptions for package API.
- exception detextive.exceptions.BehaviorsInvalidity(attribute, expectation)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.CharsetDetectFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.CharsetInferFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.ContentDecodeFailure(charset, location=absence.absent)
Bases: Omnierror, UnicodeError
- exception detextive.exceptions.ContentDecodeImpossibility(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.MimetypeDetectFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.MimetypeInferFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.Omnierror(*posargs, **nomargs)
Bases: Omniexception, Exception
Base for error exceptions raised by package API.
- exception detextive.exceptions.Omniexception(*posargs, **nomargs)
Bases: Omniexception
Base for all exceptions raised by package API.
- exception detextive.exceptions.TextInvalidity(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.TextualMimetypeInvalidity(mimetype, location=absence.absent)
Bases: Omnierror, ValueError
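Because these exceptions also derive from built-in types, callers can catch them either through the package base or through the familiar built-in. The sketch below mirrors the listed bases with local stand-in classes (simplified: the stand-in Omnierror derives from Exception only) to show which except clause matches; it does not import detextive.

```python
class Omnierror(Exception):
    """Stand-in for the package error base."""


class CharsetDetectFailure(Omnierror, TypeError, ValueError):
    """Stand-in mirroring the bases listed above."""


class ContentDecodeFailure(Omnierror, UnicodeError):
    """Stand-in mirroring the bases listed above."""


def classify(error: Exception) -> str:
    """Reports which built-in except clause catches the error first."""
    try:
        raise error
    except UnicodeError:
        return 'unicode'
    except ValueError:
        return 'value'
    except Exception:
        return 'other'
```

Since UnicodeError is itself a subclass of ValueError, a handler for ValueError catches both stand-ins, while a handler for Omnierror catches only package errors.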
Module detextive.inference
Core inference function implementations.
- detextive.inference.infer_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)¶
Infers charset through various means.
charset_default is the returned fallback when inference cannot determine another charset. charset_supplement is a user-supplied hint used during inference/validation.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Fallback character set returned on inference/detection failure.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
charset_supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
mimetype_supplement (str | absence.objects.AbsentSingleton) – User-supplied MIME type hint for inference.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
str | None
- detextive.inference.infer_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)¶
Infers charset with confidence level through various means.
charset_default is the returned fallback when inference cannot determine another charset. charset_supplement is a user-supplied hint used during inference/validation. http_content_type is parsed when supplied, independent of detector enablement behavior.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Fallback character set returned on inference/detection failure.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
charset_supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
mimetype_supplement (str | absence.objects.AbsentSingleton) – User-supplied MIME type hint for inference.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
- detextive.inference.infer_mimetype_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶
Infers MIME type and charset through various means.
*_default values are returned fallbacks on inference failure. *_supplement values are user-supplied hints used to guide inference before fallback behavior is applied.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Fallback character set returned on inference/detection failure.
mimetype_default (str) – Fallback MIME type returned on inference/detection failure.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
mimetype_supplement (str | absence.objects.AbsentSingleton) – User-supplied MIME type hint for inference.
- Return type:
- detextive.inference.infer_mimetype_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶
Infers MIME type and charset with confidence levels through various means.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Fallback character set returned on inference/detection failure.
mimetype_default (str) – Fallback MIME type returned on inference/detection failure.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – User-supplied character set hint for trial decode attempts.
mimetype_supplement (str | absence.objects.AbsentSingleton) – User-supplied MIME type hint for inference.
- Return type:
tuple[ detextive.core.MimetypeResult, detextive.core.CharsetResult ]
- detextive.inference.parse_http_content_type(http_content_type)¶
Parses RFC 9110 HTTP Content-Type header.
Returns the normalized MIME type and charset when they can be extracted. Either one is marked as absent when it cannot be extracted.
- Parameters:
http_content_type (str)
- Return type:
tuple[ str | absence.objects.AbsentSingleton, str | None | absence.objects.AbsentSingleton ]
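For comparison, a Content-Type header can be split into MIME type and charset with the standard library email package. This is a sketch of the general technique, not the parser used by parse_http_content_type: the helper name split_content_type is hypothetical, and it reports a missing charset as None instead of an absent sentinel.

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(header: str) -> Tuple[str, Optional[str]]:
    """Returns (mimetype, charset) parsed from a Content-Type value."""
    message = Message()
    message['Content-Type'] = header
    # Both accessors lowercase their results.
    return message.get_content_type(), message.get_content_charset()


html = split_content_type('text/html; charset=UTF-8')
json_ = split_content_type('application/json')
```

Note that Message.get_content_type falls back to 'text/plain' for malformed values, so a caller which needs to distinguish "unparseable" from "plain text" would have to validate the header separately.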
- detextive.inference.validate_httpct_charset(content, charset, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False))¶
- Parameters:
content (bytes) – Raw byte content for analysis.
charset (str)
behaviors (detextive.core.Behaviors)
- Return type:
detextive.core.CharsetResult | absence.objects.AbsentSingleton
Module detextive.lineseparators
Line separator enumeration and utilities.
- class detextive.lineseparators.LineSeparators(value)
Bases: Enum
Line separators for cross-platform text processing.
- Variables:
- classmethod detect_bytes(content, limit=1024)¶
Detects line separator from byte content sample.
Returns detected LineSeparators enum member or None.
- classmethod detect_text(text, limit=1024)¶
Detects line separator from text (Unicode string).
Returns detected LineSeparators enum member or None.
- nativize(content)¶
Converts Unix LF to this platform’s line separator.
- normalize(content)¶
Normalizes specific line separator to Unix LF format.
- classmethod normalize_universal(content)¶
Normalizes all line separators to Unix LF format.
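A simplified picture of separator detection and universal normalization is sketched below. Separator is a hypothetical stand-in for LineSeparators, and choosing the most frequent separator within the sampled prefix is an assumption; the library may use a different rule, such as the first separator encountered.

```python
from enum import Enum
from typing import Optional


class Separator(Enum):
    """Hypothetical stand-in for the line separator enumeration."""
    CR = '\r'
    CRLF = '\r\n'
    LF = '\n'


def detect_separator(content: bytes, limit: int = 1024) -> Optional[Separator]:
    """Returns the most frequent separator in the sample, or None."""
    sample = content[:limit]
    crlf = sample.count(b'\r\n')
    # Exclude the halves of CRLF pairs from the lone CR and LF counts.
    counts = {
        Separator.CRLF: crlf,
        Separator.CR: sample.count(b'\r') - crlf,
        Separator.LF: sample.count(b'\n') - crlf,
    }
    best = max(counts, key=counts.get)
    return best if counts[best] else None


def normalize_universal(text: str) -> str:
    """Converts CRLF and lone CR to LF; CRLF must be replaced first."""
    return text.replace('\r\n', '\n').replace('\r', '\n')
```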
Module detextive.mimetypes
Determination of MIME types and textuality thereof.
- detextive.mimetypes.is_textual_mimetype(mimetype)¶
Checks if MIME type represents textual content.
- detextive.mimetypes.mimetype_from_location(location)¶
Determines MIME type from file location.
- Parameters:
location (str | os.PathLike[ str ]) – Local filesystem location or URL for context.
- Return type:
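The standard library mimetypes module offers a comparable lookup from a file name, which the sketch below pairs with a rule-of-thumb textuality check. Both helper names are hypothetical, and the textuality rule (text/*, a few application types, and +json or +xml suffixes) is an assumption rather than the list that is_textual_mimetype consults.

```python
import mimetypes
from typing import Optional


def guess_mimetype(location: str) -> Optional[str]:
    """Guesses a MIME type from the file extension, or None if unknown."""
    mimetype, _encoding = mimetypes.guess_type(location)
    return mimetype


def looks_textual(mimetype: str) -> bool:
    """Assumed rule of thumb for whether a MIME type denotes text."""
    return (
        mimetype.startswith('text/')
        or mimetype in {'application/json', 'application/xml'}
        or mimetype.endswith(('+json', '+xml')))
```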
Module detextive.nomina
Common names and type aliases.
- type detextive.nomina.Location = str | os.PathLike[str]¶
- type detextive.nomina.CharsetAssumptionArgument = str | absence.objects.AbsentSingleton¶
- type detextive.nomina.CharsetSupplementArgument = str | absence.objects.AbsentSingleton¶
- type detextive.nomina.HttpContentTypeArgument = str | absence.objects.AbsentSingleton¶
- type detextive.nomina.LocationArgument = str | os.PathLike[str] | absence.objects.AbsentSingleton¶
- type detextive.nomina.MimetypeAssumptionArgument = str | absence.objects.AbsentSingleton¶
- type detextive.nomina.MimetypeSupplementArgument = str | absence.objects.AbsentSingleton¶
Module detextive.validation
Validation of textual content.
- type detextive.validation.ProfileArgument = detextive.validation.Profile¶
- detextive.validation.PROFILE_PRINTER_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u061c', '\u2066', '\n', '\u2069', '\u200f', '\u2068', '\u200c', '\t', '\x0c', '\u2067', '\r', '\u200d', '\u202b', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Zp', 'Co', 'Zl', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- detextive.validation.PROFILE_TEXTUAL: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- detextive.validation.PROFILE_TERMINAL_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Zp', 'Co', 'Zl', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- detextive.validation.PROFILE_TERMINAL_SAFE_ANSI: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u061c', '\u2066', '\n', '\u2069', '\u200f', '\u2068', '\u200c', '\t', '\x1b', '\u2067', '\r', '\u200d', '\u202b', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Zp', 'Co', 'Zl', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- class detextive.validation.Profile(*, acceptable_characters=frozenset({'\t', '\n', '\r', '\u061c', '\u200c', '\u200d', '\u200e', '\u200f', '\u202a', '\u202b', '\u202c', '\u202d', '\u202e', '\u2066', '\u2067', '\u2068', '\u2069'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Co', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
Bases: DataclassObject
Configuration for text validation heuristics.
- Variables:
acceptable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered valid.
check_bom (bool) – Allow leading BOM; reject embedded BOMs.
printables_ratio_min (float) – Minimum ratio of printable characters to total characters.
rejectable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered invalid.
rejectable_families (collections.abc.Set[ str ]) – Set of Unicode categories which are always considered invalid.
rejectables_ratio_max (float) – Maximum ratio of rejectable characters to total characters.
sample_quantity (int | None) – Number of characters to sample.
- detextive.validation.is_valid_text(text, /, profile=Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192))¶
Is content valid against profile?
- Parameters:
text (str)
profile (detextive.validation.Profile)
- Return type:
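To show how the profile fields could combine, here is a much-reduced validation heuristic built on unicodedata. It is a sketch under stated assumptions: looks_like_text is a hypothetical name, the acceptable set is cut down to tab, LF, and CR, BOM checks are omitted, and treating empty input as valid is a choice made for this example only.

```python
import unicodedata

ACCEPTABLE = frozenset('\t\n\r')  # Reduced set for illustration.
REJECTABLE_FAMILIES = frozenset({'Cc', 'Cf', 'Co', 'Cs'})


def looks_like_text(
    text: str,
    printables_ratio_min: float = 0.85,
    rejectables_ratio_max: float = 0.0,
    sample_quantity: int = 8192,
) -> bool:
    """Applies ratio thresholds to a leading sample of the text."""
    sample = text[:sample_quantity]
    if not sample:
        return True  # Assumption: empty input is vacuously valid.
    printables = rejectables = 0
    for character in sample:
        if character in ACCEPTABLE:
            printables += 1
        elif unicodedata.category(character) in REJECTABLE_FAMILIES:
            rejectables += 1
        elif character.isprintable():
            printables += 1
    total = len(sample)
    return (
        rejectables / total <= rejectables_ratio_max
        and printables / total >= printables_ratio_min)
```

With rejectables_ratio_max at 0.0, a single control character outside the acceptable set is enough to fail validation, which is why the acceptable set takes precedence in the loop.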