API
Package detextive
Detects textual content.
Module detextive.charsets
Management of bytes array decoding via trial character sets.
- detextive.charsets.attempt_decodes(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), inference=absence.absent, supplement=absence.absent, location=absence.absent)¶
Attempts to decode content with various character sets.
Will try character sets in the order specified by the trial codecs listed on the behaviors object.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
inference (str | absence.objects.AbsentSingleton)
supplement (str | absence.objects.AbsentSingleton)
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton)
- Return type:
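The ordered trial described above can be pictured with a standard-library-only sketch. The helper name `try_decodes` and its return shape are hypothetical and are not part of detextive; the sketch only illustrates the general technique of attempting codecs in sequence and keeping the first success.

```python
def try_decodes(content: bytes, codecs_order: tuple[str, ...], errors: str = 'strict'):
    """Returns (text, codec) for the first codec which decodes the content.

    Hypothetical helper for illustration; not the detextive implementation.
    """
    for codec in codecs_order:
        try:
            return content.decode(codec, errors=errors), codec
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or unknown codec name: move on to next trial.
            continue
    raise ValueError("No trial codec could decode the content.")


# Latin-1 bytes are not valid UTF-8, so the second trial codec is used.
text, codec = try_decodes('café'.encode('latin-1'), ('utf-8-sig', 'latin-1'))

# The 'utf-8-sig' codec accepts plain ASCII and strips a leading BOM if present.
bom_text, bom_codec = try_decodes(b'\xef\xbb\xbfhi', ('utf-8-sig',))
```

Placing 'utf-8-sig' ahead of 'utf-8' or 'ascii' in a trial order costs nothing for BOM-less content, which is one plausible reading of the default `charset_promotions` mapping shown in the signature.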
- detextive.charsets.discover_os_charset_default()
Discovers default character set encoding from operating system.
- Return type:
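As a rough illustration of what discovering an operating system default can involve, the following sketch asks the standard library for the locale's preferred encoding. The function name `os_charset_default` is hypothetical, and whether detextive consults the same source is an assumption.

```python
import codecs
import locale


def os_charset_default() -> str:
    """Returns the locale's preferred encoding, canonicalized when possible.

    Hypothetical helper; detextive may consult other sources.
    """
    name = locale.getpreferredencoding(False)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to Python's codec registry: return the raw name, lowercased.
        return name.lower()


os_default = os_charset_default()
```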
- detextive.charsets.normalize_charset(charset)
Normalizes character set encoding names.
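One common way to normalize encoding names in Python is to resolve them through the codec registry, so that aliases collapse to a single canonical name. The sketch below shows that technique with a hypothetical `normalize_name` helper; it is not necessarily how `normalize_charset` is implemented.

```python
import codecs


def normalize_name(charset: str) -> str:
    """Resolves an encoding alias to the canonical name of its Python codec."""
    return codecs.lookup(charset).name


canonical = normalize_name('UTF8')  # 'utf-8'
# Different spellings of the same encoding resolve to the same codec name.
same = normalize_name('latin-1') == normalize_name('ISO-8859-1')
```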
- detextive.charsets.trial_decode_as_confident(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), inference=absence.absent, confidence=0.0, supplement=absence.absent, location=absence.absent)¶
Performs trial decode of content.
Considers desired trial decode behavior and detection confidence.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
inference (str | absence.objects.AbsentSingleton)
confidence (float)
supplement (str | absence.objects.AbsentSingleton)
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton)
- Return type:
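The interplay between a tristate behavior and a confidence threshold can be sketched as follows. The `Tristate` enum is a local stand-in whose member values are arbitrary, and the function `should_trial_decode` is hypothetical; the logic assumes that "minimum confidence to skip" means the trial is skipped once confidence reaches the threshold.

```python
from enum import Enum


class Tristate(Enum):
    """Local stand-in for a never / as-needed / always switch."""
    Never = 1
    AsNeeded = 2
    Always = 3


def should_trial_decode(mode: Tristate, confidence: float, threshold: float = 0.8) -> bool:
    """Decides whether a trial decode is warranted (assumed semantics)."""
    if mode is Tristate.Never:
        return False
    if mode is Tristate.Always:
        return True
    # As needed: only when detection confidence falls short of the threshold.
    return confidence < threshold


low = should_trial_decode(Tristate.AsNeeded, 0.5)    # below threshold
high = should_trial_decode(Tristate.AsNeeded, 0.95)  # at or above threshold
```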
Module detextive.core
Core types and behaviors.
- type detextive.core.BehaviorsArgument = detextive.core.Behaviors
- class detextive.core.BehaviorTristate(value)
Bases: Enum
When to apply behavior.
- Variables:
Never (detextive.core.BehaviorTristate)
AsNeeded (detextive.core.BehaviorTristate)
Always (detextive.core.BehaviorTristate)
- class detextive.core.Behaviors(*, bytes_quantity_confidence_divisor=1024, charset_detect=BehaviorTristate.AsNeeded, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=DetectFailureActions.Default, charset_promotions=<factory>, mimetype_detect=BehaviorTristate.AsNeeded, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=DetectFailureActions.Default, on_decode_error='strict', text_validate=BehaviorTristate.AsNeeded, text_validate_confidence=0.8, trial_codecs=(CodecSpecifiers.FromInference, CodecSpecifiers.UserSupplement), trial_decode=BehaviorTristate.AsNeeded, trial_decode_confidence=0.8)¶
Bases: DataclassObject
How functions behave.
- Variables:
bytes_quantity_confidence_divisor (int) – Minimum number of bytes for full detection confidence.
charset_detect (detextive.core.BehaviorTristate) – When to detect charset from content.
charset_detectors_order (collections.abc.Sequence[ str ]) – Order in which charset detectors should be applied.
charset_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on charset detection failure.
charset_promotions (collections.abc.Mapping[ str, str ]) –
Which detected charsets to promote to other charsets.
E.g., 7-bit ASCII to UTF-8.
mimetype_detect (detextive.core.BehaviorTristate) – When to detect MIME type from content.
mimetype_detectors_order (collections.abc.Sequence[ str ]) – Order in which MIME type detectors should be applied.
mimetype_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on MIME type detection failure.
on_decode_error (str) –
Response to charset decoding errors.
Standard values are ‘ignore’, ‘replace’, and ‘strict’. Can also be any other name which has been registered via the ‘register_error’ function in the Python standard library ‘codecs’ module.
text_validate (detextive.core.BehaviorTristate) – When to validate text.
text_validate_confidence (float) – Minimum confidence to skip text validation.
trial_codecs (collections.abc.Sequence[ str | detextive.core.CodecSpecifiers ]) – Sequence of codec names or specifiers.
trial_decode (detextive.core.BehaviorTristate) – When to perform trial decode of content with charset.
trial_decode_confidence (float) – Minimum confidence to skip trial decode.
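The `on_decode_error` values correspond to error handler names understood by Python's codecs. The sketch below uses only the standard library to show the standard handlers and a custom one registered through `codecs.register_error`; the handler name 'hexmarker' and the function `replace_with_hex` are made up for illustration.

```python
import codecs


def replace_with_hex(error: UnicodeError):
    """Substitutes each undecodable byte with a visible <0xNN> marker."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    marker = ''.join(f"<0x{byte:02x}>" for byte in bad)
    # Return the replacement text and the position at which to resume.
    return marker, error.end


# 'hexmarker' is a hypothetical handler name chosen for this example.
codecs.register_error('hexmarker', replace_with_hex)

data = b'caf\xe9'  # Latin-1 bytes; invalid as UTF-8
ignored = data.decode('utf-8', errors='ignore')
replaced = data.decode('utf-8', errors='replace')
marked = data.decode('utf-8', errors='hexmarker')
```

Once registered, a name such as 'hexmarker' can be used anywhere a codec error handler name is accepted, which is presumably what allows it to be passed as `on_decode_error`.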
- class detextive.core.CharsetResult(*, charset, confidence)
Bases: DataclassObject
Character set encoding with detection confidence.
- class detextive.core.CodecSpecifiers(value)
Bases: Enum
Specifiers for dynamic codecs.
- Variables:
FromInference (detextive.core.CodecSpecifiers)
OsDefault (detextive.core.CodecSpecifiers)
PythonDefault (detextive.core.CodecSpecifiers)
UserSupplement (detextive.core.CodecSpecifiers)
- class detextive.core.DetectFailureActions(value)
Bases: Enum
Possible responses to detection failure.
- Variables:
Default (detextive.core.DetectFailureActions)
- class detextive.core.MimetypeResult(*, mimetype, confidence)
Bases: DataclassObject
MIME type with detection confidence.
- detextive.core.confidence_from_bytes_quantity(content, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8))¶
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
- Return type:
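This function carries no description here. Given that `bytes_quantity_confidence_divisor` is documented as the minimum number of bytes for full detection confidence, one plausible reading is a linear scale capped at 1.0. The sketch below encodes that assumption with a hypothetical `confidence_from_length` function; the actual formula may differ.

```python
def confidence_from_length(content: bytes, divisor: int = 1024) -> float:
    """Scales confidence linearly with content size, capped at 1.0.

    Assumed formula for illustration only.
    """
    return min(1.0, len(content) / divisor)


half = confidence_from_length(b'x' * 512)   # 0.5
full = confidence_from_length(b'x' * 4096)  # 1.0
```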
Module detextive.decoders
Conversion of bytes arrays to Unicode text.
- detextive.decoders.decode(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), profile=Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶
Decodes bytes array to Unicode text.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
profile (detextive.validation.Profile) – Text validation profile for content analysis.
charset_default (str) – Default character set to use when detection fails.
mimetype_default (str) – Default MIME type to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.
- Return type:
Module detextive.detectors
Core detection function implementations.
- type detextive.detectors.CharsetDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]
- type detextive.detectors.MimetypeDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]
- detextive.detectors.charset_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'chardet': <function _detect_via_chardet at 0x7f38ef01bbe0>, 'charset-normalizer': <function _detect_via_charset_normalizer at 0x7f38ef15f880>} )¶
- detextive.detectors.mimetype_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'magic': <function _detect_via_magic at 0x7f38ef036cb0>, 'puremagic': <function _detect_via_puremagic at 0x7f38ef1b77f0>} )¶
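The detector type aliases describe callables which take content and behaviors and return either a result or `NotImplemented`. The following self-contained sketch mimics that shape with stand-ins: `CharsetResultSketch`, `detect_bom`, the plain `registry` dict, and `first_result` are all hypothetical, and the way detextive actually walks its registries is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CharsetResultSketch:
    """Stand-in for a charset result with a confidence score."""
    charset: Optional[str]
    confidence: float


def detect_bom(content: bytes, behaviors: object):
    """Reports 'utf-8-sig' when a UTF-8 BOM is present; otherwise abstains."""
    if content.startswith(b'\xef\xbb\xbf'):
        return CharsetResultSketch(charset='utf-8-sig', confidence=1.0)
    return NotImplemented


# Plain dict standing in for a name-to-detector registry.
registry = {'bom': detect_bom}


def first_result(content: bytes, order, behaviors: object = None):
    """Returns the first result from detectors which do not abstain."""
    for name in order:
        detector = registry.get(name)
        if detector is None:
            continue  # Detector not registered under this name.
        result = detector(content, behaviors)
        if result is not NotImplemented:
            return result
    return None


found = first_result(b'\xef\xbb\xbfhello', ('missing', 'bom'))
nothing = first_result(b'hello', ('bom',))
```

Returning `NotImplemented` rather than raising lets a detector decline cheaply, for example when an optional dependency is unavailable.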
- detextive.detectors.detect_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)¶
Detects character set.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default character set to use when detection fails.
supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
str | None
- detextive.detectors.detect_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)¶
Detects character set candidates with confidence scores.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default character set to use when detection fails.
supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
- detextive.detectors.detect_mimetype(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='application/octet-stream', charset=absence.absent, location=absence.absent)¶
Detects most probable MIME type.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default MIME type to use when detection fails.
charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
- detextive.detectors.detect_mimetype_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='application/octet-stream', charset=absence.absent, location=absence.absent)¶
Detects MIME type candidates with confidence scores.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default MIME type to use when detection fails.
charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
Module detextive.exceptions
Family of exceptions for package API.
- exception detextive.exceptions.CharsetDetectFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.CharsetInferFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.ContentDecodeFailure(charset, location=absence.absent)
Bases: Omnierror, UnicodeError
- exception detextive.exceptions.ContentDecodeImpossibility(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.MimetypeDetectFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.MimetypeInferFailure(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.Omnierror(*posargs, **nomargs)
Bases: Omniexception, Exception
Base for error exceptions raised by package API.
- exception detextive.exceptions.Omniexception(*posargs, **nomargs)
Bases: Object, BaseException
Base for all exceptions raised by package API.
- exception detextive.exceptions.TextInvalidity(location=absence.absent)
Bases: Omnierror, TypeError, ValueError
- exception detextive.exceptions.TextualMimetypeInvalidity(mimetype, location=absence.absent)
Bases: Omnierror, ValueError
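Because the exceptions inherit from both a package base and built-in exception types, callers can catch them either way. The sketch below reproduces part of the hierarchy with local stand-in classes (same names, simplified constructors, and without the `Object` base) purely to demonstrate the catching behavior; it does not import detextive.

```python
# Local stand-ins mirroring the documented bases; constructors are simplified.
class Omniexception(BaseException):
    """Stand-in base for all package exceptions."""


class Omnierror(Omniexception, Exception):
    """Stand-in base for package error exceptions."""


class ContentDecodeFailure(Omnierror, UnicodeError):
    """Stand-in for a decode failure."""


class TextInvalidity(Omnierror, TypeError, ValueError):
    """Stand-in for a text validation failure."""


def classify(exception: BaseException) -> str:
    """Shows which handler catches a raised exception."""
    try:
        raise exception
    except UnicodeError:
        return 'unicode'
    except Omnierror:
        return 'package'


decode_kind = classify(ContentDecodeFailure("bad bytes"))
invalid_kind = classify(TextInvalidity("not text"))
```

Existing code that already handles `UnicodeError` or `ValueError` would therefore keep working, while new code can catch `Omnierror` to handle every package error at once.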
Module detextive.inference
Core inference function implementations.
- detextive.inference.infer_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)¶
Infers charset through various means.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
str | None
- detextive.inference.infer_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)¶
Infers charset with confidence level through various means.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
- Return type:
- detextive.inference.infer_mimetype_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶
Infers MIME type and charset through various means.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
mimetype_default (str) – Default MIME type to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.
- Return type:
- detextive.inference.infer_mimetype_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶
Infers MIME type and charset, each with a confidence level, through various means.
- Parameters:
content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
mimetype_default (str) – Default MIME type to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.
- Return type:
tuple[ detextive.core.MimetypeResult, detextive.core.CharsetResult ]
- detextive.inference.parse_http_content_type(http_content_type)
Parses RFC 9110 HTTP Content-Type header.
Returns normalized MIME type and charset, if able to be extracted. Marks either as absent, if not able to be extracted.
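For comparison, a Content-Type header can be split into MIME type and charset with the standard library's `email.message.Message`. The sketch below is not the detextive parser: the `parse_content_type` helper is hypothetical, and it returns `None` for a missing charset instead of the absence sentinel used by this package.

```python
from email.message import Message


def parse_content_type(header: str):
    """Splits a Content-Type header into (mimetype, charset or None)."""
    message = Message()
    message['Content-Type'] = header
    # get_content_type lowercases the type and falls back to 'text/plain'
    # when the header value is malformed.
    mimetype = message.get_content_type()
    charset = message.get_param('charset')
    if isinstance(charset, str):
        charset = charset.lower()
    return mimetype, charset


html = parse_content_type('text/html; charset=UTF-8')
json_only = parse_content_type('application/json')
```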
Module detextive.lineseparators
Line separator enumeration and utilities.
- class detextive.lineseparators.LineSeparators(value)
Bases: Enum
Line separators for cross-platform text processing.
- Variables:
- classmethod detect_bytes(content, limit=1024)
Detects line separator from byte content sample.
Returns detected LineSeparators enum member or None.
- classmethod detect_text(text, limit=1024)
Detects line separator from text (Unicode string).
Returns detected LineSeparators enum member or None.
- nativize(content)
Converts Unix LF to this platform’s line separator.
- normalize(content)
Normalizes specific line separator to Unix LF format.
- classmethod normalize_universal(content)
Normalizes all line separators to Unix LF format.
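The detection and normalization ideas above can be sketched without the enum. The functions below are hypothetical plain-string equivalents: `detect_separator` reports the first separator found within the sample limit, which is an assumed strategy, and `normalize_to_lf` converts CRLF and bare CR to LF.

```python
def detect_separator(text: str, limit: int = 1024):
    """Returns the first line separator found in the sample, or None."""
    sample = text[:limit]
    for index, char in enumerate(sample):
        if char == '\n':
            return '\n'
        if char == '\r':
            # A CR directly followed by LF is a single CRLF separator.
            # A CR at the very end of the sample is reported as bare CR.
            following = sample[index + 1:index + 2]
            return '\r\n' if following == '\n' else '\r'
    return None


def normalize_to_lf(text: str) -> str:
    """Converts CRLF and bare CR separators to LF."""
    # Replace CRLF first so that its CR is not converted separately.
    return text.replace('\r\n', '\n').replace('\r', '\n')


windows = detect_separator('first\r\nsecond')
unified = normalize_to_lf('a\r\nb\rc\nd')
```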
Module detextive.mimetypes
Determination of MIME types and textuality thereof.
- detextive.mimetypes.is_textual_mimetype(mimetype)
Checks if MIME type represents textual content.
- detextive.mimetypes.mimetype_from_location(location)
Determines MIME type from file location.
- Parameters:
location (str | os.PathLike[ str ]) – Local filesystem location or URL for context.
- Return type:
str | absence.objects.AbsentSingleton
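A location-based guess of this kind can be approximated with the standard library's `mimetypes.guess_type`. The helper `guess_from_location` below is hypothetical and returns `None` where this package would return its absence sentinel; whether detextive relies on the same lookup is an assumption.

```python
import mimetypes
import os


def guess_from_location(location):
    """Guesses a MIME type from a path or URL by its extension, or None."""
    mimetype, _encoding = mimetypes.guess_type(os.fspath(location))
    return mimetype


text_guess = guess_from_location('notes.txt')
unknown_guess = guess_from_location('no_extension')
```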
Module detextive.nomina
Common names and type aliases.
- type detextive.nomina.Location = str | os.PathLike[str]
- type detextive.nomina.LocationArgument = str | os.PathLike[str] | absence.objects.AbsentSingleton
Module detextive.validation
Validation of textual content.
- type detextive.validation.ProfileArgument = detextive.validation.Profile
- detextive.validation.PROFILE_PRINTER_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2069', '\n', '\x0c', '\u202c', '\u202e', '\r', '\u2068', '\u202b', '\u061c', '\u200e', '\u2067', '\u200f', '\u200c', '\u200d', '\u2066', '\t', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- detextive.validation.PROFILE_TEXTUAL: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- detextive.validation.PROFILE_TERMINAL_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- detextive.validation.PROFILE_TERMINAL_SAFE_ANSI: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2069', '\n', '\u202c', '\x1b', '\u202e', '\r', '\u2068', '\u202b', '\u061c', '\u200e', '\u2067', '\u200f', '\u200c', '\u200d', '\u2066', '\t', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
- class detextive.validation.Profile(*, acceptable_characters=frozenset({'\t', '\n', '\r', '\u061c', '\u200c', '\u200d', '\u200e', '\u200f', '\u202a', '\u202b', '\u202c', '\u202d', '\u202e', '\u2066', '\u2067', '\u2068', '\u2069'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Co', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶
Bases: DataclassObject
Configuration for text validation heuristics.
- Variables:
acceptable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered valid.
check_bom (bool) – Allow leading BOM; reject embedded BOMs.
printables_ratio_min (float) – Minimum ratio of printable characters to total characters.
rejectable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered invalid.
rejectable_families (collections.abc.Set[ str ]) – Set of Unicode categories which are always considered invalid.
rejectables_ratio_max (float) – Maximum ratio of rejectable characters to total characters.
sample_quantity (int | None) – Number of characters to sample.
- detextive.validation.is_valid_text(text, /, profile=Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192))¶
Is content valid against profile?
- Parameters:
text (str)
profile (detextive.validation.Profile)
- Return type:
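The profile fields suggest a ratio-based heuristic over a character sample. The sketch below is a simplified guess at such a check using `unicodedata`; the function `looks_like_text`, the reduced acceptable set, the treatment of empty input, and the way acceptable characters count toward the printable ratio are all assumptions and may not match `is_valid_text`.

```python
import unicodedata

# Reduced sets for illustration; the real profiles list more characters.
ACCEPTABLE = frozenset('\t\n\r')
REJECTABLE_FAMILIES = frozenset({'Cc', 'Cf', 'Co', 'Cs'})


def looks_like_text(
    text: str,
    printables_ratio_min: float = 0.85,
    rejectables_ratio_max: float = 0.0,
    sample_quantity=8192,
) -> bool:
    """Applies ratio thresholds to a sample of characters (assumed logic)."""
    sample = text[:sample_quantity]  # A limit of None samples everything.
    if not sample:
        return False  # Assumption: empty content is not considered valid.
    printables = rejectables = 0
    for char in sample:
        if char in ACCEPTABLE:
            printables += 1  # Assumption: acceptable characters count as printable.
        elif unicodedata.category(char) in REJECTABLE_FAMILIES:
            rejectables += 1
        elif char.isprintable():
            printables += 1
    total = len(sample)
    return (
        rejectables / total <= rejectables_ratio_max
        and printables / total >= printables_ratio_min
    )


plain = looks_like_text('Hello, world!\n')
binaryish = looks_like_text('abc\x00def')
```

With `rejectables_ratio_max` at 0.0, a single control character such as NUL is enough to fail the check, which matches the strictness implied by the default profiles.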