API¶

Package `detextive`¶

Detects textual content.

Module `detextive.charsets`¶

Management of bytes array decoding via trial character sets.

detextive.charsets.attempt_decodes(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), inference=absence.absent, supplement=absence.absent, location=absence.absent)¶

Attempts to decode content with various character sets.

Tries character sets in the order given by the trial_codecs sequence on the behaviors object.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
inference (str | absence.objects.AbsentSingleton)
supplement (str | absence.objects.AbsentSingleton)
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton)

Return type:

tuple[ str, detextive.core.CharsetResult ]
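The ordered trial described above can be illustrated with a minimal standard-library sketch. It is not the package's implementation: the name `attempt_decodes_sketch` and its parameters are hypothetical, and it returns the codec name instead of a `CharsetResult`.

```python
def attempt_decodes_sketch(
    content: bytes,
    codecs_order: tuple[str, ...],
    on_decode_error: str = 'strict',
) -> tuple[str, str]:
    ''' Returns (text, codec name) for the first codec that decodes content.

        Hypothetical stand-in which only illustrates ordered trial decoding.
    '''
    for name in codecs_order:
        try:
            return content.decode(name, errors=on_decode_error), name
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or unknown codec name: try the next codec.
            continue
    raise ValueError('No trial codec could decode the content.')
```

With `codecs_order=('utf-8', 'latin-1')`, bytes that are not valid UTF-8 fall through to the second codec, which is why the order of trial codecs matters.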

detextive.charsets.discover_os_charset_default()¶

Discovers default character set encoding from operating system.

Return type: str

detextive.charsets.normalize_charset(charset)¶

Normalizes character set encoding names.

Parameters: charset (str)
Return type: str
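For comparison, the Python standard library offers similar facilities. The sketch below is an assumption about what such helpers might look like, not the package's code: the `_sketch` names are hypothetical, and the normalized names produced by `codecs.lookup` may differ from those which `normalize_charset` returns.

```python
import codecs
import locale


def discover_os_charset_default_sketch() -> str:
    ''' Returns the locale's preferred encoding, as reported by Python. '''
    # Passing False avoids changing the process locale as a side effect.
    return locale.getpreferredencoding(False)


def normalize_charset_sketch(charset: str) -> str:
    ''' Maps codec aliases to the canonical name of the Python codec. '''
    # codecs.lookup raises LookupError for unknown codec names.
    return codecs.lookup(charset).name
```

Aliases such as `UTF8` and `utf_8` both resolve to the canonical codec name `utf-8` under this approach.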

detextive.charsets.trial_decode_as_confident(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), inference=absence.absent, confidence=0.0, supplement=absence.absent, location=absence.absent)¶

Performs trial decode of content.

Considers desired trial decode behavior and detection confidence.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)
inference (str | absence.objects.AbsentSingleton)
confidence (float)
supplement (str | absence.objects.AbsentSingleton)
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton)

Return type:

detextive.core.CharsetResult

Module `detextive.core`¶

Core types and behaviors.

type detextive.core.BehaviorsArgument = detextive.core.Behaviors¶

class detextive.core.BehaviorTristate(value)¶

Bases: Enum

When to apply behavior.

Variables:

Never (detextive.core.BehaviorTristate)
AsNeeded (detextive.core.BehaviorTristate)
Always (detextive.core.BehaviorTristate)

class detextive.core.Behaviors(*, bytes_quantity_confidence_divisor=1024, charset_detect=BehaviorTristate.AsNeeded, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=DetectFailureActions.Default, charset_promotions=<factory>, mimetype_detect=BehaviorTristate.AsNeeded, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=DetectFailureActions.Default, on_decode_error='strict', text_validate=BehaviorTristate.AsNeeded, text_validate_confidence=0.8, trial_codecs=(CodecSpecifiers.FromInference, CodecSpecifiers.UserSupplement), trial_decode=BehaviorTristate.AsNeeded, trial_decode_confidence=0.8)¶

Bases: DataclassObject

How functions behave.

Variables:

bytes_quantity_confidence_divisor (int) – Minimum number of bytes for full detection confidence.
charset_detect (detextive.core.BehaviorTristate) – When to detect charset from content.
charset_detectors_order (collections.abc.Sequence[ str ]) – Order in which charset detectors should be applied.
charset_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on charset detection failure.
charset_promotions (collections.abc.Mapping[ str, str ]) – Which detected charsets to promote to other charsets. E.g., 7-bit ASCII to UTF-8.
mimetype_detect (detextive.core.BehaviorTristate) – When to detect MIME type from content.
mimetype_detectors_order (collections.abc.Sequence[ str ]) – Order in which MIME type detectors should be applied.
mimetype_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on MIME type detection failure.
on_decode_error (str) – Response to charset decoding errors. Standard values are ‘ignore’, ‘replace’, and ‘strict’. Can also be any other name which has been registered via the ‘register_error’ function in the Python standard library ‘codecs’ module.
text_validate (detextive.core.BehaviorTristate) – When to validate text.
text_validate_confidence (float) – Minimum confidence to skip text validation.
trial_codecs (collections.abc.Sequence[ str | detextive.core.CodecSpecifiers ]) – Sequence of codec names or specifiers.
trial_decode (detextive.core.BehaviorTristate) – When to perform trial decode of content with charset.
trial_decode_confidence (float) – Minimum confidence to skip trial decode.

class detextive.core.CharsetResult(*, charset, confidence)¶

Bases: DataclassObject

Character set encoding with detection confidence.

Variables:

charset (str | None) – Detected character set encoding. May be None.
confidence (float) – Detection confidence from 0.0 to 1.0.

class detextive.core.CodecSpecifiers(value)¶

Bases: Enum

Specifiers for dynamic codecs.

Variables:

FromInference (detextive.core.CodecSpecifiers)
OsDefault (detextive.core.CodecSpecifiers)
PythonDefault (detextive.core.CodecSpecifiers)
UserSupplement (detextive.core.CodecSpecifiers)

class detextive.core.DetectFailureActions(value)¶

Bases: Enum

Possible responses to detection failure.

Variables:

Default (detextive.core.DetectFailureActions)
Error (detextive.core.DetectFailureActions)

class detextive.core.MimetypeResult(*, mimetype, confidence)¶

Bases: DataclassObject

MIME type with detection confidence.

Variables:

mimetype (str) – Detected MIME type.
confidence (float) – Detection confidence from 0.0 to 1.0.

detextive.core.confidence_from_bytes_quantity(content, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8))¶

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors)

Return type:

float
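The reference gives no description for this function. Since `bytes_quantity_confidence_divisor` is documented as the minimum number of bytes for full detection confidence, one plausible reading is a linear ramp capped at 1.0. The sketch below encodes that assumption only; the actual scaling may differ, and the `_sketch` name and `divisor` parameter are hypothetical.

```python
def confidence_from_bytes_quantity_sketch(
    content: bytes, divisor: int = 1024
) -> float:
    ''' Scales confidence linearly with content length, capped at 1.0.

        Assumption: the linear ramp is a guess based on the documented
        meaning of the divisor, not the package's confirmed formula.
    '''
    return min(1.0, len(content) / divisor)
```

Under this reading, 512 bytes with the default divisor of 1024 would yield a confidence of 0.5, and anything at or above 1024 bytes would yield 1.0.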

Module `detextive.decoders`¶

Conversion of bytes arrays to Unicode text.

detextive.decoders.decode(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), profile=Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶

Decodes bytes array to Unicode text.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
profile (detextive.validation.Profile) – Text validation profile for content analysis.
charset_default (str) – Default character set to use when detection fails.
mimetype_default (str) – Default MIME type to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

Return type:

str

Module `detextive.detectors`¶

Core detection function implementations.

type detextive.detectors.CharsetDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]¶

type detextive.detectors.MimetypeDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]¶

detextive.detectors.charset_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'chardet': <function _detect_via_chardet>, 'charset-normalizer': <function _detect_via_charset_normalizer>} )¶

detextive.detectors.mimetype_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'magic': <function _detect_via_magic>, 'puremagic': <function _detect_via_puremagic>} )¶

detextive.detectors.detect_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)¶

Detects character set.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default character set to use when detection fails.
supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

str | None

detextive.detectors.detect_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)¶

Detects character set candidates with confidence scores.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default character set to use when detection fails.
supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

detextive.core.CharsetResult

detextive.detectors.detect_mimetype(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='application/octet-stream', charset=absence.absent, location=absence.absent)¶

Detects most probable MIME type.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default MIME type to use when detection fails.
charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

str

detextive.detectors.detect_mimetype_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='application/octet-stream', charset=absence.absent, location=absence.absent)¶

Detects MIME type candidates with confidence scores.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
default (str) – Default MIME type to use when detection fails.
charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

detextive.core.MimetypeResult

Module `detextive.exceptions`¶

Family of exceptions for package API.

exception detextive.exceptions.CharsetDetectFailure(location=absence.absent)¶

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.CharsetInferFailure(location=absence.absent)¶

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.ContentDecodeFailure(charset, location=absence.absent)¶

Bases: Omnierror, UnicodeError

exception detextive.exceptions.ContentDecodeImpossibility(location=absence.absent)¶

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.MimetypeDetectFailure(location=absence.absent)¶

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.MimetypeInferFailure(location=absence.absent)¶

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.Omnierror(*posargs, **nomargs)¶

Bases: Omniexception, Exception

Base for error exceptions raised by package API.

exception detextive.exceptions.Omniexception(*posargs, **nomargs)¶

Bases: Object, BaseException

Base for all exceptions raised by package API.

exception detextive.exceptions.TextInvalidity(location=absence.absent)¶

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.TextualMimetypeInvalidity(mimetype, location=absence.absent)¶

Bases: Omnierror, ValueError
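Because the package exceptions also derive from builtin exception types, callers can catch them either by the package base or by the builtin. The sketch below mirrors the documented bases with hypothetical stand-in classes (the `Sketch` suffix marks them as not part of the package). The constructor body and message text are assumptions, and the `Object` base of `Omniexception` is omitted.

```python
class OmniexceptionSketch(BaseException):
    ''' Stand-in for the base of all package exceptions. '''


class OmnierrorSketch(OmniexceptionSketch, Exception):
    ''' Stand-in for the base of package error exceptions. '''


class ContentDecodeFailureSketch(OmnierrorSketch, UnicodeError):
    ''' Stand-in with the documented bases: Omnierror, UnicodeError. '''

    def __init__(self, charset, location=None):
        # The message format is an assumption for illustration only.
        super().__init__(f"Could not decode content as {charset!r}.")
        self.charset = charset
        self.location = location


def strict_decode_sketch(content: bytes, charset: str) -> str:
    ''' Decodes content, translating failures into the stand-in error. '''
    try:
        return content.decode(charset)
    except UnicodeDecodeError as exc:
        raise ContentDecodeFailureSketch(charset) from exc
```

An `except UnicodeError` clause in existing code would catch the stand-in failure, as would `except OmnierrorSketch`; this dual inheritance is presumably why the package errors list builtin bases.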

Module `detextive.inference`¶

Inference of MIME types and character sets through various means.

detextive.inference.infer_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)¶

Infers charset through various means.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

str | None

detextive.inference.infer_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)¶

Infers charset with confidence level through various means.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

detextive.core.CharsetResult

detextive.inference.infer_mimetype_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶

Infers MIME type and charset through various means.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
mimetype_default (str) – Default MIME type to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

Return type:

tuple[ str, str | None ]

detextive.inference.infer_mimetype_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)¶

Infers MIME type and charset with confidence levels through various means.

Parameters:

content (bytes) – Raw byte content for analysis.
behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.
charset_default (str) – Default character set to use when detection fails.
mimetype_default (str) – Default MIME type to use when detection fails.
http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.
location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.
charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.
mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

Return type:

tuple[ detextive.core.MimetypeResult, detextive.core.CharsetResult ]

detextive.inference.parse_http_content_type(http_content_type)¶

Parses RFC 9110 HTTP Content-Type header.

Returns normalized MIME type and charset, if able to be extracted. Marks either as absent, if not able to be extracted.

Parameters: http_content_type (str)
Return type: tuple[ str | absence.objects.AbsentSingleton, str | None | absence.objects.AbsentSingleton ]
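To make the header parsing concrete, here is a simplified sketch which splits a Content-Type value into a lowercased MIME type and an optional charset parameter. It is not the package's parser: the `_sketch` name is hypothetical, `None` stands in for the `absent` sentinel, and edge cases of the RFC 9110 grammar (quoted strings containing semicolons, for example) are ignored.

```python
from typing import Optional


def parse_http_content_type_sketch(
    http_content_type: str,
) -> tuple[Optional[str], Optional[str]]:
    ''' Returns (mimetype, charset); None marks a value not found. '''
    # The first segment is the media type; the rest are parameters.
    mimetype, *params = (
        part.strip() for part in http_content_type.split(';'))
    charset = None
    for param in params:
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset':
            # Parameter values may be quoted; names are case-insensitive.
            charset = value.strip().strip('"').lower() or None
    return (mimetype.lower() or None), charset
```

For a header such as `Text/HTML; charset="UTF-8"`, this sketch yields `('text/html', 'utf-8')`; a header without a charset parameter yields `None` in the second position.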

Module `detextive.lineseparators`¶

Line separator enumeration and utilities.

class detextive.lineseparators.LineSeparators(value)¶

Bases: Enum

Line separators for cross-platform text processing.

Variables:

CR (detextive.lineseparators.LineSeparators)
CRLF (detextive.lineseparators.LineSeparators)
LF (detextive.lineseparators.LineSeparators)

classmethod detect_bytes(content, limit=1024)¶

Detects line separator from byte content sample.

Returns detected LineSeparators enum member or None.

classmethod detect_text(text, limit=1024)¶

Detects line separator from text (Unicode string).

Returns detected LineSeparators enum member or None.

nativize(content)¶

Converts Unix LF to this platform’s line separator.

normalize(content)¶

Normalizes specific line separator to Unix LF format.

classmethod normalize_universal(content)¶

Normalizes all line separators to Unix LF format.
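The detection and normalization methods can be pictured with a small stand-in enumeration. This is a sketch under assumptions: the class name `LineSeparatorsSketch` is hypothetical, the member values are guesses (the reference does not list them), and the real detection heuristic may differ from returning the first separator found in the sample.

```python
import enum
from typing import Optional


class LineSeparatorsSketch(enum.Enum):
    ''' Stand-in enumeration; member values are assumed. '''

    CR = '\r'
    CRLF = '\r\n'
    LF = '\n'

    @classmethod
    def detect_text(
        cls, text: str, limit: int = 1024
    ) -> Optional['LineSeparatorsSketch']:
        ''' Returns the first separator found in the sample, else None. '''
        sample = text[:limit]
        for index, char in enumerate(sample):
            if char == '\n':
                return cls.LF
            if char == '\r':
                # A CR immediately followed by LF is one CRLF separator.
                if sample[index + 1:index + 2] == '\n':
                    return cls.CRLF
                return cls.CR
        return None

    def normalize(self, content: str) -> str:
        ''' Converts this member's separator to Unix LF. '''
        if self is LineSeparatorsSketch.LF:
            return content
        return content.replace(self.value, '\n')

    @classmethod
    def normalize_universal(cls, content: str) -> str:
        ''' Converts CRLF and lone CR to Unix LF. '''
        # CRLF must be replaced first so its CR is not doubled into LFs.
        return content.replace('\r\n', '\n').replace('\r', '\n')
```

Replacing CRLF before lone CR in `normalize_universal` is the key ordering detail; reversing it would turn each CRLF into two line feeds.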

Module `detextive.mimetypes`¶

Determination of MIME types and textuality thereof.

detextive.mimetypes.is_textual_mimetype(mimetype)¶

Checks if MIME type represents textual content.

Parameters: mimetype (str)
Return type: bool

detextive.mimetypes.mimetype_from_location(location)¶

Determines MIME type from file location.

Parameters: location (str | os.PathLike[ str ]) – Local filesystem location or URL for context.
Return type: str | absence.objects.AbsentSingleton
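A rough standard-library analogue of these two helpers is sketched below. The `_sketch` names are hypothetical, `None` stands in for the `absent` sentinel, and the textuality rules (a `text/` prefix, a short allow-list, and structured-syntax suffixes) are assumptions; the package may apply different criteria.

```python
import mimetypes
import os
from typing import Optional, Union

# Assumed allow-list of non-"text/" types commonly treated as textual.
_TEXTUAL_TYPES_SKETCH = frozenset({
    'application/json',
    'application/xml',
    'application/javascript',
})
_TEXTUAL_SUFFIXES_SKETCH = ('+json', '+xml', '+yaml')


def mimetype_from_location_sketch(
    location: Union[str, 'os.PathLike[str]'],
) -> Optional[str]:
    ''' Guesses a MIME type from the extension of a path or URL. '''
    mimetype, _encoding = mimetypes.guess_type(os.fspath(location))
    return mimetype


def is_textual_mimetype_sketch(mimetype: str) -> bool:
    ''' Applies assumed heuristics for textual MIME types. '''
    mimetype = mimetype.lower()
    return (
        mimetype.startswith('text/')
        or mimetype in _TEXTUAL_TYPES_SKETCH
        or mimetype.endswith(_TEXTUAL_SUFFIXES_SKETCH)
    )
```

Extension-based guessing only inspects the name, never the content, which is why the package also provides content-based detection in `detextive.detectors`.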

Module `detextive.nomina`¶

Common names and type aliases.

type detextive.nomina.Content = bytes¶

type detextive.nomina.Location = str | os.PathLike[str]¶

type detextive.nomina.CharsetAssumptionArgument = str | absence.objects.AbsentSingleton¶

type detextive.nomina.CharsetDefaultArgument = str¶

type detextive.nomina.CharsetSupplementArgument = str | absence.objects.AbsentSingleton¶

type detextive.nomina.HttpContentTypeArgument = str | absence.objects.AbsentSingleton¶

type detextive.nomina.LocationArgument = str | os.PathLike[str] | absence.objects.AbsentSingleton¶

type detextive.nomina.MimetypeAssumptionArgument = str | absence.objects.AbsentSingleton¶

type detextive.nomina.MimetypeDefaultArgument = str¶

type detextive.nomina.MimetypeSupplementArgument = str | absence.objects.AbsentSingleton¶

Module `detextive.validation`¶

Validation of textual content.

type detextive.validation.ProfileArgument = detextive.validation.Profile¶

detextive.validation.PROFILE_PRINTER_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2069', '\n', '\x0c', '\u202c', '\u202e', '\r', '\u2068', '\u202b', '\u061c', '\u200e', '\u2067', '\u200f', '\u200c', '\u200d', '\u2066', '\t', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶

detextive.validation.PROFILE_TEXTUAL: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶

detextive.validation.PROFILE_TERMINAL_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶

detextive.validation.PROFILE_TERMINAL_SAFE_ANSI: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2069', '\n', '\u202c', '\x1b', '\u202e', '\r', '\u2068', '\u202b', '\u061c', '\u200e', '\u2067', '\u200f', '\u200c', '\u200d', '\u2066', '\t', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶

class detextive.validation.Profile(*, acceptable_characters=frozenset({'\t', '\n', '\r', '\u061c', '\u200c', '\u200d', '\u200e', '\u200f', '\u202a', '\u202b', '\u202c', '\u202d', '\u202e', '\u2066', '\u2067', '\u2068', '\u2069'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Co', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)¶

Bases: DataclassObject

Configuration for text validation heuristics.

Variables:

acceptable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered valid.
check_bom (bool) – Allow leading BOM; reject embedded BOMs.
printables_ratio_min (float) – Minimum ratio of printable characters to total characters.
rejectable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered invalid.
rejectable_families (collections.abc.Set[ str ]) – Set of Unicode categories which are always considered invalid.
rejectables_ratio_max (float) – Maximum ratio of rejectable characters to total characters.
sample_quantity (int | None) – Number of characters to sample.

detextive.validation.is_valid_text(text, /, profile=Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192))¶

Is content valid against profile?

Parameters:

text (str)
profile (detextive.validation.Profile)

Return type:

bool
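How the profile fields might combine can be shown with a simplified sketch. It is an illustration under assumptions, not the package's algorithm: `ProfileSketch` and `is_valid_text_sketch` are hypothetical names, the acceptable-character default is shortened to three whitespace characters, BOM checking is omitted, and treating empty text as valid is a guess.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSketch:
    ''' Reduced stand-in for the validation profile. '''

    acceptable_characters: frozenset = frozenset('\t\n\r')
    rejectable_characters: frozenset = frozenset({'\x7f'})
    rejectable_families: frozenset = frozenset({'Cc', 'Cf', 'Co', 'Cs'})
    printables_ratio_min: float = 0.85
    rejectables_ratio_max: float = 0.0
    sample_quantity: int = 8192


def is_valid_text_sketch(
    text: str, profile: ProfileSketch = ProfileSketch()
) -> bool:
    ''' Checks a sample of text against ratio thresholds. '''
    sample = text[:profile.sample_quantity]
    if not sample:
        return True  # Assumption: empty text counts as valid.
    printables = rejectables = 0
    for char in sample:
        if char in profile.acceptable_characters:
            # Acceptable characters override the rejectable families.
            printables += 1
        elif (
            char in profile.rejectable_characters
            or unicodedata.category(char) in profile.rejectable_families
        ):
            rejectables += 1
        elif char.isprintable():
            printables += 1
    total = len(sample)
    return (
        rejectables / total <= profile.rejectables_ratio_max
        and printables / total >= profile.printables_ratio_min
    )
```

With `rejectables_ratio_max` at its default of 0.0, a single control character such as NUL is enough to fail validation, while tabs and newlines pass because they are checked against the acceptable set first.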
