API

Package detextive

Detects textual content.

Module detextive.charsets

Management of bytes array decoding via trial character sets.

detextive.charsets.attempt_decodes(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), inference=absence.absent, supplement=absence.absent, location=absence.absent)

Attempts to decode content with various character sets.

Tries character sets in the order given by the trial codecs on the behaviors object.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors)

  • inference (str | absence.objects.AbsentSingleton)

  • supplement (str | absence.objects.AbsentSingleton)

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton)

Return type:

tuple[ str, detextive.core.CharsetResult ]
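
The ordered trial described above can be pictured with a minimal sketch that uses only the standard library. It is not the library's implementation: the helper name try_decodes and the plain list of codec names are assumptions made for illustration, standing in for the trial codecs on the behaviors object.

```python
def try_decodes(content: bytes, codecs_in_order: list[str]) -> tuple[str, str]:
    """Returns (text, codec) for the first codec that decodes cleanly."""
    for codec in codecs_in_order:
        try:
            return content.decode(codec, errors='strict'), codec
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or an unknown codec name: try the next one.
            continue
    raise ValueError('No trial codec could decode the content.')


# Latin-1 bytes are not valid UTF-8, so the second codec is the one that wins.
text, codec = try_decodes('café'.encode('latin-1'), ['utf-8', 'latin-1'])
print(text, codec)
```

Order matters in such a scheme: permissive codecs such as Latin-1 accept any byte sequence, so they are best placed last.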

detextive.charsets.discover_os_charset_default()

Discovers default character set encoding from operating system.

Return type:

str
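
For orientation, the standard library exposes similar information directly. The sketch below is an assumption about what such a discovery might look like, not the library's code; the helper name os_charset_default is hypothetical.

```python
import locale
import sys


def os_charset_default() -> str:
    """Returns the locale's preferred encoding, else the filesystem encoding."""
    # Passing False avoids changing the process locale as a side effect.
    return locale.getpreferredencoding(False) or sys.getfilesystemencoding()


print(os_charset_default())
```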

detextive.charsets.normalize_charset(charset)

Normalizes character set encoding names.

Parameters:

charset (str)

Return type:

str
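
Charset names come in many spellings (UTF8, utf_8, Latin-1, ISO-8859-1). One common way to canonicalize them, shown here as a sketch rather than as the library's method, is to ask Python's codec registry for the canonical name. The helper name canonical_charset is hypothetical.

```python
import codecs


def canonical_charset(charset: str) -> str:
    """Returns the codec registry's canonical name for a charset alias."""
    # codecs.lookup raises LookupError for names the registry does not know.
    return codecs.lookup(charset).name


print(canonical_charset('UTF8'))
print(canonical_charset('Latin-1'))
```

Aliases of the same codec resolve to the same canonical name, which makes the result suitable for comparisons and dictionary keys.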

detextive.charsets.trial_decode_as_confident(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), inference=absence.absent, confidence=0.0, supplement=absence.absent, location=absence.absent)

Performs trial decode of content.

Considers desired trial decode behavior and detection confidence.

Return type:

detextive.core.CharsetResult
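
How a tristate setting and a confidence threshold might combine can be sketched as follows. This is an assumption about the gating logic, not a description of the library: only the member AsNeeded (value 2) appears in the signature above, so the members Never and Always, their values, and the helper name should_trial_decode are hypothetical.

```python
import enum


class Tristate(enum.Enum):
    Never = 1      # hypothetical member
    AsNeeded = 2   # name and value as shown in the signature above
    Always = 3     # hypothetical member


def should_trial_decode(
    mode: Tristate, confidence: float, threshold: float = 0.8
) -> bool:
    """Decides whether a trial decode is worth performing."""
    if mode is Tristate.Never:
        return False
    if mode is Tristate.Always:
        return True
    # AsNeeded: only bother when detection was not confident enough.
    return confidence < threshold


print(should_trial_decode(Tristate.AsNeeded, 0.0))
print(should_trial_decode(Tristate.AsNeeded, 0.9))
```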

Module detextive.core

Core types and behaviors.

type detextive.core.BehaviorsArgument = detextive.core.Behaviors

class detextive.core.BehaviorTristate(value)

Bases: Enum

When to apply behavior.

class detextive.core.Behaviors(*, bytes_quantity_confidence_divisor=1024, charset_detect=BehaviorTristate.AsNeeded, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=DetectFailureActions.Default, charset_promotions=<factory>, mimetype_detect=BehaviorTristate.AsNeeded, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=DetectFailureActions.Default, on_decode_error='strict', text_validate=BehaviorTristate.AsNeeded, text_validate_confidence=0.8, trial_codecs=(CodecSpecifiers.FromInference, CodecSpecifiers.UserSupplement), trial_decode=BehaviorTristate.AsNeeded, trial_decode_confidence=0.8)

Bases: DataclassObject

How functions behave.
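
The default charset_promotions mapping shown in the function signatures, {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'}, is easier to appreciate with a small example. The sketch below uses a plain dict and a hypothetical promote helper; when and how the library applies promotions is an assumption here.

```python
PROMOTIONS = {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'}


def promote(charset: str) -> str:
    """Returns the promoted charset, or the original if none is mapped."""
    return PROMOTIONS.get(charset, charset)


content = b'\xef\xbb\xbfhello'  # UTF-8 text with a leading byte order mark

plain = content.decode('utf-8')              # keeps the BOM as U+FEFF
promoted = content.decode(promote('utf-8'))  # 'utf-8-sig' strips the BOM

print(repr(plain), repr(promoted))
```

Because utf-8-sig decodes BOM-less UTF-8 and ASCII identically to utf-8, the promotion is harmless when no BOM is present.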

class detextive.core.CharsetResult(*, charset, confidence)

Bases: DataclassObject

Character set encoding with detection confidence.

Variables:
  • charset (str | None) – Detected character set encoding. May be None.

  • confidence (float) – Detection confidence from 0.0 to 1.0.

class detextive.core.CodecSpecifiers(value)

Bases: Enum

Specifiers for dynamic codecs.

class detextive.core.DetectFailureActions(value)

Bases: Enum

Possible responses to detection failure.

class detextive.core.MimetypeResult(*, mimetype, confidence)

Bases: DataclassObject

MIME type with detection confidence.

Variables:
  • mimetype (str) – Detected MIME type.

  • confidence (float) – Detection confidence from 0.0 to 1.0.

detextive.core.confidence_from_bytes_quantity(content, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8))

Return type:

float
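
No description accompanies this function, but its name and the bytes_quantity_confidence_divisor=1024 default suggest that confidence grows with the amount of content available. One plausible reading, offered purely as an assumption, is a linear ramp capped at 1.0. The helper name below is hypothetical and the library may compute the value differently.

```python
def confidence_from_length(content: bytes, divisor: int = 1024) -> float:
    """Scales confidence linearly with content length, capped at 1.0."""
    return min(1.0, len(content) / divisor)


print(confidence_from_length(b'x' * 512))
print(confidence_from_length(b'x' * 4096))
```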

Module detextive.decoders

Conversion of bytes arrays to Unicode text.

detextive.decoders.decode(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), profile=Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)

Decodes bytes array to Unicode text.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • profile (detextive.validation.Profile) – Text validation profile for content analysis.

  • charset_default (str) – Default character set to use when detection fails.

  • mimetype_default (str) – Default MIME type to use when detection fails.

  • http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

  • charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.

  • mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

Return type:

str

Module detextive.detectors

Core detection function implementations.

type detextive.detectors.CharsetDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]
type detextive.detectors.MimetypeDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]
detextive.detectors.charset_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'chardet': <function _detect_via_chardet>, 'charset-normalizer': <function _detect_via_charset_normalizer>} )
detextive.detectors.mimetype_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'magic': <function _detect_via_magic>, 'puremagic': <function _detect_via_puremagic>} )
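
The detector type aliases above describe callables that return either a result or NotImplemented. A registry of such callables, consulted in a configured order, might work along the following lines. This is a sketch under stated assumptions: the Result class, both detector functions, and the detect helper are stand-ins, and the detectors here take only the content rather than a behaviors object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """Stand-in for a charset result with detection confidence."""
    charset: Optional[str]
    confidence: float


def detect_via_bom(content: bytes):
    """Toy detector: recognizes only a UTF-8 byte order mark."""
    if content.startswith(b'\xef\xbb\xbf'):
        return Result('utf-8-sig', 1.0)
    return Result(None, 0.0)


def detect_via_missing_backend(content: bytes):
    """Stands in for a detector whose optional dependency is unavailable."""
    return NotImplemented


DETECTORS = {'missing': detect_via_missing_backend, 'bom': detect_via_bom}


def detect(content: bytes, order=('missing', 'bom')) -> Result:
    """Returns the result of the first usable detector in the given order."""
    for name in order:
        detector = DETECTORS.get(name)
        if detector is None:
            continue  # Unknown name in the order: skip it.
        result = detector(content)
        if result is NotImplemented:
            continue  # Detector cannot run here: fall through to the next.
        return result
    return Result(None, 0.0)


print(detect(b'\xef\xbb\xbfhi'))
```

Returning NotImplemented, rather than raising, lets an unavailable backend be skipped without exception handling in the dispatch loop.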
detextive.detectors.detect_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)

Detects character set.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • default (str) – Default character set to use when detection fails.

  • supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.

  • mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

str | None

detextive.detectors.detect_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)

Detects character set along with a confidence score.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • default (str) – Default character set to use when detection fails.

  • supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.

  • mimetype (str | absence.objects.AbsentSingleton) – MIME type hint to influence character set detection.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

detextive.core.CharsetResult

detextive.detectors.detect_mimetype(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='application/octet-stream', charset=absence.absent, location=absence.absent)

Detects most probable MIME type.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • default (str) – Default MIME type to use when detection fails.

  • charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

str

detextive.detectors.detect_mimetype_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), default='application/octet-stream', charset=absence.absent, location=absence.absent)

Detects MIME type along with a confidence score.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • default (str) – Default MIME type to use when detection fails.

  • charset (str | absence.objects.AbsentSingleton) – Character set hint to influence MIME type detection.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

detextive.core.MimetypeResult

Module detextive.exceptions

Family of exceptions for package API.

exception detextive.exceptions.CharsetDetectFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.CharsetInferFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.ContentDecodeFailure(charset, location=absence.absent)

Bases: Omnierror, UnicodeError

exception detextive.exceptions.ContentDecodeImpossibility(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.MimetypeDetectFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.MimetypeInferFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.Omnierror(*posargs, **nomargs)

Bases: Omniexception, Exception

Base for error exceptions raised by package API.

exception detextive.exceptions.Omniexception(*posargs, **nomargs)

Bases: Object, BaseException

Base for all exceptions raised by package API.

exception detextive.exceptions.TextInvalidity(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.TextualMimetypeInvalidity(mimetype, location=absence.absent)

Bases: Omnierror, ValueError
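
Because the exceptions listed in this module also derive from built-in exception types, callers can catch them either by package family or by built-in class. The sketch below reproduces part of that shape with local stand-in classes, so that it runs without the package installed. The constructor body and message text are assumptions.

```python
class Omnierror(Exception):
    """Local stand-in for the package's base error class."""


class ContentDecodeFailure(Omnierror, UnicodeError):
    """Local stand-in mirroring the documented bases."""

    def __init__(self, charset, location=None):
        super().__init__(f"Could not decode content as {charset!r}.")
        self.charset = charset
        self.location = location


# Caught through the built-in base class...
try:
    raise ContentDecodeFailure('utf-8')
except UnicodeError as error:
    caught_as_builtin = error

# ...or through the package's own family.
try:
    raise ContentDecodeFailure('utf-8')
except Omnierror as error:
    caught_as_family = error

print(type(caught_as_builtin).__name__, caught_as_family.charset)
```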

Module detextive.inference

Core inference function implementations.

detextive.inference.infer_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)

Infers charset through various means.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • charset_default (str) – Default character set to use when detection fails.

  • http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.

  • charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.

  • mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

str | None

detextive.inference.infer_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)

Infers charset with confidence level through various means.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • charset_default (str) – Default character set to use when detection fails.

  • http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.

  • charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.

  • mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

Return type:

detextive.core.CharsetResult

detextive.inference.infer_mimetype_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)

Infers MIME type and charset through various means.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • charset_default (str) – Default character set to use when detection fails.

  • mimetype_default (str) – Default MIME type to use when detection fails.

  • http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

  • charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.

  • mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

Return type:

tuple[ str, str | None ]

detextive.inference.infer_mimetype_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=<BehaviorTristate.AsNeeded: 2>, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, charset_promotions=frigid.dictionaries.Dictionary( {'ascii': 'utf-8-sig', 'utf-8': 'utf-8-sig'} ), mimetype_detect=<BehaviorTristate.AsNeeded: 2>, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.UserSupplement: 4>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)

Infers MIME type and charset with confidence levels through various means.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • behaviors (detextive.core.Behaviors) – Configuration for detection and inference behaviors.

  • charset_default (str) – Default character set to use when detection fails.

  • mimetype_default (str) – Default MIME type to use when detection fails.

  • http_content_type (str | absence.objects.AbsentSingleton) – HTTP Content-Type header for parsing context.

  • location (str | os.PathLike[ str ] | absence.objects.AbsentSingleton) – File location or URL for error reporting context.

  • charset_supplement (str | absence.objects.AbsentSingleton) – Supplemental character set to use for trial decodes.

  • mimetype_supplement (str | absence.objects.AbsentSingleton) – Supplemental MIME type to use for inference.

Return type:

tuple[ detextive.core.MimetypeResult, detextive.core.CharsetResult ]

detextive.inference.parse_http_content_type(http_content_type)

Parses RFC 9110 HTTP Content-Type header.

Returns the normalized MIME type and charset when they can be extracted. Either value is marked as absent when it cannot be extracted.

Parameters:

http_content_type (str)

Return type:

tuple[ str | absence.objects.AbsentSingleton, str | None | absence.objects.AbsentSingleton ]
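
For comparison, a Content-Type header can be split into MIME type and charset with the standard library's email package. This sketch is not the library's parser: the helper name parse_content_type is hypothetical, it returns None where the function above would use the absent sentinel, and email.message falls back to text/plain for malformed values.

```python
from email.message import Message
from typing import Optional


def parse_content_type(header: str) -> tuple[str, Optional[str]]:
    """Returns (mimetype, charset) from a Content-Type header value."""
    message = Message()
    message['Content-Type'] = header
    # Both accessors lowercase their results; charset is None when missing.
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type('text/html; charset=UTF-8'))
print(parse_content_type('application/json'))
```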

Module detextive.lineseparators

Line separator enumeration and utilities.

class detextive.lineseparators.LineSeparators(value)

Bases: Enum

Line separators for cross-platform text processing.

classmethod detect_bytes(content, limit=1024)

Detects line separator from byte content sample.

Returns detected LineSeparators enum member or None.

classmethod detect_text(text, limit=1024)

Detects line separator from text (Unicode string).

Returns detected LineSeparators enum member or None.

nativize(content)

Converts Unix LF to this platform’s line separator.

normalize(content)

Normalizes specific line separator to Unix LF format.

classmethod normalize_universal(content)

Normalizes all line separators to Unix LF format.
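
The behavior of these methods can be illustrated with a small stand-in enum that operates on text. The member names CR, CRLF, and LF are assumptions, since the member list is not shown above, and the detection rule used here (first separator found within the sample) is only one plausible strategy.

```python
import enum
from typing import Optional


class Separators(enum.Enum):
    """Stand-in enum; member names are assumptions, not the library's."""

    CR = '\r'
    CRLF = '\r\n'
    LF = '\n'

    @classmethod
    def detect_text(cls, text: str, limit: int = 1024) -> Optional['Separators']:
        """Returns the first separator found within the sample, if any."""
        sample = text[:limit]
        for index, char in enumerate(sample):
            if char == '\n':
                return cls.LF
            if char == '\r':
                # CR immediately followed by LF is a single CRLF separator.
                if sample[index + 1:index + 2] == '\n':
                    return cls.CRLF
                return cls.CR
        return None

    def normalize(self, content: str) -> str:
        """Converts this member's separator to LF."""
        if self is Separators.LF:
            return content
        return content.replace(self.value, '\n')

    def nativize(self, content: str) -> str:
        """Converts LF to this member's separator."""
        if self is Separators.LF:
            return content
        return content.replace('\n', self.value)

    @classmethod
    def normalize_universal(cls, content: str) -> str:
        """Converts CRLF and lone CR to LF; CRLF must be handled first."""
        return content.replace('\r\n', '\n').replace('\r', '\n')


separator = Separators.detect_text('first\r\nsecond\r\n')
print(separator, repr(separator.normalize('first\r\nsecond\r\n')))
```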

Module detextive.mimetypes

Determination of MIME types and textuality thereof.

detextive.mimetypes.is_textual_mimetype(mimetype)

Checks if MIME type represents textual content.

Parameters:

mimetype (str)

Return type:

bool
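
Many textual formats are registered outside the text/ tree (JSON, XML, and structured-syntax suffixes such as +xml), so a textuality check usually needs more than a prefix test. The following sketch shows one way such a check could look. The lists of types and suffixes are illustrative assumptions, not the library's actual rules.

```python
TEXTUAL_APPLICATION_TYPES = frozenset({
    'application/javascript',
    'application/json',
    'application/xml',
})
TEXTUAL_SUFFIXES = ('+json', '+xml', '+yaml')


def looks_textual(mimetype: str) -> bool:
    """Heuristically decides whether a MIME type denotes textual content."""
    mimetype = mimetype.strip().lower()
    if mimetype.startswith('text/'):
        return True
    if mimetype in TEXTUAL_APPLICATION_TYPES:
        return True
    # Structured-syntax suffixes mark text-based serializations.
    return mimetype.endswith(TEXTUAL_SUFFIXES)


print(looks_textual('image/svg+xml'), looks_textual('image/png'))
```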

detextive.mimetypes.mimetype_from_location(location)

Determines MIME type from file location.

Parameters:

location (str | os.PathLike[ str ]) – Local filesystem location or URL for context.

Return type:

str | absence.objects.AbsentSingleton
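
The standard library's mimetypes module performs a comparable extension-based lookup. The sketch below is an analogy rather than the library's implementation: the helper name guess_mimetype is hypothetical, and it returns None where the function above would return the absent sentinel.

```python
import mimetypes
import os
import pathlib
from typing import Optional, Union


def guess_mimetype(location: Union[str, os.PathLike]) -> Optional[str]:
    """Guesses a MIME type from the extension of a path or URL."""
    mimetype, _encoding = mimetypes.guess_type(os.fspath(location))
    return mimetype


print(guess_mimetype('notes.txt'))
print(guess_mimetype(pathlib.Path('data.json')))
print(guess_mimetype('https://example.com/index.html'))
```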

Module detextive.nomina

Common names and type aliases.

type detextive.nomina.Content = bytes
type detextive.nomina.Location = str | os.PathLike[str]
type detextive.nomina.CharsetAssumptionArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.CharsetDefaultArgument = str
type detextive.nomina.CharsetSupplementArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.HttpContentTypeArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.LocationArgument = str | os.PathLike[str] | absence.objects.AbsentSingleton
type detextive.nomina.MimetypeAssumptionArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.MimetypeDefaultArgument = str
type detextive.nomina.MimetypeSupplementArgument = str | absence.objects.AbsentSingleton

Module detextive.validation

Validation of textual content.

type detextive.validation.ProfileArgument = detextive.validation.Profile
detextive.validation.PROFILE_PRINTER_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2069', '\n', '\x0c', '\u202c', '\u202e', '\r', '\u2068', '\u202b', '\u061c', '\u200e', '\u2067', '\u200f', '\u200c', '\u200d', '\u2066', '\t', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)
detextive.validation.PROFILE_TEXTUAL: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)
detextive.validation.PROFILE_TERMINAL_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)
detextive.validation.PROFILE_TERMINAL_SAFE_ANSI: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202a', '\u2069', '\n', '\u202c', '\x1b', '\u202e', '\r', '\u2068', '\u202b', '\u061c', '\u200e', '\u2067', '\u200f', '\u200c', '\u200d', '\u2066', '\t', '\u202d'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Zp', 'Cf', 'Cc', 'Cs', 'Co', 'Zl'}), rejectables_ratio_max=0.0, sample_quantity=8192)
class detextive.validation.Profile(*, acceptable_characters=frozenset({'\t', '\n', '\r', '\u061c', '\u200c', '\u200d', '\u200e', '\u200f', '\u202a', '\u202b', '\u202c', '\u202d', '\u202e', '\u2066', '\u2067', '\u2068', '\u2069'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Co', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)

Bases: DataclassObject

Configuration for text validation heuristics.

Variables:
  • acceptable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered valid.

  • check_bom (bool) – Allow leading BOM; reject embedded BOMs.

  • printables_ratio_min (float) – Minimum ratio of printable characters to total characters.

  • rejectable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered invalid.

  • rejectable_families (collections.abc.Set[ str ]) – Set of Unicode categories which are always considered invalid.

  • rejectables_ratio_max (float) – Maximum ratio of rejectable characters to total characters.

  • sample_quantity (int | None) – Number of characters to sample.

detextive.validation.is_valid_text(text, /, profile=Profile(acceptable_characters=frozenset({'\u202a', '\u2067', '\u200f', '\u2069', '\n', '\u200c', '\u202c', '\u200d', '\r', '\u2066', '\u202e', '\u2068', '\u202b', '\u061c', '\t', '\u200e', '\u202d'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Co', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192))

Is content valid against profile?

Return type:

bool
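
The profile fields documented above suggest a ratio-based heuristic over a sample of characters. A simplified sketch of that idea follows. It is an assumption about how the fields interact, not the library's algorithm: the helper name looks_like_text is hypothetical, only a few acceptable characters are listed, BOM checking is omitted, and treating empty text as invalid is an arbitrary choice.

```python
import unicodedata

ACCEPTABLE_CHARACTERS = frozenset('\t\n\r')
REJECTABLE_CHARACTERS = frozenset('\x7f')
REJECTABLE_FAMILIES = frozenset({'Cc', 'Cf', 'Co', 'Cs'})


def looks_like_text(
    text: str, *,
    printables_ratio_min: float = 0.85,
    rejectables_ratio_max: float = 0.0,
    sample_quantity: int = 8192,
) -> bool:
    """Applies ratio thresholds to a leading sample of the text."""
    sample = text[:sample_quantity]
    if not sample:
        return False  # Arbitrary choice for this sketch.
    printables = 0
    rejectables = 0
    for char in sample:
        if char in ACCEPTABLE_CHARACTERS:
            # Explicitly acceptable characters override their category.
            printables += 1
        elif (
            char in REJECTABLE_CHARACTERS
            or unicodedata.category(char) in REJECTABLE_FAMILIES
        ):
            rejectables += 1
        elif char.isprintable():
            printables += 1
    total = len(sample)
    if rejectables / total > rejectables_ratio_max:
        return False
    return printables / total >= printables_ratio_min


print(looks_like_text('Hello, world!\n'))
print(looks_like_text('abc\x00def'))
```

With rejectables_ratio_max at 0.0, a single control character outside the acceptable set is enough to reject the sample, which is why binary data decoded with a permissive codec tends to fail quickly.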