API

Package detextive

Detects textual content.

Module detextive.charsets

Management of bytes array decoding via trial character sets.

detextive.charsets.attempt_decodes(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), inference=absence.absent, supplement=absence.absent, location=absence.absent, validator=absence.absent)

Attempts to decode content with various character sets.

Tries character sets in the order given by trial_codecs on the behaviors object.

Return type:

tuple[ str, detextive.core.CharsetResult ]
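
The ordered trial described above can be sketched with the standard library alone. This is an illustrative approximation, not the package's implementation: the helper name try_decodes and its plain sequence of codec names are assumptions, and the documented function additionally accepts CodecSpecifiers entries and returns a CharsetResult.

```python
from typing import Iterable, Tuple


def try_decodes(content: bytes, codec_names: Iterable[str]) -> Tuple[str, str]:
    """Returns (text, codec name) for the first codec that decodes strictly.

    Hypothetical helper for illustration; not part of detextive.
    """
    for name in codec_names:
        try:
            return content.decode(name, errors="strict"), name
        except (UnicodeDecodeError, LookupError):
            # Wrong codec for these bytes, or unknown codec name: try the next.
            continue
    raise ValueError("no trial codec could decode the content")
```

Order matters: a permissive codec such as Latin-1 accepts any byte sequence, so it is usually placed after stricter candidates such as UTF-8.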

detextive.charsets.discover_os_charset_default()

Discovers default character set encoding from operating system.

Return type:

str
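
For orientation, the standard library exposes a comparable query. The sketch below is an assumption about what such a discovery might look like, not a statement of how the package performs it; os_default_charset is a hypothetical name.

```python
import locale


def os_default_charset() -> str:
    """Reports the locale's preferred encoding (hypothetical helper)."""
    # Passing False avoids calling setlocale() as a side effect.
    return locale.getpreferredencoding(False)
```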

detextive.charsets.normalize_charset(charset, bom_cognizant=False)

Normalizes character set encoding names.

Parameters:
  • charset (str)

  • bom_cognizant (bool)

Return type:

str
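
As a rough illustration of name normalization, the standard codecs registry maps aliases onto a canonical codec name. The helper below is a sketch under that assumption; it does not model the bom_cognizant flag, and the package's own canonical names may differ.

```python
import codecs


def canonical_charset(name: str) -> str:
    """Maps a charset alias to Python's canonical codec name.

    Hypothetical helper; raises LookupError for unknown names.
    """
    return codecs.lookup(name).name
```

With this sketch, "UTF8" and "utf_8" both resolve to "utf-8", and "latin_1" resolves to the same name as "ISO-8859-1".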

detextive.charsets.normalize_charset_for_content(content, charset)

Normalizes charset reporting based on byte-order mark provenance.

Parameters:
  • content (bytes) – Raw byte content for analysis.

  • charset (str)

Return type:

str
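
Byte-order mark provenance can be illustrated by checking content against the BOM constants in the standard codecs module. This is only a sketch of BOM sniffing: the helper name and the charset names it reports are assumptions, and the documented function may report charsets differently.

```python
import codecs
from typing import Optional

# UTF-32 marks are checked first because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def charset_from_bom(content: bytes) -> Optional[str]:
    """Returns the charset implied by a leading BOM, or None without one.

    Hypothetical helper for illustration; not part of detextive.
    """
    for bom, charset in _BOMS:
        if content.startswith(bom):
            return charset
    return None
```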

detextive.charsets.trial_decode_as_confident(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), inference=absence.absent, confidence=0.0, supplement=absence.absent, location=absence.absent)

Performs trial decode of content.

Considers desired trial decode behavior and detection confidence.

Return type:

detextive.core.CharsetResult
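
The interplay of a tristate behavior and a confidence threshold can be sketched as follows. Only the AsNeeded member and the 0.8 default appear in this reference; the Tristate enum, its other member names, and should_trial_decode are hypothetical stand-ins chosen for illustration.

```python
from enum import Enum, auto


class Tristate(Enum):
    """Hypothetical stand-in for a when-to-apply setting."""
    NEVER = auto()
    AS_NEEDED = auto()
    ALWAYS = auto()


def should_trial_decode(
    mode: Tristate, confidence: float, threshold: float = 0.8
) -> bool:
    """Decides whether a trial decode is warranted (illustrative only)."""
    if mode is Tristate.ALWAYS:
        return True
    if mode is Tristate.NEVER:
        return False
    # As needed: skip the trial once confidence reaches the threshold.
    return confidence < threshold
```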

Module detextive.core

Core types and behaviors.

type detextive.core.BehaviorsArgument = detextive.core.Behaviors
class detextive.core.BehaviorTristate(value)

Bases: Enum

When to apply behavior.

class detextive.core.Behaviors(*, bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=DetectFailureActions.Default, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=DetectFailureActions.Default, on_decode_error='strict', remove_bom=True, text_validate=BehaviorTristate.AsNeeded, text_validate_confidence=0.8, trial_codecs=(CodecSpecifiers.UserSupplement, 'utf-8', CodecSpecifiers.FromInference, CodecSpecifiers.OsDefault, CodecSpecifiers.PythonDefault), trial_decode=BehaviorTristate.AsNeeded, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False)

Bases: DataclassObject

How functions behave.

Variables:
  • bytes_quantity_confidence_divisor (int) – Minimum number of bytes for full detection confidence.

  • charset_detect (bool) – Whether to detect charset from content.

  • charset_detectors_order (collections.abc.Sequence[ str ]) – Order in which charset detectors should be applied.

  • charset_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on charset detection failure.

  • mimetype_detect (bool) – Whether to detect MIME type from content.

  • mimetype_detectors_order (collections.abc.Sequence[ str ]) – Order in which MIME type detectors should be applied.

  • mimetype_on_detect_failure (detextive.core.DetectFailureActions) – Action to take on MIME type detection failure.

  • on_decode_error (str) –

    Response to charset decoding errors.

    Standard values are ‘ignore’, ‘replace’, and ‘strict’. Can also be any other name which has been registered via the ‘register_error’ function in the Python standard library ‘codecs’ module.

  • remove_bom (bool) – Whether to remove the byte-order mark (BOM).

  • text_validate (detextive.core.BehaviorTristate) – When to validate text.

  • text_validate_confidence (float) – Minimum confidence to skip text validation.

  • trial_codecs (collections.abc.Sequence[ str | detextive.core.CodecSpecifiers ]) – Sequence of codec names or specifiers.

  • trial_decode (detextive.core.BehaviorTristate) – When to perform trial decode of content with charset.

  • trial_decode_confidence (float) – Minimum confidence to skip trial decode.

  • utf_16_32_requires_byte_order (bool) – Whether to require an explicit byte order for generic UTF-16/32 content that lacks a BOM.
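
The on_decode_error values correspond to the error handlers of the standard codecs module. The sketch below shows the three standard handlers and a custom one registered through codecs.register_error; the handler name 'hexescape' and the helper decode_with are invented for this example.

```python
import codecs


def hex_escape(error: UnicodeError):
    """Replaces each undecodable byte with a visible <xx> marker."""
    if isinstance(error, UnicodeDecodeError):
        bad = error.object[error.start:error.end]
        return "".join(f"<{byte:02x}>" for byte in bad), error.end
    raise error


# 'hexescape' is a name invented for this sketch.
codecs.register_error("hexescape", hex_escape)

RAW = b"caf\xe9"  # Latin-1 bytes; the final byte is invalid as UTF-8.


def decode_with(handler: str) -> str:
    """Decodes RAW as UTF-8 using the named error handler."""
    return RAW.decode("utf-8", errors=handler)
```

Here decode_with("ignore") drops the offending byte, decode_with("replace") substitutes U+FFFD, decode_with("hexescape") yields "caf<e9>", and decode_with("strict") raises UnicodeDecodeError.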

class detextive.core.CharsetResult(*, charset, confidence)

Bases: DataclassObject

Character set encoding with detection confidence.

Variables:
  • charset (str | None) – Detected character set encoding. May be None.

  • confidence (float) – Detection confidence from 0.0 to 1.0.

class detextive.core.CodecSpecifiers(value)

Bases: Enum

Specifiers for dynamic codecs.

class detextive.core.DetectFailureActions(value)

Bases: Enum

Possible responses to detection failure.

class detextive.core.MimetypeResult(*, mimetype, confidence)

Bases: DataclassObject

MIME type with detection confidence.

Variables:
  • mimetype (str) – Detected MIME type.

  • confidence (float) – Detection confidence from 0.0 to 1.0.

detextive.core.confidence_from_bytes_quantity(content, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False))
Parameters:
Return type:

float
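
Given that bytes_quantity_confidence_divisor is described as the minimum number of bytes for full detection confidence, one plausible reading is a linear ramp capped at 1.0. The formula below is an assumption made for illustration, not a confirmed description of the function.

```python
def confidence_from_quantity(content: bytes, divisor: int = 1024) -> float:
    """Scales confidence linearly with content size, capped at 1.0.

    Hypothetical helper; the linear ramp is an assumed formula.
    """
    return min(1.0, len(content) / divisor)
```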

Module detextive.decoders

Conversion of bytes arrays to Unicode text.

class detextive.decoders.DecodeInformResult(*, text, charset, mimetype, linesep)

Bases: DataclassObject

Decoded text with supplemental inference metadata.

detextive.decoders.decode(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), profile=Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192), http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent)

Decodes bytes array to Unicode text.

Uses trial decoding and validation; does not provide default-return semantics. The charset_supplement parameter is a trial hint and not a fallback return value.

Return type:

str

detextive.decoders.decode_inform(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), profile=Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192), mimetype_default='text/plain', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent)

Decodes bytes and returns supplemental inference metadata.

Return type:

detextive.decoders.DecodeInformResult

Module detextive.detectors

Core detection function implementations.

type detextive.detectors.CharsetDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]
type detextive.detectors.MimetypeDetector = collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]
detextive.detectors.charset_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.CharsetResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'chardet': <function _detect_via_chardet>, 'charset-normalizer': <function _detect_via_charset_normalizer>} )
detextive.detectors.mimetype_detectors: accretive.dictionaries.Dictionary[str, collections.abc.Callable[[bytes, detextive.core.Behaviors], detextive.core.MimetypeResult | builtins.NotImplementedType]] = accretive.dictionaries.Dictionary( {'magic': <function _detect_via_magic>, 'puremagic': <function _detect_via_puremagic>} )
detextive.detectors.detect_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)

Detects character set.

Return type:

str | None
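
The detector registries and the detectors-order settings suggest a pattern in which named callables are consulted in sequence and a detector may decline by returning NotImplemented. The sketch below imitates that pattern with toy detectors; every name in it is hypothetical, and it makes no claim about how the package iterates its registries or when a real detector declines.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Result = Tuple[str, float]  # (charset, confidence)


def detect_ascii(content: bytes):
    """Toy detector: claims ASCII or declines."""
    try:
        content.decode("ascii")
    except UnicodeDecodeError:
        return NotImplemented
    return ("ascii", 1.0)


def detect_utf8(content: bytes):
    """Toy detector: claims UTF-8 or declines."""
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        return NotImplemented
    return ("utf-8", 0.9)


# Hypothetical registry keyed by detector name.
DETECTORS: Dict[str, Callable[[bytes], object]] = {
    "ascii": detect_ascii,
    "utf-8": detect_utf8,
}


def first_detection(content: bytes, order: Sequence[str]) -> Optional[Result]:
    """Returns the first result from detectors consulted in order."""
    for name in order:
        detector = DETECTORS.get(name)
        if detector is None:
            continue  # Unknown name: skip it.
        result = detector(content)
        if result is NotImplemented:
            continue  # Detector declined: try the next one.
        return result
    return None
```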

detextive.detectors.detect_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='utf-8', supplement=absence.absent, mimetype=absence.absent, location=absence.absent)

Detects character set with a confidence score.

Return type:

detextive.core.CharsetResult

detextive.detectors.detect_mimetype(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='application/octet-stream', charset=absence.absent, location=absence.absent)

Detects most probable MIME type.

Return type:

str

detextive.detectors.detect_mimetype_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), default='application/octet-stream', charset=absence.absent, location=absence.absent)

Detects MIME type with a confidence score.

Return type:

detextive.core.MimetypeResult

Module detextive.exceptions

Family of exceptions for package API.

exception detextive.exceptions.BehaviorsInvalidity(attribute, expectation)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.CharsetDetectFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.CharsetInferFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.ContentDecodeFailure(charset, location=absence.absent)

Bases: Omnierror, UnicodeError

exception detextive.exceptions.ContentDecodeImpossibility(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.MimetypeDetectFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.MimetypeInferFailure(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.Omnierror(*posargs, **nomargs)

Bases: Omniexception, Exception

Base for error exceptions raised by package API.

exception detextive.exceptions.Omniexception(*posargs, **nomargs)

Bases: Omniexception

Base for all exceptions raised by package API.

exception detextive.exceptions.TextInvalidity(location=absence.absent)

Bases: Omnierror, TypeError, ValueError

exception detextive.exceptions.TextualMimetypeInvalidity(mimetype, location=absence.absent)

Bases: Omnierror, ValueError
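
Because these exceptions combine a package base class with built-in exception types, callers can catch them either way. The sketch below reproduces the inheritance pattern with local stand-in classes; PackageError and DecodeFailure are hypothetical names, not the package's classes.

```python
class PackageError(Exception):
    """Stand-in for a package-wide error base (hypothetical)."""


class DecodeFailure(PackageError, UnicodeError):
    """Stand-in that mixes the package base with a built-in type."""

    def __init__(self, charset: str):
        super().__init__(f"Could not decode content as {charset!r}.")


def caught_as(exception_type: type) -> bool:
    """Reports whether DecodeFailure is caught by the given type."""
    try:
        raise DecodeFailure("utf-8")
    except exception_type:
        return True
```

Code that already handles UnicodeError or ValueError keeps working, while code that wants to treat all package errors uniformly can catch the package base class.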

Module detextive.inference

Inference of MIME types and charsets from content and supplied hints.

detextive.inference.infer_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)

Infers charset through various means.

charset_default is the returned fallback when inference cannot determine another charset. charset_supplement is a user-supplied hint used during inference/validation.

Return type:

str | None

detextive.inference.infer_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', http_content_type=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent, location=absence.absent)

Infers charset with confidence level through various means.

charset_default is the returned fallback when inference cannot determine another charset. charset_supplement is a user-supplied hint used during inference/validation. http_content_type is parsed when supplied, independent of detector enablement behavior.

Return type:

detextive.core.CharsetResult

detextive.inference.infer_mimetype_charset(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)

Infers MIME type and charset through various means.

*_default values are returned fallbacks on inference failure. *_supplement values are user-supplied hints used to guide inference before fallback behavior is applied.

Return type:

tuple[ str, str | None ]

detextive.inference.infer_mimetype_charset_confidence(content, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False), charset_default='utf-8', mimetype_default='application/octet-stream', http_content_type=absence.absent, location=absence.absent, charset_supplement=absence.absent, mimetype_supplement=absence.absent)

Infers MIME type and charset, each with a confidence level, through various means.

Return type:

tuple[ detextive.core.MimetypeResult, detextive.core.CharsetResult ]

detextive.inference.parse_http_content_type(http_content_type)

Parses RFC 9110 HTTP Content-Type header.

Returns the normalized MIME type and charset when they can be extracted. Either value that cannot be extracted is marked as absent.

Parameters:

http_content_type (str)

Return type:

tuple[ str | absence.objects.AbsentSingleton, str | None | absence.objects.AbsentSingleton ]
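
For comparison, a Content-Type header can be split into MIME type and charset with the standard email package. This sketch is not the package's parser: split_content_type is a hypothetical name, and it returns None for a missing charset where the documented function uses the absent sentinel.

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(header: str) -> Tuple[str, Optional[str]]:
    """Extracts (mimetype, charset) from a Content-Type header value.

    Hypothetical helper built on the standard library.
    """
    message = Message()
    message["Content-Type"] = header
    # Both accessors return lowercased values.
    return message.get_content_type(), message.get_content_charset()
```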

detextive.inference.validate_httpct_charset(content, charset, /, *, behaviors=Behaviors(bytes_quantity_confidence_divisor=1024, charset_detect=True, charset_detectors_order=('chardet', 'charset-normalizer'), charset_on_detect_failure=<DetectFailureActions.Default: 1>, mimetype_detect=True, mimetype_detectors_order=('magic', 'puremagic'), mimetype_on_detect_failure=<DetectFailureActions.Default: 1>, on_decode_error='strict', remove_bom=True, text_validate=<BehaviorTristate.AsNeeded: 2>, text_validate_confidence=0.8, trial_codecs=(<CodecSpecifiers.UserSupplement: 4>, 'utf-8', <CodecSpecifiers.FromInference: 1>, <CodecSpecifiers.OsDefault: 2>, <CodecSpecifiers.PythonDefault: 3>), trial_decode=<BehaviorTristate.AsNeeded: 2>, trial_decode_confidence=0.8, utf_16_32_requires_byte_order=False))
Parameters:
Return type:

detextive.core.CharsetResult | absence.objects.AbsentSingleton

Module detextive.lineseparators

Line separator enumeration and utilities.

class detextive.lineseparators.LineSeparators(value)

Bases: Enum

Line separators for cross-platform text processing.

classmethod detect_bytes(content, limit=1024)

Detects line separator from byte content sample.

Returns detected LineSeparators enum member or None.

classmethod detect_text(text, limit=1024)

Detects line separator from text (Unicode string).

Returns detected LineSeparators enum member or None.

nativize(content)

Converts Unix LF to this platform’s line separator.

normalize(content)

Normalizes specific line separator to Unix LF format.

classmethod normalize_universal(content)

Normalizes all line separators to Unix LF format.
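
Detection and universal normalization of line separators can be sketched without the enum. The helpers below are hypothetical and simplified: detect_linesep returns a plain string rather than an enum member, and a CR that falls on the last byte of the sample is reported as a bare CR even if an LF follows beyond the limit.

```python
from typing import Optional


def detect_linesep(content: bytes, limit: int = 1024) -> Optional[str]:
    """Returns the first line separator seen in the sample, or None."""
    sample = content[:limit]
    for index, byte in enumerate(sample):  # Iterating bytes yields ints.
        if byte == 0x0D:  # CR: check for a following LF.
            if sample[index + 1:index + 2] == b"\n":
                return "\r\n"
            return "\r"
        if byte == 0x0A:  # LF
            return "\n"
    return None


def to_unix_newlines(text: str) -> str:
    """Converts CRLF and bare CR to LF; CRLF must be replaced first."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```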

Module detextive.mimetypes

Determination of MIME types and textuality thereof.

detextive.mimetypes.is_textual_mimetype(mimetype)

Checks if MIME type represents textual content.

Parameters:

mimetype (str)

Return type:

bool

detextive.mimetypes.mimetype_from_location(location)

Determines MIME type from file location.

Parameters:

location (str | os.PathLike[ str ]) – Local filesystem location or URL for context.

Return type:

str | absence.objects.AbsentSingleton
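
A location-based lookup of this kind can be approximated with the standard mimetypes module, which guesses from the filename extension of a path or URL. The helper name is hypothetical, and it returns None where the documented function returns the absent sentinel.

```python
import mimetypes
import os
from typing import Optional, Union


def guess_mimetype(location: Union[str, "os.PathLike[str]"]) -> Optional[str]:
    """Guesses a MIME type from a path or URL extension (illustrative)."""
    mimetype, _encoding = mimetypes.guess_type(os.fspath(location))
    return mimetype
```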

Module detextive.nomina

Common names and type aliases.

type detextive.nomina.Content = bytes
type detextive.nomina.Location = str | os.PathLike[str]
type detextive.nomina.CharsetAssumptionArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.CharsetDefaultArgument = str
type detextive.nomina.CharsetSupplementArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.HttpContentTypeArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.LocationArgument = str | os.PathLike[str] | absence.objects.AbsentSingleton
type detextive.nomina.MimetypeAssumptionArgument = str | absence.objects.AbsentSingleton
type detextive.nomina.MimetypeDefaultArgument = str
type detextive.nomina.MimetypeSupplementArgument = str | absence.objects.AbsentSingleton

Module detextive.validation

Validation of textual content.

type detextive.validation.ProfileArgument = detextive.validation.Profile
detextive.validation.PROFILE_PRINTER_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u061c', '\u2066', '\n', '\u2069', '\u200f', '\u2068', '\u200c', '\t', '\x0c', '\u2067', '\r', '\u200d', '\u202b', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Zp', 'Co', 'Zl', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)
detextive.validation.PROFILE_TEXTUAL: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192)
detextive.validation.PROFILE_TERMINAL_SAFE: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Zp', 'Co', 'Zl', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)
detextive.validation.PROFILE_TERMINAL_SAFE_ANSI: detextive.validation.Profile = Profile(acceptable_characters=frozenset({'\u202d', '\u061c', '\u2066', '\n', '\u2069', '\u200f', '\u2068', '\u200c', '\t', '\x1b', '\u2067', '\r', '\u200d', '\u202b', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=False, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Zp', 'Co', 'Zl', 'Cf', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)
class detextive.validation.Profile(*, acceptable_characters=frozenset({'\t', '\n', '\r', '\u061c', '\u200c', '\u200d', '\u200e', '\u200f', '\u202a', '\u202b', '\u202c', '\u202d', '\u202e', '\u2066', '\u2067', '\u2068', '\u2069'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Co', 'Cs'}), rejectables_ratio_max=0.0, sample_quantity=8192)

Bases: DataclassObject

Configuration for text validation heuristics.

Variables:
  • acceptable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered valid.

  • check_bom (bool) – Allow leading BOM; reject embedded BOMs.

  • printables_ratio_min (float) – Minimum ratio of printable characters to total characters.

  • rejectable_characters (collections.abc.Set[ str ]) – Set of characters which are always considered invalid.

  • rejectable_families (collections.abc.Set[ str ]) – Set of Unicode categories which are always considered invalid.

  • rejectables_ratio_max (float) – Maximum ratio of rejectable characters to total characters.

  • sample_quantity (int | None) – Number of characters to sample.

detextive.validation.is_valid_text(text, /, profile=Profile(acceptable_characters=frozenset({'\u202d', '\u2068', '\u061c', '\u2066', '\t', '\u200c', '\u2067', '\r', '\n', '\u200d', '\u2069', '\u202b', '\u200f', '\u200e', '\u202e', '\u202a', '\u202c'}), check_bom=True, printables_ratio_min=0.85, rejectable_characters=frozenset({'\x7f'}), rejectable_families=frozenset({'Cc', 'Cf', 'Cs', 'Co'}), rejectables_ratio_max=0.0, sample_quantity=8192))

Checks whether text is valid against profile.

Return type:

bool
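
The Profile fields suggest a heuristic based on character sets, Unicode categories, and ratios over a sample. The sketch below is one way such a check could be written with the standard unicodedata module. It is an assumption-laden approximation, not the package's algorithm: the function name, the reduced set of acceptable characters, and the treatment of empty text as invalid are all choices made for this example.

```python
import unicodedata

# Reduced stand-ins for the profile's character sets (assumptions).
ACCEPTABLE = frozenset("\t\n\r")
REJECTABLE_CHARACTERS = frozenset("\x7f")
REJECTABLE_FAMILIES = frozenset({"Cc", "Cf", "Co", "Cs"})


def looks_like_text(
    text: str,
    printables_ratio_min: float = 0.85,
    rejectables_ratio_max: float = 0.0,
    sample_quantity: int = 8192,
) -> bool:
    """Applies ratio-based heuristics to a leading sample of the text."""
    sample = text[:sample_quantity]
    if not sample:
        return False  # Assumption: empty input is not treated as text.
    printables = rejectables = 0
    for character in sample:
        if character in ACCEPTABLE:
            printables += 1  # Always-valid characters count as printable.
        elif (
            character in REJECTABLE_CHARACTERS
            or unicodedata.category(character) in REJECTABLE_FAMILIES
        ):
            rejectables += 1
        elif character.isprintable():
            printables += 1
    total = len(sample)
    return (
        rejectables / total <= rejectables_ratio_max
        and printables / total >= printables_ratio_min
    )
```

With the default rejectables_ratio_max of 0.0, a single control character such as NUL in the sample is enough to fail the check.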