
edsnlp.utils.doc_to_text

aggregate_tokens (cached)

Aggregate token strings, computed from their attr attribute, into a single string, optionally ignoring excluded tokens (such as pollution tokens) and/or space tokens. It also returns the start and end offsets of each token in the aggregated string, as well as a bytes array flagging which tokens were kept. A bytes array is used because it is faster to index than a list, and it also supports reverse (negative) indexing.

Parameters

PARAMETER DESCRIPTION
doc

Doc to aggregate tokens from.

TYPE: Doc

attr

Token attribute to read the strings from.

TYPE: str

ignore_excluded

Whether to skip excluded tokens, by default False

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens, by default False

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Tuple[str, List[int], List[int], bytes]

The aggregated text, the start offsets, the end offsets, and the bytes array indicating which tokens were kept.
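To make the offset bookkeeping concrete, here is a simplified pure-Python sketch of the aggregation logic. The Token class and its excluded/is_space flags are illustrative stand-ins for spaCy/EDS-NLP objects, and the attr plumbing is omitted; this is not the library's actual implementation:

```python
from typing import List, Tuple


class Token:
    """Illustrative stand-in for a spaCy token (not the real class)."""

    def __init__(self, text: str, whitespace: str = " ",
                 excluded: bool = False, is_space: bool = False):
        self.text = text               # value of the chosen attribute
        self.whitespace_ = whitespace  # trailing whitespace
        self.excluded = excluded       # e.g. a pollution token
        self.is_space = is_space


def aggregate_tokens_sketch(
    tokens: List[Token],
    ignore_excluded: bool = False,
    ignore_space_tokens: bool = False,
) -> Tuple[str, List[int], List[int], bytes]:
    parts: List[str] = []
    begins: List[int] = []
    ends: List[int] = []
    kept = bytearray(len(tokens))  # 1 if the token made it into the text
    offset = 0
    for i, tok in enumerate(tokens):
        skipped = (ignore_excluded and tok.excluded) or (
            ignore_space_tokens and tok.is_space
        )
        begins.append(offset)
        if skipped:
            # Skipped tokens get an empty [offset, offset) span
            ends.append(offset)
            continue
        kept[i] = 1
        offset += len(tok.text)
        ends.append(offset)
        parts.append(tok.text + tok.whitespace_)
        offset += len(tok.whitespace_)
    return "".join(parts), begins, ends, bytes(kept)
```

Note that indexing a bytes object is cheap and accepts negative indices, which is the stated reason the kept-token mask is returned as bytes rather than a list.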

get_text

Get text using a custom attribute, possibly ignoring excluded tokens.

Parameters

PARAMETER DESCRIPTION
doclike

Doc or Span to get text from.

TYPE: Union[Doc, Span]

attr

Attribute to use.

TYPE: str

ignore_excluded

Whether to skip excluded tokens, by default False

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens, by default False

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
str

Extracted text.
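Conceptually, get_text reads the selected attribute of each non-skipped token and concatenates the results with their trailing whitespace. A minimal sketch using the same illustrative Token stand-in (the attr selection is omitted; only TEXT-like behaviour is shown, which is an assumption, not the library's actual code):

```python
from typing import List


class Token:
    """Illustrative stand-in for a spaCy token (not the real class)."""

    def __init__(self, text: str, whitespace: str = " ",
                 excluded: bool = False, is_space: bool = False):
        self.text = text
        self.whitespace_ = whitespace
        self.excluded = excluded
        self.is_space = is_space


def get_text_sketch(
    tokens: List[Token],
    ignore_excluded: bool = False,
    ignore_space_tokens: bool = False,
) -> str:
    parts = []
    for tok in tokens:
        if ignore_excluded and tok.excluded:
            continue
        if ignore_space_tokens and tok.is_space:
            continue
        parts.append(tok.text + tok.whitespace_)
    # Drop the trailing whitespace of the last kept token
    return "".join(parts).rstrip()
```

With ignore_excluded=True, a pollution token in the middle of a sentence simply disappears from the extracted text, while the surrounding tokens remain separated by their own whitespace.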

get_char_offsets

Get the character offsets of the document's tokens in the "cleaned" (aggregated) text.

Parameters

PARAMETER DESCRIPTION
doclike

Doc or Span to get text from.

TYPE: Union[Doc, Span]

attr

Attribute to use.

TYPE: str

ignore_excluded

Whether to skip excluded tokens, by default False

TYPE: bool DEFAULT: False

ignore_space_tokens

Whether to skip space tokens, by default False

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Tuple[List[int], List[int]]

An alignment tuple: the lists of start and end character offsets of each token in the cleaned text.
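The alignment can be sketched as follows, again with an illustrative Token stand-in rather than the library's actual code. Skipped tokens receive an empty span at the current offset, so every token still has an entry in both lists:

```python
from typing import List, Tuple


class Token:
    """Illustrative stand-in for a spaCy token (not the real class)."""

    def __init__(self, text: str, whitespace: str = " ",
                 excluded: bool = False, is_space: bool = False):
        self.text = text
        self.whitespace_ = whitespace
        self.excluded = excluded
        self.is_space = is_space


def get_char_offsets_sketch(
    tokens: List[Token],
    ignore_excluded: bool = False,
    ignore_space_tokens: bool = False,
) -> Tuple[List[int], List[int]]:
    begins: List[int] = []
    ends: List[int] = []
    offset = 0
    for tok in tokens:
        skipped = (ignore_excluded and tok.excluded) or (
            ignore_space_tokens and tok.is_space
        )
        begins.append(offset)
        if not skipped:
            offset += len(tok.text)
        ends.append(offset)
        if not skipped:
            offset += len(tok.whitespace_)
    return begins, ends
```

A typical use of such an alignment is remapping a span from token indices to character positions in the cleaned text, e.g. begins[span_start] and ends[span_end - 1] for a span covering tokens span_start to span_end.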