`edsnlp.utils.doc_to_text`

`aggregate_tokens` `cached`

Aggregate the token strings, read from the attribute named by `attr`, into a single string, optionally skipping excluded tokens (such as pollution tokens) and/or space tokens. The function also returns the start and end offsets of each token in the aggregated string, along with a bytes array indicating which tokens were kept. A bytes array is used because it is faster to index and also supports reverse indexing.

Parameters

PARAMETER	DESCRIPTION
`doc`	TYPE: `Doc`
`attr`	TYPE: `str`
`ignore_excluded`	TYPE: `bool` DEFAULT: `False`
`ignore_space_tokens`	TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Tuple[str, List[int], List[int], bytes]`	The aggregated text, the start offsets, the end offsets, and the bytes array indicating which tokens were kept.
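
To make the four return values concrete, here is a minimal pure-Python sketch of this kind of aggregation. It is not the library's implementation: `Tok` and `aggregate_sketch` are hypothetical names that stand in for spaCy tokens and for `aggregate_tokens`, and giving skipped tokens a zero-length span at the current cursor is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Tok:
    """Hypothetical stand-in for a token (not the spaCy class)."""

    text: str                # string value read for the chosen attribute
    whitespace: str          # trailing whitespace, e.g. "" or " "
    excluded: bool = False   # e.g. a pollution token
    is_space: bool = False   # e.g. a newline token


def aggregate_sketch(
    tokens: Sequence[Tok],
    ignore_excluded: bool = False,
    ignore_space_tokens: bool = False,
) -> Tuple[str, List[int], List[int], bytes]:
    parts: List[str] = []
    begins: List[int] = []
    ends: List[int] = []
    keep = bytearray()
    cursor = 0  # current length of the aggregated string

    for tok in tokens:
        skipped = (ignore_excluded and tok.excluded) or (
            ignore_space_tokens and tok.is_space
        )
        keep.append(0 if skipped else 1)
        begins.append(cursor)
        if not skipped:
            parts.append(tok.text)
            cursor += len(tok.text)
        # Assumption: a skipped token gets a zero-length span at the cursor.
        ends.append(cursor)
        if not skipped:
            parts.append(tok.whitespace)
            cursor += len(tok.whitespace)

    return "".join(parts), begins, ends, bytes(keep)
```

Because the mask is a `bytes` object, `keep[i]` is a plain integer lookup and `keep[-1]` reads the flag of the last token directly.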

`get_text`

Get text using a custom attribute, possibly ignoring excluded tokens.

Parameters

PARAMETER	DESCRIPTION
`doclike`	Doc or Span to get text from. TYPE: `Union[Doc, Span]`
`attr`	Attribute to use. TYPE: `str`
`ignore_excluded`	Whether to skip excluded tokens, by default False. TYPE: `bool`
`ignore_space_tokens`	Whether to skip space tokens. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`str`	Extracted text.

`get_char_offsets`

Get the character offsets of the tokens of `doclike` in the "cleaned" text, that is, the text obtained after skipping excluded and/or space tokens.

Parameters

PARAMETER	DESCRIPTION
`doclike`	Doc or Span to get text from. TYPE: `Union[Doc, Span]`
`attr`	Attribute to use. TYPE: `str`
`ignore_excluded`	Whether to skip excluded tokens, by default False. TYPE: `bool`
`ignore_space_tokens`	Whether to skip space tokens. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Tuple[List[int], List[int]]`	An alignment tuple: clean start/end offsets lists.
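
One plausible use of these two lists is to map a character match found in the cleaned text back to token indices. The sketch below does this with a binary search from the standard library. `char_span_to_token_span` is a hypothetical helper, not part of `edsnlp`, and the offset lists are hand-written sample values that assume skipped tokens carry zero-length spans.

```python
import re
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def char_span_to_token_span(
    starts: List[int],
    ends: List[int],
    char_start: int,
    char_end: int,
) -> Tuple[int, int]:
    """Return (first, last_exclusive) token indices covering the
    character range [char_start, char_end) of the cleaned text.

    Both offset lists must be sorted in non-decreasing order.
    """
    # First token whose end offset lies strictly after char_start.
    first = bisect_right(ends, char_start)
    # First token whose start offset is at or after char_end.
    last = bisect_left(starts, char_end)
    return first, last


# Sample values (assumed): four tokens, the two middle ones were skipped.
clean_text = "Patient admis"
starts = [0, 8, 8, 8]
ends = [7, 8, 8, 13]

match = re.search(r"admis", clean_text)
token_span = char_span_to_token_span(starts, ends, match.start(), match.end())
```

With a real document, the resulting pair could then be used to slice the original `Doc`, since the lists keep one entry per token whether or not the token was kept.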
