edsnlp.utils.doc_to_text
aggregate_tokens
cached
Aggregate token strings, computed from their attr attribute, into a single string, optionally skipping excluded tokens (such as pollution tokens) and/or space tokens. The function also returns the start and end offsets of each token in the aggregated string, along with a bytes array indicating which tokens were kept. A bytes array is used because it is faster to index and also supports reverse indexing.
Parameters
PARAMETER | DESCRIPTION |
---|---|
doc | |
attr | |
ignore_excluded | |
ignore_space_tokens | |
RETURNS | DESCRIPTION |
---|---|
Tuple[str, List[int], List[int], bytes] | The aggregated text, the start offsets, the end offsets, and the bytes array indicating which tokens were kept. |
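The return value described above can be illustrated with a minimal, library-free sketch. It is not the edsnlp implementation: the `Tok` class and the `aggregate` function are hypothetical stand-ins, and the choice to give skipped tokens a zero-width offset (so that all lists stay aligned with token indices) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tok:
    """Hypothetical stand-in for a token (not a spaCy or edsnlp type)."""

    text: str               # value of the chosen attribute
    whitespace: str = " "   # trailing whitespace, "" if none
    excluded: bool = False  # e.g. a pollution token
    is_space: bool = False  # e.g. a newline token


def aggregate(
    tokens: List[Tok],
    ignore_excluded: bool = False,
    ignore_space_tokens: bool = False,
) -> Tuple[str, List[int], List[int], bytes]:
    parts: List[str] = []
    begins: List[int] = []
    ends: List[int] = []
    keep = bytearray(len(tokens))  # one byte per token: 1 = kept, 0 = skipped
    cursor = 0
    for i, tok in enumerate(tokens):
        skip = (ignore_excluded and tok.excluded) or (
            ignore_space_tokens and tok.is_space
        )
        if skip:
            # Assumption: a skipped token gets a zero-width span at the cursor.
            begins.append(cursor)
            ends.append(cursor)
            continue
        keep[i] = 1
        begins.append(cursor)
        cursor += len(tok.text)
        ends.append(cursor)
        parts.append(tok.text + tok.whitespace)
        cursor += len(tok.whitespace)
    return "".join(parts), begins, ends, bytes(keep)


tokens = [Tok("Patient"), Tok("====", excluded=True), Tok("admis", whitespace="")]
text, begins, ends, keep = aggregate(tokens, ignore_excluded=True)

# The bytes mask can be searched from either end.
first_kept = keep.index(1)
last_kept = keep.rindex(1)
```

Because `bytes` exposes both `index` and `rindex`, finding the nearest kept token before or after a given position is a single call, which is one plausible reading of the "reverse indexing" remark above.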
get_text
Get text using a custom attribute, possibly ignoring excluded tokens.
Parameters
PARAMETER | DESCRIPTION |
---|---|
doclike | Doc or Span to get text from. |
attr | Attribute to use. |
ignore_excluded | Whether to skip excluded tokens. Defaults to False. |
ignore_space_tokens | Whether to skip space tokens. Defaults to False. |
RETURNS | DESCRIPTION |
---|---|
str | Extracted text. |
get_char_offsets
Get the character offsets of the doc tokens in the "cleaned" text.
Parameters
PARAMETER | DESCRIPTION |
---|---|
doclike | Doc or Span to get text from. |
attr | Attribute to use. |
ignore_excluded | Whether to skip excluded tokens. Defaults to False. |
ignore_space_tokens | Whether to skip space tokens. Defaults to False. |
RETURNS | DESCRIPTION |
---|---|
Tuple[List[int], List[int]] | An alignment tuple: the lists of token start offsets and token end offsets in the cleaned text. |
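A typical use for such an alignment is mapping a character span found in the cleaned text (for instance, a regex match) back to token indices. The sketch below uses only the standard library; the sample offsets and the `char_span_to_tokens` helper are hypothetical and not part of edsnlp. It assumes both offset lists are sorted in non-decreasing order.

```python
import re
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# Hypothetical cleaned text and the token offsets that go with it.
clean_text = "Patient admis hier"
tokens = ["Patient", "admis", "hier"]
starts = [0, 8, 14]
ends = [7, 13, 18]


def char_span_to_tokens(
    starts: List[int], ends: List[int], char_start: int, char_end: int
) -> Tuple[int, int]:
    """Return the half-open token range [i, j) overlapping the character span."""
    first = bisect_right(ends, char_start)  # first token ending after char_start
    last = bisect_left(starts, char_end)    # first token starting at or after char_end
    return first, last


match = re.search(r"admis hier", clean_text)
first, last = char_span_to_tokens(starts, ends, match.start(), match.end())
matched_tokens = tokens[first:last]
```

Binary search keeps the lookup logarithmic in the number of tokens, which matters when many matches are mapped back onto a long document.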