edsnlp.data.converters

Converters are used to convert documents between Python dictionaries and Doc objects. There are two types of converters: readers, which convert dictionaries to Doc objects, and writers, which convert Doc objects to dictionaries.

AttributesMappingArg

Bases: Validated

A span attribute mapping (a list may be passed instead to keep the same names).

For instance:

  • doc_attributes="note_datetime" will map the note_datetime JSON attribute to the note_datetime extension.
  • span_attributes=["negation", "family"] will map the negation and family JSON attributes to the negation and family extensions.
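As a sketch of how the two forms line up, a list is equivalent to an identity mapping. The normalization function below is illustrative only, not edsnlp's actual validator:

```python
# Illustrative sketch (not edsnlp's actual AttributesMappingArg validator):
# a string or list form is normalized into the mapping form with identical names.
def normalize_attributes_mapping(value):
    """Accept a dict {source: extension}, a list of names, or a single name."""
    if isinstance(value, str):
        value = [value]
    if isinstance(value, (list, tuple)):
        return {name: name for name in value}
    return dict(value)

print(normalize_attributes_mapping("note_datetime"))
# → {'note_datetime': 'note_datetime'}
print(normalize_attributes_mapping(["negation", "family"]))
# → {'negation': 'negation', 'family': 'family'}
```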

StandoffDict2DocConverter [source]

Why does BRAT/Standoff need a converter?

You may wonder: why do I need a converter? Since BRAT is already an NLP-oriented format, it should be straightforward to convert it to a Doc object.

Indeed, we do provide a default converter for the BRAT standoff format, but we also acknowledge that there may be more than one way to convert a standoff document to a Doc object. For instance, an annotated span may be used to represent a relation between two smaller included entities, or another entity scope, etc.

In such cases, we recommend you use a custom converter as described here.
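For example, one might pre-process the raw annotations before the default conversion, reinterpreting a wide annotated span as an attribute on the entities it covers. The sketch below is purely illustrative: the dict shape and the negation_scope label are assumptions for the example, not the exact edsnlp standoff schema.

```python
# Hypothetical sketch: reinterpret a BRAT span labeled "negation_scope" as a
# "negated" attribute on the entities it covers, before building a Doc.
def mark_negated_entities(record):
    scopes = [e for e in record["entities"] if e["label"] == "negation_scope"]
    entities = [e for e in record["entities"] if e["label"] != "negation_scope"]
    for ent in entities:
        # An entity is negated if it lies entirely inside some negation scope
        ent["negated"] = any(
            s["begin"] <= ent["begin"] and ent["end"] <= s["end"] for s in scopes
        )
    return {**record, "entities": entities}

record = {
    "text": "no sign of pneumonia",
    "entities": [
        {"label": "negation_scope", "begin": 0, "end": 20},
        {"label": "disease", "begin": 11, "end": 20},
    ],
}
print(mark_negated_entities(record)["entities"])
# → [{'label': 'disease', 'begin': 11, 'end': 20, 'negated': True}]
```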

Examples

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.read_standoff(
    "path/to/standoff",
    converter="standoff",  # set by default

    # Optional parameters
    tokenizer=tokenizer,
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    keep_raw_attribute_values=False,
    default_attributes={"negated": False, "temporality": "present"},
)

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object (optional and likely not needed; prefer to pass the tokenizer argument directly instead).

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each JSON entity label.

TYPE: SpanSetterArg DEFAULT: {'ents': True, '*': True}

span_attributes

Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg] DEFAULT: None

keep_raw_attribute_values

Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans).

TYPE: bool DEFAULT: False

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporality" is often "present"), that annotators may not want to annotate every time.

TYPE: AttributesMappingArg DEFAULT: {}

notes_as_span_attribute

If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name.

TYPE: Optional[str] DEFAULT: None

split_fragments

Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span.

TYPE: bool DEFAULT: True
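As an illustration of the keep_raw_attribute_values option above, here is a hedged sketch of the kind of string-to-Python coercion that option disables (the exact rules applied by edsnlp may differ):

```python
# Illustrative sketch of attribute-value coercion, as applied when
# keep_raw_attribute_values=False. The actual edsnlp rules may differ.
def coerce_attribute(value):
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    if lowered in ("none", "null"):
        return None
    try:
        return int(value)
    except ValueError:
        return value  # keep as string when nothing else applies

print(coerce_attribute("True"))     # → True (a bool, not the string "True")
print(coerce_attribute("3"))        # → 3
print(coerce_attribute("present"))  # → present
```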

StandoffDoc2DictConverter [source]

Examples

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
edsnlp.data.write_standoff(
    docs,
    converter="standoff",  # set by default

    # Optional parameters
    span_getter={"ents": True},
    span_attributes=["negation"],
)
# or docs.to_standoff(...) if docs is already a
# Stream (edsnlp.core.stream.Stream)

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: Optional[SpanGetterArg] DEFAULT: {'ents': True}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}

ConllDict2DocConverter [source]

TODO

OmopDict2DocConverter [source]

Examples

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    df,
    converter="omop",

    # Optional parameters
    tokenizer=tokenizer,
    doc_attributes=["note_datetime"],

    # Parameters below should only matter if you plan to import entities
    # from the dataframe. If the data doesn't contain pre-annotated
    # entities, you can ignore these.
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    default_attributes={"negated": False, "temporality": "present"},
)

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object (optional and likely not needed; prefer to pass the tokenizer argument directly instead).

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each JSON entity label.

TYPE: SpanSetterArg DEFAULT: {'ents': True, '*': True}

doc_attributes

Mapping from JSON attributes to additional Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name.

TYPE: AttributesMappingArg DEFAULT: {'note_datetime': 'note_datetime'}

span_attributes

Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg] DEFAULT: None

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporality" is often "present"), that annotators may not want to annotate every time.

TYPE: AttributesMappingArg DEFAULT: {}

OmopDoc2DictConverter [source]

Examples

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
df = edsnlp.data.to_pandas(
    docs,
    converter="omop",

    # Optional parameters
    span_getter={"ents": True},
    doc_attributes=["note_datetime"],
    span_attributes=["negation", "family"],
)
# or docs.to_pandas(...) if docs is already a
# Stream (edsnlp.core.stream.Stream)

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: SpanGetterArg DEFAULT: {'ents': True}

doc_attributes

Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported.

TYPE: AttributesMappingArg DEFAULT: {}

EntsDoc2DictConverter [source]

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: SpanGetterArg DEFAULT: {'ents': True}

doc_attributes

Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported.

TYPE: AttributesMappingArg DEFAULT: {}

MarkupToDocConverter [source]

Examples

import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
# If input items are dicts, the converter expects a "text" key/column.
docs = list(
    edsnlp.data.from_iterable(
        [
            "This [is](VERB negation=True) not a [test](NOUN).",
            "This is another [test](NOUN).",
        ],
        converter="markup",
        span_setter="entities",
    ),
)
print(docs[0].spans["entities"])
# Out: [is, test]

You can also use it directly on a string:

from edsnlp.data.converters import MarkupToDocConverter

converter = MarkupToDocConverter(
    span_setter={"verb": "VERB", "noun": "NOUN"},
    preset="xml",
)
doc = converter("This <VERB negation=True>is</VERB> not a <NOUN>test</NOUN>.")
print(doc.spans["verb"])
# Out: [is]
print(doc.spans["verb"][0]._.negation)
# Out: True

Parameters

PARAMETER DESCRIPTION
preset

The preset to use for the markup format. Defaults to "md" (Markdown-like syntax). Use "xml" for XML-like syntax.

TYPE: Literal['md', 'xml'] DEFAULT: 'md'

opener

The regex pattern to match the opening tag of the markup. Defaults to the preset's opener.

TYPE: Optional[str] DEFAULT: None

closer

The regex pattern to match the closing tag of the markup. Defaults to the preset's closer.

TYPE: Optional[str] DEFAULT: None

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute and creates a new span group for each JSON entity label.

TYPE: SpanSetterArg DEFAULT: {'ents': True, '*': True}

span_attributes

Mapping from markup attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg] DEFAULT: None

keep_raw_attribute_values

Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans).

TYPE: bool DEFAULT: False

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporality" is often "present"), that annotators may not want to annotate every time.

TYPE: AttributesMappingArg DEFAULT: {}

bool_attributes

List of boolean attributes to set to False by default. This is useful for attributes that are often left unannotated but for which you still want a default value.

TYPE: AsList[str] DEFAULT: []
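To make the opener/closer idea concrete, here is a simplified, stand-alone sketch of md-preset parsing. The regex and attribute handling below are assumptions for illustration; the converter's actual patterns are more general:

```python
import re

# Simplified sketch of md-preset parsing: extract (text, label, attributes)
# triples from "[text](LABEL key=value ...)" markup. Not edsnlp's actual regex.
PATTERN = re.compile(r"\[([^\]]+)\]\(([A-Za-z_]+)((?:\s+\w+=\S+)*)\)")

def extract_spans(markup):
    spans = []
    for m in PATTERN.finditer(markup):
        text, label, raw_attrs = m.group(1), m.group(2), m.group(3)
        # "negation=True" pairs become a {name: raw string value} dict
        attrs = dict(pair.split("=", 1) for pair in raw_attrs.split())
        spans.append((text, label, attrs))
    return spans

print(extract_spans("This [is](VERB negation=True) not a [test](NOUN)."))
# → [('is', 'VERB', {'negation': 'True'}), ('test', 'NOUN', {})]
```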

DocToMarkupConverter [source]

Convert a Doc to a string with inline markup.

This is the inverse of MarkupToDocConverter. It renders selected spans as either Markdown-like tags ([text](LABEL key=val ...)) or XML-like tags (<LABEL key=val ...>text</LABEL>).
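As a stand-alone illustration of this rendering, including the default_attributes filtering described below, the sketch uses plain (begin, end, label, attrs) tuples in place of Doc spans; it is not the converter's actual implementation:

```python
# Illustrative md-style rendering: plain tuples stand in for Doc spans.
def render_md(text, spans, defaults=None):
    defaults = defaults or {}
    out, cursor = [], 0
    for begin, end, label, attrs in sorted(spans):
        # Attributes equal to their default value are omitted from the output
        shown = {k: v for k, v in attrs.items() if defaults.get(k) != v}
        tag = " ".join([label, *(f"{k}={v}" for k, v in shown.items())])
        out.append(text[cursor:begin])
        out.append(f"[{text[begin:end]}]({tag})")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

print(render_md(
    "This is not a test.",
    [(5, 7, "VERB", {"negation": True}), (14, 18, "NOUN", {})],
    defaults={"negation": False},
))
# → This [is](VERB negation=True) not a [test](NOUN).
```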

Parameters

PARAMETER DESCRIPTION
span_getter

Which spans to render from the document.

TYPE: SpanGetterArg DEFAULT: {"ents": True}

span_attributes

Mapping from Span extensions (or builtins like label_, kb_id_) to attribute names in the rendered markup. Only attributes with a non-None value are emitted.

TYPE: AttributesMappingArg DEFAULT: {}

default_attributes

When an attribute equals its provided default value, it is omitted from the output (e.g., avoid printing negated=False when False is the default).

TYPE: AttributesMappingArg DEFAULT: {}

preset

Output syntax. "md" produces the Markdown-like form, "xml" the XML-like form.

TYPE: Literal['md', 'xml'] DEFAULT: "md"

HfTextDict2DocConverter [source]

Converter for HuggingFace datasets where each example is a single text field.

This converter expects the dataset examples to contain a single column with the document text (default: "text"). It tokenizes the text using the provided tokenizer (or the current context tokenizer) and returns a Doc object. If the example contains an id column (default: "id") it will be stored as doc._.note_id.

Examples

import edsnlp

docs = edsnlp.data.from_huggingface_dataset(
    "wikimedia/wikipedia",
    name="20231101.ady",
    split="train",
    converter="hf_text",
    id_column="id",
    text_column="text",
)

Parameters

PARAMETER DESCRIPTION
tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

text_column

Column name containing the document text.

TYPE: str

id_column

Column name containing the document id.

TYPE: Optional[str] DEFAULT: None

HfTextDoc2DictConverter [source]

Doc -> dict converter for simple text datasets.

Outputs a dict with the configured id_column and text_column.

HfNerDict2DocConverter [source]

Converter for HuggingFace NER datasets (e.g., WikiNER, CoNLL-2003).

Examples

import edsnlp

docs = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    id_column="id",
    words_column="tokens",
    ner_tags_column="ner_tags",
    tag_order=[
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ],
    converter="hf_ner",
)

Parameters

PARAMETER DESCRIPTION
tokenizer

Optional tokenizer.

TYPE: Optional[Tokenizer] DEFAULT: None

words_column

Column with token words.

TYPE: str

ner_tags_column

Column with token-level tags.

TYPE: str

id_column

Column to use for doc id.

TYPE: Optional[str] DEFAULT: None

tag_map

Mapping from tag ids to labels. If provided, it is used as-is. If not, you may pass tag_order, a sequence of labels (e.g. ['O', 'B-PER', 'I-PER', ...]), used to build the mapping as {i: label for i, label in enumerate(tag_order)}. If neither is provided, tag ids are stringified.

TYPE: Optional[Mapping[Any, str]] DEFAULT: None

tag_order

Optional sequence of labels used to build tag_map when tag_map is not provided.

TYPE: Optional[Sequence[str]] DEFAULT: None

span_setter

Span setter (defaults to {"ents": True}).

TYPE: Optional[SpanSetterArg] DEFAULT: None
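The tag_order fallback described under tag_map amounts to building the mapping by enumeration:

```python
# Building tag_map from tag_order, as described for the tag_map parameter.
tag_order = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]
tag_map = {i: label for i, label in enumerate(tag_order)}
print(tag_map[1])
# → B-PER
```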

HfNerDoc2DictConverter [source]

Doc -> dict converter for token-level NER datasets used by HuggingFace.

Produces a dict with token list in words_column, token tags in ner_tags_column, and an identifier in id_column.