edsnlp.data.converters

Converters are used to convert documents between Python dictionaries and Doc objects. There are two types of converters: readers and writers. Readers convert dictionaries to Doc objects, and writers convert Doc objects to dictionaries.

AttributesMappingArg

Bases: Validated

An attribute mapping. It can also be given as a list, in which case each attribute keeps the same name.

For instance:

  • doc_attributes="note_datetime" will map the note_datetime JSON attribute to the note_datetime extension.
  • span_attributes=["negation", "family"] will map the negation and family JSON attributes to the negation and family extensions.
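
The accepted shapes (a single name, a list of names, or a dictionary) can be pictured as one normalized dictionary. The sketch below is a plain-Python illustration of that idea, not edsnlp's actual validation code; `normalize_attributes_mapping` is a hypothetical helper introduced only for this example.

```python
from typing import Dict, List, Union


# Hypothetical helper: NOT part of edsnlp, it only illustrates the idea.
def normalize_attributes_mapping(
    value: Union[str, List[str], Dict[str, str]],
) -> Dict[str, str]:
    """Turn a name, a list of names or a mapping into a {source: target} dict."""
    if isinstance(value, str):
        # A single name maps to itself
        return {value: value}
    if isinstance(value, dict):
        # An explicit mapping is kept as-is (copied to avoid aliasing)
        return dict(value)
    # A list keeps the same name on both sides
    return {name: name for name in value}


print(normalize_attributes_mapping("note_datetime"))
print(normalize_attributes_mapping(["negation", "family"]))
print(normalize_attributes_mapping({"negation": "negated"}))
```

With a dictionary, the source and target names may differ, which is what `span_attributes={"negation": "negated"}` relies on in the examples on this page.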

StandoffDict2DocConverter [source]

Why does BRAT/Standoff need a converter?

You may wonder: why do I need a converter? Since BRAT is already an NLP-oriented format, it should be straightforward to convert it to a Doc object.

Indeed, we do provide a default converter for the BRAT standoff format, but we also acknowledge that there may be more than one way to convert a standoff document to a Doc object. For instance, an annotated span may be used to represent a relation between two smaller included entities, or another entity scope, etc.

In such cases, we recommend you use a custom converter as described here.

Examples

import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.read_standoff(
    "path/to/standoff",
    converter="standoff",  # set by default

    # Optional parameters
    tokenizer=tokenizer,
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    keep_raw_attribute_values=False,
    default_attributes={"negated": False, "temporality": "present"},
)

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object (optional and likely not needed; prefer passing the tokenizer argument directly instead).

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed, since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each JSON entity label.

TYPE: SpanSetterArg DEFAULT: {'ents': True, '*': True}
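
One way to read the default `{'ents': True, '*': True}` is that every imported entity is stored twice: once in the ents attribute, and once in a span group named after its label. The sketch below models this with plain tuples instead of spaCy spans; `route_spans` is a hypothetical helper, and the exact behavior of edsnlp's span setter is an assumption here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (start_char, end_char, label): a stand-in for a spaCy Span
Entity = Tuple[int, int, str]


# Hypothetical helper: NOT part of edsnlp, it only illustrates the idea.
def route_spans(
    entities: List[Entity],
) -> Tuple[List[Entity], Dict[str, List[Entity]]]:
    """Mimic span_setter={"ents": True, "*": True} on plain tuples."""
    ents: List[Entity] = []
    groups: Dict[str, List[Entity]] = defaultdict(list)
    for entity in entities:
        ents.append(entity)  # "ents": True -> every entity goes to ents
        groups[entity[2]].append(entity)  # "*": True -> one group per label
    return ents, dict(groups)


ents, groups = route_spans([(0, 5, "disease"), (10, 14, "drug"), (20, 27, "disease")])
print(ents)
print(groups)
```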

span_attributes

Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg] DEFAULT: None

keep_raw_attribute_values

Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans).

TYPE: bool DEFAULT: False
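
To make the difference concrete, here is a minimal sketch of what converting raw string values to Python objects could look like. The conversion rules shown (booleans, then integers, then floats) are an assumption made for illustration; `parse_raw_value` is a hypothetical helper and edsnlp's own rules may differ.

```python
from typing import Union


# Hypothetical helper: NOT part of edsnlp, the rules below are assumptions.
def parse_raw_value(raw: str) -> Union[bool, int, float, str]:
    """Convert a raw annotation value to a Python object when possible."""
    lowered = raw.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)
    except ValueError:
        # Anything else is kept as the original string
        return raw


for raw in ["True", "false", "3", "present"]:
    print(repr(raw), "->", repr(parse_raw_value(raw)))
```

With `keep_raw_attribute_values=True`, no such conversion would take place and every value would stay a string.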

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or for frequent attribute values (e.g. "negated" is often False, "temporality" is often "present") that annotators may not want to annotate every time.

TYPE: AttributesMappingArg DEFAULT: {}
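
The effect of default attributes can be sketched with plain dictionaries: annotated values win, and defaults only fill the gaps. `apply_default_attributes` is a hypothetical helper written for this illustration, not an edsnlp function.

```python
from typing import Any, Dict


# Hypothetical helper: NOT part of edsnlp, it only illustrates the idea.
def apply_default_attributes(
    found: Dict[str, Any],
    defaults: Dict[str, Any],
) -> Dict[str, Any]:
    """Fill in attributes missing from the annotation with default values."""
    # Values found in the input override the defaults
    return {**defaults, **found}


defaults = {"negated": False, "temporality": "present"}

# The annotator only marked the negation of this span
print(apply_default_attributes({"negated": True}, defaults))

# The annotator left this span without any attribute
print(apply_default_attributes({}, defaults))
```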

notes_as_span_attribute

If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name.

TYPE: Optional[str] DEFAULT: None

split_fragments

Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span.

TYPE: bool DEFAULT: True
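
A BRAT entity may be discontinuous, i.e. made of several character fragments. The sketch below shows the two options on plain `(begin, end)` offsets. It assumes that "concatenated" means a single span running from the first fragment's start to the last fragment's end; `fragments_to_spans` is a hypothetical helper, not an edsnlp function.

```python
from typing import List, Tuple

Fragment = Tuple[int, int]  # (begin, end) character offsets


# Hypothetical helper: NOT part of edsnlp, the merge rule is an assumption.
def fragments_to_spans(
    fragments: List[Fragment],
    split_fragments: bool = True,
) -> List[Fragment]:
    """Return one span per fragment, or a single span covering all of them."""
    if split_fragments:
        return sorted(fragments)
    begin = min(start for start, _ in fragments)
    end = max(stop for _, stop in fragments)
    return [(begin, end)]


# A discontinuous entity such as "T1 disease 0 5;10 15"
fragments = [(0, 5), (10, 15)]
print(fragments_to_spans(fragments, split_fragments=True))
print(fragments_to_spans(fragments, split_fragments=False))
```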

StandoffDoc2DictConverter [source]

Examples

import edsnlp

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
edsnlp.data.write_standoff(
    docs,
    converter="standoff",  # set by default

    # Optional parameters
    span_getter={"ents": True},
    span_attributes=["negation"],
)
# or docs.to_standoff(...) if it's already a
# Stream (edsnlp.core.stream.Stream)

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: Optional[SpanGetterArg] DEFAULT: {'ents': True}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}

ConllDict2DocConverter [source]

TODO

OmopDict2DocConverter [source]

Examples

import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    df,
    converter="omop",

    # Optional parameters
    tokenizer=tokenizer,
    doc_attributes=["note_datetime"],

    # Parameters below should only matter if you plan to import entities
    # from the dataframe. If the data doesn't contain pre-annotated
    # entities, you can ignore these.
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    default_attributes={"negated": False, "temporality": "present"},
)

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object (optional and likely not needed; prefer passing the tokenizer argument directly instead).

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed, since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each JSON entity label.

TYPE: SpanSetterArg DEFAULT: {'ents': True, '*': True}

doc_attributes

Mapping from JSON attributes to Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name.

TYPE: AttributesMappingArg DEFAULT: {'note_datetime': 'note_datetime'}

span_attributes

Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg] DEFAULT: None

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or for frequent attribute values (e.g. "negated" is often False, "temporality" is often "present") that annotators may not want to annotate every time.

TYPE: AttributesMappingArg DEFAULT: {}

OmopDoc2DictConverter [source]

Examples

import edsnlp

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
df = edsnlp.data.to_pandas(
    docs,
    converter="omop",

    # Optional parameters
    span_getter={"ents": True},
    doc_attributes=["note_datetime"],
    span_attributes=["negation", "family"],
)
# or docs.to_pandas(...) if it's already a
# Stream (edsnlp.core.stream.Stream)

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: SpanGetterArg DEFAULT: {'ents': True}

doc_attributes

Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}
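
On the writer side, the mapping goes from extension names to output keys. The sketch below illustrates the rule stated above (nothing exported by default except note_id) with plain dictionaries standing in for Doc extensions; `export_doc_attributes` is a hypothetical helper and the output layout is a simplification, not edsnlp's actual OMOP record.

```python
from typing import Any, Dict


# Hypothetical helper: NOT part of edsnlp, it only illustrates the idea.
def export_doc_attributes(
    doc_extensions: Dict[str, Any],
    mapping: Dict[str, str],
) -> Dict[str, Any]:
    """Export note_id plus the extensions explicitly listed in the mapping."""
    record = {"note_id": doc_extensions.get("note_id")}
    for extension_name, json_key in mapping.items():
        record[json_key] = doc_extensions.get(extension_name)
    return record


extensions = {"note_id": "doc-1", "note_datetime": "2020-01-01", "extra": 42}

# Default: empty mapping, only note_id is exported
print(export_doc_attributes(extensions, {}))

# Explicit mapping, equivalent to doc_attributes=["note_datetime"]
print(export_doc_attributes(extensions, {"note_datetime": "note_datetime"}))
```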

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported.

TYPE: AttributesMappingArg DEFAULT: {}

EntsDoc2DictConverter [source]

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: SpanGetterArg DEFAULT: {'ents': True}

doc_attributes

Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported.

TYPE: AttributesMappingArg DEFAULT: {}