Converters

Data can be read from and written to various sources, like JSON/BRAT/CSV files or dataframes, which expect a key-value representation rather than Doc objects. For that purpose, we document here a set of converters that can be used to convert between these representations and Doc objects.

Converters can be configured in the from_* (or read_* in the case of files) and to_* (or write_* in the case of files) methods, through the converter argument, which can be:

  • a function, in which case it will be interpreted as a custom converter
  • a string, in which case it will be interpreted as the name of a pre-defined converter
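
For instance, the same reader can take either form (a sketch; convert_row_to_doc refers to the kind of custom function defined in the "Custom converter" section below):

import edsnlp.data

# A pre-defined converter, referenced by name
docs = edsnlp.data.read_standoff("path/to/standoff", converter="standoff")

# A custom converter, passed as a function
docs = edsnlp.data.from_pandas(dataframe, converter=convert_row_to_doc)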

No converter (converter=None)

Except in read_standoff and write_standoff, the default converter is None. When converter=None, readers output the raw content of the input data (most often dictionaries) and writers expect dictionaries. This can actually be useful if you plan to use Streams without converting to Doc objects, for instance to parallelize the execution of a function on raw JSON files, Parquet files or simple lists.

import edsnlp.data


def complex_func(n):
    # Some computation on raw items (here plain ints, no Doc objects involved)
    return n * n


stream = edsnlp.data.from_iterable(range(20))
stream = stream.map(complex_func)
# Run the function on 2 CPU workers in parallel
stream = stream.set_processing(num_cpu_workers=2)
res = list(stream)
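
Writers behave symmetrically: with converter=None, they expect dictionaries. A minimal sketch, assuming a JSON-lines output file (the records and path below are illustrative):

import edsnlp.data

records = [{"id": i, "square": i * i} for i in range(5)]
stream = edsnlp.data.from_iterable(records)
# No converter: the writer receives the dictionaries as-is
edsnlp.data.write_json(stream, "path/to/output.jsonl", lines=True)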

Custom converter

You can always define your own converter functions to convert between your data and Doc objects.

Reading from a custom schema

import edsnlp, edsnlp.pipes as eds
from spacy.tokens import Doc
from edsnlp.data.converters import get_current_tokenizer
from typing import Dict, List

def convert_row_to_doc(row: Dict) -> Doc:
    # The tokenizer will be inferred from the pipeline
    doc = get_current_tokenizer()(row["custom_content"])
    doc._.note_id = row["custom_id"]
    doc._.note_datetime = row["custom_datetime"]
    # ...
    return doc

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.covid())

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    # The dataframe to read from
    dataframe,
    # How to convert each row to a Doc object
    converter=convert_row_to_doc,
)
docs = docs.map_pipeline(nlp)

Writing to a custom schema

def convert_doc_to_row(doc: Doc) -> Dict:
    return {
        "custom_id": doc._.id,
        "custom_content": doc.text,
        "custom_datetime": doc._.note_datetime,
        # ...
    }

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
docs.write_parquet(
    "path/to/output_folder",
    # How to convert Doc objects to JSON-like samples
    converter=convert_doc_to_row,
)

One row per entity

The converter function can also return a list of dicts, for instance one dict per detected entity, which will be treated as multiple rows by dataframe writers (e.g., to_pandas, to_spark, write_parquet).

def convert_ents_to_rows(doc: Doc) -> List[Dict]:
    return [
        {
            "note_id": doc._.id,
            "ent_text": ent.text,
            "ent_label": ent.label_,
            "custom_datetime": doc._.note_datetime,
            # ...
        }
        for ent in doc.ents
    ]


docs.write_parquet(
    "path/to/output_folder",
    # How to convert entities of Doc objects to JSON-like samples
    converter=convert_ents_to_rows,
)
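
To sanity-check the export, the output folder can be read back as a plain dataframe (a sketch, assuming pandas is installed with a Parquet engine such as pyarrow):

import pandas as pd

df = pd.read_parquet("path/to/output_folder")
# One row per entity: note_id, ent_text, ent_label, custom_datetime, ...
print(df.columns)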

OMOP (converter="omop")

OMOP is a schema used in the medical domain, based on the OMOP Common Data Model. We are mainly interested in the note table, which contains the clinical notes, and we deviate from the original schema by adding an optional entities column that can be computed from the note_nlp table.

Therefore, a complete OMOP-style document would look like this:

{
  "note_id": 0,
  "note_text": "Le patient ...",
  "entities": [
    {
      "note_nlp_id": 0,
      "start_char": 3,
      "end_char": 10,
      "lexical_variant": "patient",
      "note_nlp_source_value": "person",

      # optional fields
      "negated": False,
      "certainty": "probable",
      ...
    },
    ...
  ],

  # optional fields
  "custom_doc_field": "..."
  ...
}
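
As a quick sketch of how this schema maps to a Doc object, the record below (illustrative values) can be fed directly to the converter through from_iterable:

import edsnlp.data

record = {
    "note_id": 0,
    "note_text": "Le patient ...",
    "entities": [
        {
            "note_nlp_id": 0,
            "start_char": 3,
            "end_char": 10,
            "lexical_variant": "patient",
            "note_nlp_source_value": "person",
            "negated": False,
        }
    ],
}

doc = list(edsnlp.data.from_iterable([record], converter="omop"))[0]
print(doc._.note_id)  # 0
print(doc.ents)       # (patient,)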

Converting OMOP data to Doc objects

Examples

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    df,
    converter="omop",

    # Optional parameters
    tokenizer=tokenizer,
    doc_attributes=["note_datetime"],

    # Parameters below should only matter if you plan to import entities
    # from the dataframe. If the data doesn't contain pre-annotated
    # entities, you can ignore these.
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    default_attributes={"negated": False, "temporality": "present"},
)

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object (optional and likely not needed; prefer using the tokenizer argument directly instead).

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each JSON entity label.

TYPE: SpanSetterArg DEFAULT: {'ents': True, '*': True}

doc_attributes

Mapping from JSON attributes to additional Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name.

TYPE: AttributesMappingArg DEFAULT: {'note_datetime': 'note_datetime'}

span_attributes

Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg] DEFAULT: None

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporality" is often "present") that annotators may not want to annotate every time.

TYPE: AttributesMappingArg DEFAULT: {}

Converting Doc objects to OMOP data[source]

Examples

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
df = edsnlp.data.to_pandas(
    docs,
    converter="omop",

    # Optional parameters
    span_getter={"ents": True},
    doc_attributes=["note_datetime"],
    span_attributes=["negation", "family"],
)
# or docs.to_pandas(...) if docs is already a Stream
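
The resulting dataframe should contain one row per document; a sketch of what it could look like (contents are illustrative):

print(df.loc[0, "note_id"])   # 0
print(df.loc[0, "entities"])  # [{"note_nlp_id": 0, "start_char": 3, ...}, ...]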

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: SpanGetterArg DEFAULT: {'ents': True}

doc_attributes

Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported.

TYPE: AttributesMappingArg DEFAULT: {}

Standoff (converter="standoff")

Standoff refers mostly to the BRAT standoff format, but the name doesn't specify how the annotations should be stored in a JSON-like schema. We use the following schema:

{
  "doc_id": 0,
  "text": "Le patient ...",
  "entities": [
    {
      "entity_id": 0,
      "label": "drug",
      "fragments": [{
        "start": 0,
        "end": 10
      }],
      "attributes": {
        "negated": True,
        "certainty": "probable"
      }
    },
    ...
  ]
}
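
As with the OMOP converter, a dict following this schema can be converted directly; the record below (illustrative values) is fed to the converter through from_iterable:

import edsnlp.data

record = {
    "doc_id": 0,
    "text": "Le patient ...",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{"start": 0, "end": 10}],
            "attributes": {"negated": True, "certainty": "probable"},
        }
    ],
}

doc = list(edsnlp.data.from_iterable([record], converter="standoff"))[0]
print(doc.spans["drug"])  # [Le patient]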

Converting Standoff data to Doc objects

Why does BRAT/Standoff need a converter?

You may wonder: why do I need a converter? Since BRAT is already an NLP-oriented format, it should be straightforward to convert it to a Doc object.

Indeed, we do provide a default converter for the BRAT standoff format, but we also acknowledge that there may be more than one way to convert a standoff document to a Doc object. For instance, an annotated span may be used to represent a relation between two smaller included entities, the scope of another entity, etc.

In such cases, we recommend you use a custom converter as described above.

Examples

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.read_standoff(
    "path/to/standoff",
    converter="standoff",  # set by default

    # Optional parameters
    tokenizer=tokenizer,
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    keep_raw_attribute_values=False,
    default_attributes={"negated": False, "temporality": "present"},
)

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object (optional and likely not needed; prefer using the tokenizer argument directly instead).

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer] DEFAULT: None

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each JSON entity label.

TYPE: SpanSetterArg DEFAULT: {'ents': True, '*': True}

span_attributes

Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg] DEFAULT: None

keep_raw_attribute_values

Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans).

TYPE: bool DEFAULT: False

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporality" is often "present") that annotators may not want to annotate every time.

TYPE: AttributesMappingArg DEFAULT: {}

notes_as_span_attribute

If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name.

TYPE: Optional[str] DEFAULT: None

split_fragments

Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span.

TYPE: bool DEFAULT: True

Converting Doc objects to Standoff data

Examples

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
edsnlp.data.write_standoff(
    docs,
    # Path to the output directory
    "path/to/output_folder",
    converter="standoff",  # set by default

    # Optional parameters
    span_getter={"ents": True},
    span_attributes=["negation"],
)
# or docs.to_standoff(...) if docs is already a Stream
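
On disk, the BRAT standoff format stores each document as a pair of files: a .txt file with the raw text and an .ann file with the annotations. The layout below is a hypothetical example of what the output folder could contain (actual file names depend on the document identifiers):

path/to/output_folder/
├── doc-1.txt
├── doc-1.ann
├── doc-2.txt
└── doc-2.ann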

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: Optional[SpanGetterArg] DEFAULT: {'ents': True}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported.

TYPE: AttributesMappingArg DEFAULT: {}

Entities (converter="ents")

We also provide a simple one-way (export-only) converter to convert Doc objects into a list of dictionaries, one per entity, that can be used to write to a dataframe. The schema of each produced row is the following:

{
    "note_id": 0,
    "start": 3,
    "end": 10,
    "label": "drug",
    "lexical_variant": "patient",

    # Optional fields
    "negated": False,
    "certainty": "probable"
    ...
}
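
For instance, to export one row per entity to a pandas dataframe (a sketch mirroring the OMOP example above):

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
df = edsnlp.data.to_pandas(
    docs,
    converter="ents",

    # Optional parameters
    span_getter={"ents": True},
    span_attributes=["negation"],
)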

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to use when getting the spans from the documents. Defaults to getting the spans in the ents attribute.

TYPE: SpanGetterArg DEFAULT: {'ents': True}

doc_attributes

Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except note_id.

TYPE: AttributesMappingArg DEFAULT: {}

span_attributes

Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported.

TYPE: AttributesMappingArg DEFAULT: {}