Converters
Data can be read from and written to various sources, such as JSON/BRAT/CSV files or dataframes, which expect a key-value representation rather than Doc objects. For that purpose, we document here a set of converters that can be used to convert between these representations and Doc objects.
Converters can be configured in the `from_*` (or `read_*` in the case of files) and `to_*` (or `write_*` in the case of files) methods, depending on the chosen `converter` argument, which can be:
- a function, in which case it will be interpreted as a custom converter
- a string, in which case it will be interpreted as the name of a pre-defined converter
No converter (`converter=None`)
Except in `read_standoff` and `write_standoff`, the default converter is `None`. When `converter=None`, readers output the raw content of the input data (most often dictionaries) and writers expect dictionaries. This can actually be useful if you plan to use Streams without converting to Doc objects, for instance to parallelize the execution of a function on raw JSON, Parquet files or simple lists.
```python
import edsnlp.data


def complex_func(n):
    return n * n


stream = edsnlp.data.from_iterable(range(20))
stream = stream.map(complex_func)
stream = stream.set_processing(num_cpu_workers=2)
res = list(stream)
```
Custom converter
You can always define your own converter functions to convert between your data and Doc objects.
Reading from a custom schema
```python
import edsnlp, edsnlp.pipes as eds
from spacy.tokens import Doc
from edsnlp.data.converters import get_current_tokenizer
from typing import Dict


def convert_row_to_dict(row: Dict) -> Doc:
    # Tokenizer will be inferred from the pipeline
    doc = get_current_tokenizer()(row["custom_content"])
    doc._.note_id = row["custom_id"]
    doc._.note_datetime = row["custom_datetime"]
    # ...
    return doc


nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.covid())

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    # The dataframe to read from
    dataframe,
    # How to convert JSON-like samples to Doc objects
    converter=convert_row_to_dict,
)
docs = docs.map_pipeline(nlp)
```
Writing to a custom schema
```python
def convert_doc_to_row(doc: Doc) -> Dict:
    return {
        "custom_id": doc._.note_id,
        "custom_content": doc.text,
        "custom_datetime": doc._.note_datetime,
        # ...
    }


# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
docs.write_parquet(
    "path/to/output_folder",
    # How to convert Doc objects to JSON-like samples
    converter=convert_doc_to_row,
)
```
One row per entity
This function can also return a list of dicts, for instance one dict per detected entity, that will be treated as multiple rows in dataframe writers (e.g., `to_pandas`, `to_spark`, `write_parquet`).
```python
from typing import Dict, List


def convert_ents_to_rows(doc: Doc) -> List[Dict]:
    return [
        {
            "note_id": doc._.note_id,
            "ent_text": ent.text,
            "ent_label": ent.label_,
            "custom_datetime": doc._.note_datetime,
            # ...
        }
        for ent in doc.ents
    ]


docs.write_parquet(
    "path/to/output_folder",
    # How to convert entities of Doc objects to JSON-like samples
    converter=convert_ents_to_rows,
)
```
OMOP (`converter="omop"`)
OMOP is a schema used in the medical domain, based on the OMOP Common Data Model. We are mainly interested in the `note` table, which contains the clinical notes, and we deviate from the original schema by adding an optional `entities` column that can be computed from the `note_nlp` table.
Therefore, a complete OMOP-style document would look like this:
```python
{
    "note_id": 0,
    "note_text": "Le patient ...",
    "entities": [
        {
            "note_nlp_id": 0,
            "start_char": 3,
            "end_char": 10,
            "lexical_variant": "patient",
            "note_nlp_source_value": "person",
            # optional fields
            "negated": False,
            "certainty": "probable",
            ...
        },
        ...
    ],
    # optional fields
    "custom_doc_field": "...",
    ...
}
```
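As a sanity check on this schema, the `start_char`/`end_char` offsets of each entity index directly into `note_text`. The snippet below (plain dictionaries, no EDS-NLP required; the sample values are illustrative) verifies that slicing the note with these offsets recovers the `lexical_variant`:

```python
# A minimal OMOP-style record with hypothetical values
record = {
    "note_id": 0,
    "note_text": "Le patient est malade.",
    "entities": [
        {
            "note_nlp_id": 0,
            "start_char": 3,
            "end_char": 10,
            "lexical_variant": "patient",
            "note_nlp_source_value": "person",
        }
    ],
}

for ent in record["entities"]:
    # Slicing the note text with the character offsets yields the mention
    span = record["note_text"][ent["start_char"]:ent["end_char"]]
    assert span == ent["lexical_variant"]
```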
Converting OMOP data to Doc objects
Examples
```python
# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    df,
    converter="omop",
    # Optional parameters
    tokenizer=tokenizer,
    doc_attributes=["note_datetime"],
    # Parameters below should only matter if you plan to import entities
    # from the dataframe. If the data doesn't contain pre-annotated
    # entities, you can ignore these.
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    default_attributes={"negated": False, "temporality": "present"},
)
```
Parameters

| PARAMETER | DESCRIPTION |
|---|---|
| `nlp` | The pipeline object (optional and likely not needed; prefer the `tokenizer` argument). |
| `tokenizer` | The tokenizer instance used to tokenize the documents. Likely not needed since, by default, the tokenizer of the current context is used. |
| `span_setter` | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the `doc.ents` attribute. |
| `doc_attributes` | Mapping from JSON attributes to Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name. |
| `span_attributes` | Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. |
| `default_attributes` | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporality" is often "present"), that annotators may not want to annotate every time. |
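To make the `default_attributes` behaviour concrete, here is a minimal sketch (plain dictionaries, not the actual EDS-NLP implementation) of how attributes found on a span can take precedence over the configured defaults:

```python
# Defaults used when an attribute was not annotated on a span
defaults = {"negated": False, "temporality": "present"}

# Attributes actually found on a span in the input data
annotated = {"negated": True}

# Annotated values override the defaults; missing ones fall back
attrs = {**defaults, **annotated}
print(attrs)  # {'negated': True, 'temporality': 'present'}
```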
Converting Doc objects to OMOP data
Examples
```python
# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
df = edsnlp.data.to_pandas(
    docs,
    converter="omop",
    # Optional parameters
    span_getter={"ents": True},
    doc_attributes=["note_datetime"],
    span_attributes=["negation", "family"],
)
# or docs.to_pandas(...) if it's already a
# [Stream][edsnlp.core.stream.Stream]
```
Parameters

| PARAMETER | DESCRIPTION |
|---|---|
| `span_getter` | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the `doc.ents` attribute. |
| `doc_attributes` | Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except `note_id`. |
| `span_attributes` | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. |
Standoff (`converter="standoff"`)
Standoff refers mostly to the BRAT standoff format, which doesn't specify how annotations should be stored in a JSON-like schema. We use the following schema:
```python
{
    "doc_id": 0,
    "text": "Le patient ...",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{
                "start": 0,
                "end": 10
            }],
            "attributes": {
                "negated": True,
                "certainty": "probable"
            }
        },
        ...
    ]
}
```
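The `fragments` list makes it possible to represent discontinuous mentions. As an illustration (plain dictionaries, illustrative values; not EDS-NLP code), the mention text can be recovered by slicing `text` with each fragment's offsets:

```python
annotation = {
    "doc_id": 0,
    "text": "Le patient ...",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{"start": 0, "end": 10}],
            "attributes": {"negated": True, "certainty": "probable"},
        }
    ],
}

ent = annotation["entities"][0]
# Join the fragment texts (a single fragment here) into one mention string
mention = " ".join(
    annotation["text"][f["start"]:f["end"]] for f in ent["fragments"]
)
print(mention)  # Le patient
```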
Converting Standoff data to Doc objects
Why does BRAT/Standoff need a converter?
You may wonder: why do I need a converter? Since BRAT is already an NLP-oriented format, it should be straightforward to convert it to a Doc object.
Indeed, we do provide a default converter for the BRAT standoff format, but we also acknowledge that there may be more than one way to convert a standoff document to a Doc object. For instance, an annotated span may be used to represent a relation between two smaller included entities, or another entity scope, etc.
In such cases, we recommend you use a custom converter as described here.
Examples
```python
# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.read_standoff(
    "path/to/standoff",
    converter="standoff",  # set by default
    # Optional parameters
    tokenizer=tokenizer,
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    keep_raw_attribute_values=False,
    default_attributes={"negated": False, "temporality": "present"},
)
```
Parameters

| PARAMETER | DESCRIPTION |
|---|---|
| `nlp` | The pipeline object (optional and likely not needed; prefer the `tokenizer` argument). |
| `tokenizer` | The tokenizer instance used to tokenize the documents. Likely not needed since, by default, the tokenizer of the current context is used. |
| `span_setter` | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the `doc.ents` attribute. |
| `span_attributes` | Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. |
| `keep_raw_attribute_values` | Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). |
| `default_attributes` | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporality" is often "present"), that annotators may not want to annotate every time. |
| `notes_as_span_attribute` | If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name. |
| `split_fragments` | Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span. |
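The effect of `keep_raw_attribute_values` can be sketched with a hypothetical helper (not the actual EDS-NLP code): with `False`, boolean-looking BRAT strings such as `"True"` are coerced to Python objects, while other values pass through unchanged; with `True`, everything is kept as a string:

```python
def convert_attr(value: str, keep_raw: bool = False):
    # Hypothetical coercion mirroring keep_raw_attribute_values=False:
    # map boolean-looking strings to booleans, keep anything else as-is
    if keep_raw:
        return value
    return {"true": True, "false": False}.get(value.lower(), value)


print(convert_attr("True"))                 # True (a bool)
print(convert_attr("probable"))             # probable (unchanged)
print(convert_attr("True", keep_raw=True))  # True (still a string)
```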
Converting Doc objects to Standoff data
Examples
```python
# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
edsnlp.data.write_standoff(
    docs,
    converter="standoff",  # set by default
    # Optional parameters
    span_getter={"ents": True},
    span_attributes=["negation"],
)
# or docs.write_standoff(...) if it's already a
# [Stream][edsnlp.core.stream.Stream]
```
Parameters

| PARAMETER | DESCRIPTION |
|---|---|
| `span_getter` | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the `doc.ents` attribute. |
| `span_attributes` | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. |
Entities (`converter="ents"`)
We also provide a simple one-way (export) converter to convert Doc objects into lists of dictionaries, one per entity, that can be used to write to a dataframe. The schema of each produced row is the following:
```python
{
    "note_id": 0,
    "start": 3,
    "end": 10,
    "label": "drug",
    "lexical_variant": "patient",
    # Optional fields
    "negated": False,
    "certainty": "probable",
    ...
}
```
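For intuition, the snippet below (plain dictionaries, no EDS-NLP required; the field values are illustrative) shows how one row per entity, matching the schema above, can be derived from an OMOP-like record:

```python
# A hypothetical OMOP-like record to export, one row per entity
record = {
    "note_id": 0,
    "note_text": "Le patient est malade.",
    "entities": [
        {"start_char": 3, "end_char": 10, "note_nlp_source_value": "person"},
    ],
}

rows = [
    {
        "note_id": record["note_id"],
        "start": ent["start_char"],
        "end": ent["end_char"],
        "label": ent["note_nlp_source_value"],
        # The mention text, recovered from the offsets
        "lexical_variant": record["note_text"][ent["start_char"]:ent["end_char"]],
    }
    for ent in record["entities"]
]
print(rows[0]["lexical_variant"])  # patient
```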
Parameters

| PARAMETER | DESCRIPTION |
|---|---|
| `span_getter` | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the `doc.ents` attribute. |
| `doc_attributes` | Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except `note_id`. |
| `span_attributes` | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. |