edsnlp.data.converters
Converters are used to convert documents between Python dictionaries and Doc objects. There are two types of converters: readers and writers. Readers convert dictionaries to Doc objects, and writers convert Doc objects to dictionaries.
AttributesMappingArg
Bases: Validated
A span attribute mapping (it can also be a list, in which case the names are kept unchanged).
For instance:
doc_attributes="note_datetime" will map the note_datetime JSON attribute to the note_datetime extension.
span_attributes=["negation", "family"] will map the negation and family JSON attributes to the negation and family extensions.
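The accepted shapes of such a mapping can be illustrated with a minimal sketch. The helper `normalize_attribute_mapping` below is hypothetical and is not part of edsnlp; it only shows how a string, a list or a dictionary could all be reduced to a source-name to target-name dictionary.

```python
from typing import Dict, List, Optional, Union


# Hypothetical helper (not an edsnlp function): it only illustrates the
# shapes that an attribute mapping argument may take.
def normalize_attribute_mapping(
    value: Optional[Union[str, List[str], Dict[str, str]]],
) -> Dict[str, str]:
    """Return a {source_name: target_name} dictionary."""
    if value is None:
        return {}
    if isinstance(value, str):
        # A single name is kept unchanged on both sides
        return {value: value}
    if isinstance(value, dict):
        # A dictionary already states the renaming explicitly
        return dict(value)
    # A list keeps every name unchanged
    return {name: name for name in value}


print(normalize_attribute_mapping("note_datetime"))
print(normalize_attribute_mapping(["negation", "family"]))
print(normalize_attribute_mapping({"negation": "negated"}))
```

Under this assumption, a list is simply shorthand for a dictionary whose keys and values are identical.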
StandoffDict2DocConverter
Why does BRAT/Standoff need a converter?
You may wonder: why do I need a converter? Since BRAT is already an NLP-oriented format, it should be straightforward to convert it to a Doc object.
Indeed, we do provide a default converter for the BRAT standoff format, but we also acknowledge that there may be more than one way to convert a standoff document to a Doc object. For instance, an annotated span may be used to represent a relation between two smaller included entities, or another entity scope, etc.
In such cases, we recommend you use a custom converter as described here.
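As an illustration of the kind of logic a custom converter might contain, here is a minimal pure-Python sketch of the first case above: an enclosing span that stands for a relation between the smaller entities it contains. The dictionary layout (`label`, `begin`, `end`) and the function `group_nested_entities` are assumptions made for this example, not the edsnlp data schema or API.

```python
from typing import Any, Dict, List

text = "paracetamol 500 mg given for headache"

# Hypothetical annotation layout: one dict per annotated span
entities = [
    {"label": "drug", "begin": 0, "end": 11},
    {"label": "dose", "begin": 12, "end": 18},
    {"label": "prescription", "begin": 0, "end": 18},  # enclosing span
    {"label": "symptom", "begin": 29, "end": 37},
]


def group_nested_entities(
    entities: List[Dict[str, Any]],
    relation_label: str,
) -> List[Dict[str, Any]]:
    """Pair each enclosing `relation_label` span with the entities inside it."""
    outer = [e for e in entities if e["label"] == relation_label]
    inner = [e for e in entities if e["label"] != relation_label]
    relations = []
    for span in outer:
        # An entity is a member if it lies entirely within the enclosing span
        members = [
            e["label"]
            for e in inner
            if span["begin"] <= e["begin"] and e["end"] <= span["end"]
        ]
        relations.append({"label": relation_label, "members": members})
    return relations


relations = group_nested_entities(entities, relation_label="prescription")
print(relations)
```

A real converter would then decide how to store such relations on the Doc object, which is precisely the choice that the default converter cannot make for you.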
Examples
import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.read_standoff(
    "path/to/standoff",
    converter="standoff",  # set by default
    # Optional parameters
    tokenizer=tokenizer,
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    keep_raw_attribute_values=False,
    default_attributes={"negated": False, "temporality": "present"},
)
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline object (optional and likely not needed, prefer to use the
|
tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer :
TYPE: |
span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the TYPE: |
span_attributes | Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. TYPE: |
keep_raw_attribute_values | Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). TYPE: |
default_attributes | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or for frequent attribute values (e.g. "negated" is often False, "temporal" is often "present") that annotators may not want to annotate every time. TYPE: |
notes_as_span_attribute | If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name. TYPE: |
split_fragments | Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span. TYPE: |
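
The effect of `keep_raw_attribute_values` and `split_fragments` can be sketched in plain Python. Both helpers below are hypothetical and make their own assumptions: the exact conversion rules used by edsnlp are not stated here, and the sketch assumes that "concatenated into a single span" means a span running from the first fragment's start to the last fragment's end.

```python
from typing import Any, Dict, List, Tuple


def parse_raw_value(value: str) -> Any:
    """Convert a raw string to a bool, int or float when possible.

    Hypothetical conversion rules, for illustration only.
    """
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    # Anything else is kept as a string
    return value


def fragments_to_offsets(
    fragments: List[Dict[str, int]],
    split_fragments: bool,
) -> List[Tuple[int, int]]:
    """Return one (begin, end) pair per fragment, or one covering pair."""
    if split_fragments:
        return [(f["begin"], f["end"]) for f in fragments]
    # Assumption: a single span covering all the fragments
    begin = min(f["begin"] for f in fragments)
    end = max(f["end"] for f in fragments)
    return [(begin, end)]


fragments = [{"begin": 0, "end": 4}, {"begin": 10, "end": 15}]
print(fragments_to_offsets(fragments, split_fragments=True))
print(fragments_to_offsets(fragments, split_fragments=False))
print(parse_raw_value("True"), parse_raw_value("3"), parse_raw_value("past"))
```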
StandoffDoc2DictConverter
Examples
import edsnlp

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
edsnlp.data.write_standoff(
    docs,
    converter="standoff",  # set by default
    # Optional parameters
    span_getter={"ents": True},
    span_attributes=["negation"],
)
# or docs.to_standoff(...) if it's already a
# [Stream][edsnlp.core.stream.Stream]
Parameters
PARAMETER | DESCRIPTION |
---|---|
span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the TYPE: |
span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported, except TYPE: |
ConllDict2DocConverter
TODO
OmopDict2DocConverter
Examples
import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    df,
    converter="omop",
    # Optional parameters
    tokenizer=tokenizer,
    doc_attributes=["note_datetime"],
    # Parameters below should only matter if you plan to import entities
    # from the dataframe. If the data doesn't contain pre-annotated
    # entities, you can ignore these.
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    default_attributes={"negated": False, "temporality": "present"},
)
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline object (optional and likely not needed, prefer to use the
|
tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer :
TYPE: |
span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the TYPE: |
doc_attributes | Mapping from JSON attributes to Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name. TYPE: |
span_attributes | Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. TYPE: |
default_attributes | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or for frequent attribute values (e.g. "negated" is often False, "temporal" is often "present") that annotators may not want to annotate every time. TYPE: |
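
How `span_attributes` and `default_attributes` might combine on import can be sketched as follows. The function `import_span_attributes` is hypothetical and only mirrors the behaviour described in the table: input attributes are renamed according to the mapping, and any extension left without a value falls back to its default.

```python
from typing import Any, Dict


def import_span_attributes(
    raw: Dict[str, Any],
    span_attributes: Dict[str, str],
    default_attributes: Dict[str, Any],
) -> Dict[str, Any]:
    """Return the extension values for one span (illustrative only)."""
    # Start from the defaults, which are keyed by extension name
    values = dict(default_attributes)
    for source_name, extension_name in span_attributes.items():
        if source_name in raw:
            # An annotated value overrides the default
            values[extension_name] = raw[source_name]
    return values


span_attributes = {"negation": "negated"}
default_attributes = {"negated": False, "temporality": "present"}

annotated = import_span_attributes(
    {"negation": True}, span_attributes, default_attributes
)
unannotated = import_span_attributes({}, span_attributes, default_attributes)
print(annotated)
print(unannotated)
```

With these defaults, annotators only need to mark the spans that are negated or not in the present; every other span receives the default values.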
OmopDoc2DictConverter
Examples
import edsnlp

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
df = edsnlp.data.to_pandas(
    docs,
    converter="omop",
    # Optional parameters
    span_getter={"ents": True},
    doc_attributes=["note_datetime"],
    span_attributes=["negation", "family"],
)
# or docs.to_pandas(...) if it's already a
# [Stream][edsnlp.core.stream.Stream]
Parameters
PARAMETER | DESCRIPTION |
---|---|
span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the TYPE: |
doc_attributes | Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except TYPE: |
span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. TYPE: |
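
The export direction can be sketched in the same spirit. The function `export_record` and the record layout (`entities`, `begin`, `end`, `label`, `extensions`) are assumptions made for this example rather than the actual output schema of the converter; the point is only that extensions absent from the mappings are not written out.

```python
from typing import Any, Dict, List


def export_record(
    doc_values: Dict[str, Any],
    spans: List[Dict[str, Any]],
    doc_attributes: Dict[str, str],
    span_attributes: Dict[str, str],
) -> Dict[str, Any]:
    """Build an output dictionary keeping only the mapped attributes."""
    record: Dict[str, Any] = {
        target: doc_values[source]
        for source, target in doc_attributes.items()
        if source in doc_values
    }
    record["entities"] = [
        {
            "begin": span["begin"],
            "end": span["end"],
            "label": span["label"],
            # Only the extensions listed in the mapping are exported
            **{
                target: span["extensions"][source]
                for source, target in span_attributes.items()
                if source in span["extensions"]
            },
        }
        for span in spans
    ]
    return record


doc_values = {"note_datetime": "2021-03-04", "internal_score": 0.3}
spans = [
    {
        "begin": 0,
        "end": 5,
        "label": "disease",
        "extensions": {"negation": True, "family": False, "hypothesis": False},
    }
]
record = export_record(
    doc_values,
    spans,
    doc_attributes={"note_datetime": "note_datetime"},
    span_attributes={"negation": "negation", "family": "family"},
)
print(record)
```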
EntsDoc2DictConverter
Parameters
PARAMETER | DESCRIPTION |
---|---|
span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the TYPE: |
doc_attributes | Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except TYPE: |
span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. TYPE: |