edsnlp.data.converters
Converters are used to convert documents between Python dictionaries and Doc objects. There are two types of converters: readers and writers. Readers convert dictionaries to Doc objects, and writers convert Doc objects to dictionaries.
AttributesMappingArg
Bases: Validated
A span attribute mapping (can be a list too to keep the same names).
For instance:
doc_attributes="note_datetime" will map the note_datetime JSON attribute to the note_datetime extension.
span_attributes=["negation", "family"] will map the negation and family JSON attributes to the negation and family extensions.
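The three accepted shapes (a single name, a list of names, or a mapping) can all be reduced to one source-to-target dictionary. The sketch below illustrates that idea only; `normalize_attributes_mapping` is a hypothetical helper written for this example, not a function of the library.

```python
from typing import Dict, Mapping, Sequence, Union

AttributesMapping = Union[str, Sequence[str], Mapping[str, str]]


def normalize_attributes_mapping(value: AttributesMapping) -> Dict[str, str]:
    """Turn a name, a list of names or a mapping into a source -> target dict.

    Hypothetical helper, for illustration only.
    """
    if isinstance(value, str):
        # A single name maps to itself
        return {value: value}
    if isinstance(value, Mapping):
        # An explicit mapping is kept as provided
        return dict(value)
    # A list keeps the same name on both sides
    return {name: name for name in value}


print(normalize_attributes_mapping("note_datetime"))
print(normalize_attributes_mapping(["negation", "family"]))
print(normalize_attributes_mapping({"negation": "negated"}))
```

With a mapping such as `{"negation": "negated"}`, the source attribute and the target extension can have different names, which is the case the list form cannot express.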
StandoffDict2DocConverter
Why does BRAT/Standoff need a converter?
You may wonder: why do I need a converter? Since BRAT is already an NLP-oriented format, it should be straightforward to convert it to a Doc object.
Indeed, we do provide a default converter for the BRAT standoff format, but we also acknowledge that there may be more than one way to convert a standoff document to a Doc object. For instance, an annotated span may be used to represent a relation between two smaller included entities, or another entity scope, etc.
In such cases, we recommend you use a custom converter as described here.
Examples
import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.read_standoff(
    "path/to/standoff",
    converter="standoff",  # set by default
    # Optional parameters
    tokenizer=tokenizer,
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    keep_raw_attribute_values=False,
    default_attributes={"negated": False, "temporality": "present"},
)
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| nlp | The pipeline object (optional and likely not needed, prefer to use the |
| tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer. |
| span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the |
| span_attributes | Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. |
| keep_raw_attribute_values | Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). |
| default_attributes | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporal" is often "present"), that annotators may not want to annotate every time. |
| notes_as_span_attribute | If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name. |
| split_fragments | Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span. |
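To make the interplay between span_attributes, keep_raw_attribute_values and default_attributes more concrete, here is a minimal sketch in plain Python. It is not the converter's implementation: `parse_raw_value` and `import_span_attributes` are hypothetical helpers, the use of `ast.literal_eval` for value conversion is an assumption, and so is the choice to skip attributes that are absent from an explicit mapping.

```python
import ast
from typing import Any, Dict, Optional


def parse_raw_value(raw: str) -> Any:
    """Best-effort conversion of a raw string into a Python object (assumption)."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: keep the original string
        return raw


def import_span_attributes(
    raw_attributes: Dict[str, str],
    span_attributes: Optional[Dict[str, str]] = None,
    default_attributes: Optional[Dict[str, Any]] = None,
    keep_raw_attribute_values: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: compute the extension values of one span."""
    # Start from the defaults, so unannotated spans still get a value
    result = dict(default_attributes or {})
    for source_name, raw in raw_attributes.items():
        if span_attributes is None:
            # No mapping given: import under the same name
            target = source_name
        elif source_name in span_attributes:
            target = span_attributes[source_name]
        else:
            # Assumption of this sketch: unmapped attributes are skipped
            continue
        result[target] = raw if keep_raw_attribute_values else parse_raw_value(raw)
    return result


defaults = {"negated": False, "temporality": "present"}
mapping = {"negation": "negated"}

# A span annotated with negation=True in the input
print(import_span_attributes({"negation": "True"}, mapping, defaults))
# A span without any annotated attribute falls back to the defaults
print(import_span_attributes({}, mapping, defaults))
```

Passing `keep_raw_attribute_values=True` in this sketch would leave `"True"` as a string instead of converting it to the boolean `True`.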
StandoffDoc2DictConverter
Examples
import edsnlp

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
edsnlp.data.write_standoff(
    docs,
    converter="standoff",  # set by default
    # Optional parameters
    span_getter={"ents": True},
    span_attributes=["negation"],
)
# or docs.to_standoff(...) if it's already a
# Stream (edsnlp.core.stream.Stream)
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the |
| span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported, except |
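As a rough picture of what exporting spans and selected attributes to the standoff format involves, the sketch below renders BRAT text-bound annotations (`T` lines) and attributes (`A` lines) from toy objects. `ToySpan` and `to_ann_lines` are hypothetical names introduced for this example; the real converter works on Doc objects and may handle identifiers, multi-line text and value formatting differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToySpan:
    # Hypothetical stand-in for a span: character offsets, label, attributes
    start: int
    end: int
    label: str
    attributes: Dict[str, Any] = field(default_factory=dict)


def to_ann_lines(
    text: str, spans: List[ToySpan], span_attributes: Dict[str, str]
) -> List[str]:
    """Render T lines (entities) and A lines (attributes) of a .ann file."""
    lines = []
    attr_id = 1
    for i, span in enumerate(spans, start=1):
        # Text-bound annotation: id, label with offsets, covered text
        covered = text[span.start:span.end]
        lines.append(f"T{i}\t{span.label} {span.start} {span.end}\t{covered}")
        # Only the attributes listed in span_attributes are exported
        for ext_name, brat_name in span_attributes.items():
            if ext_name in span.attributes:
                value = span.attributes[ext_name]
                lines.append(f"A{attr_id}\t{brat_name} T{i} {value}")
                attr_id += 1
    return lines


text = "No fever today."
spans = [ToySpan(3, 8, "symptom", {"negation": True, "family": False})]
for line in to_ann_lines(text, spans, {"negation": "negation"}):
    print(line)
```

In this example the `family` attribute is left out of the output because it is not part of the `span_attributes` mapping.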
ConllDict2DocConverter
TODO
OmopDict2DocConverter
Examples
import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    df,
    converter="omop",
    # Optional parameters
    tokenizer=tokenizer,
    doc_attributes=["note_datetime"],
    # Parameters below should only matter if you plan to import entities
    # from the dataframe. If the data doesn't contain pre-annotated
    # entities, you can ignore these.
    span_setter={"ents": True, "*": True},
    span_attributes={"negation": "negated"},
    default_attributes={"negated": False, "temporality": "present"},
)
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| nlp | The pipeline object (optional and likely not needed, prefer to use the |
| tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer. |
| span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the |
| doc_attributes | Mapping from JSON attributes to additional Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name. |
| span_attributes | Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. |
| default_attributes | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporal" is often "present"), that annotators may not want to annotate every time. |
OmopDoc2DictConverter
Examples
import edsnlp

# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
df = edsnlp.data.to_pandas(
    docs,
    converter="omop",
    # Optional parameters
    span_getter={"ents": True},
    doc_attributes=["note_datetime"],
    span_attributes=["negation", "family"],
)
# or docs.to_pandas(...) if it's already a
# Stream (edsnlp.core.stream.Stream)
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the |
| doc_attributes | Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except |
| span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. |
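The difference between doc_attributes (document-level values) and span_attributes (entity-level values) on export can be sketched with toy objects. Everything below is illustrative: `ToyDoc`, `ToySpan` and `doc_to_record` are hypothetical, and the key names (`note_text`, `entities`, `start_char`, `end_char`, `label`) are assumptions chosen for this example that may differ from what the converter actually emits.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToySpan:
    # Hypothetical stand-in for an entity span
    start: int
    end: int
    label: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToyDoc:
    # Hypothetical stand-in for a Doc with extensions and entities
    text: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    ents: List[ToySpan] = field(default_factory=list)


def doc_to_record(
    doc: ToyDoc, doc_attributes: Dict[str, str], span_attributes: Dict[str, str]
) -> Dict[str, Any]:
    """Export a document as a nested dict (key names are assumptions)."""
    record: Dict[str, Any] = {"note_text": doc.text}
    # Document-level values: only those listed in doc_attributes
    for ext_name, key in doc_attributes.items():
        record[key] = doc.attributes.get(ext_name)
    # Entity-level values: only those listed in span_attributes
    record["entities"] = [
        {
            "start_char": span.start,
            "end_char": span.end,
            "label": span.label,
            **{key: span.attributes.get(ext) for ext, key in span_attributes.items()},
        }
        for span in doc.ents
    ]
    return record


doc = ToyDoc(
    text="No fever today.",
    attributes={"note_datetime": "2024-01-01", "author": "unknown"},
    ents=[ToySpan(3, 8, "symptom", {"negation": True, "family": False, "history": False})],
)
record = doc_to_record(
    doc,
    doc_attributes={"note_datetime": "note_datetime"},
    span_attributes={"negation": "negation", "family": "family"},
)
print(record)
```

The `author` document attribute and the `history` span attribute do not appear in the record, since neither is listed in the corresponding mapping.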
EntsDoc2DictConverter
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the |
| doc_attributes | Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except |
| span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. |
MarkupToDocConverter
Examples
import edsnlp

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
# If input items are dicts, the converter expects a "text" key/column.
docs = list(
    edsnlp.data.from_iterable(
        [
            "This [is](VERB negation=True) not a [test](NOUN).",
            "This is another [test](NOUN).",
        ],
        converter="markup",
        span_setter="entities",
    ),
)
print(docs[0].spans["entities"])
# Out: [is, test]
You can also use it directly on a string:
from edsnlp.data.converters import MarkupToDocConverter

converter = MarkupToDocConverter(
    span_setter={"verb": "VERB", "noun": "NOUN"},
    preset="xml",
)
doc = converter("This <VERB negation=True>is</VERB> not a <NOUN>test</NOUN>.")
print(doc.spans["verb"])
# Out: [is]
print(doc.spans["verb"][0]._.negation)
# Out: True
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| preset | The preset to use for the markup format. Defaults to "md" (Markdown-like syntax). Use "xml" for XML-like syntax. |
| opener | The regex pattern to match the opening tag of the markup. Defaults to the preset's opener. |
| closer | The regex pattern to match the closing tag of the markup. Defaults to the preset's closer. |
| tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer. |
| span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the |
| span_attributes | Mapping from markup attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. |
| keep_raw_attribute_values | Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). |
| default_attributes | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporal" is often "present"), that annotators may not want to annotate every time. |
| bool_attributes | List of boolean attributes to set to False by default. This is useful for attributes that are often not annotated, but you want to have a default value for them. |
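To show what parsing the Markdown-like syntax amounts to, the following sketch extracts the plain text and the span offsets from a marked-up string using a single regular expression. It is only an approximation written for this page: `MD_SPAN` and `parse_markup` are hypothetical, the pattern is not the preset's actual opener or closer, nested spans are not handled, and attribute values are kept as raw strings.

```python
import re
from typing import Any, Dict, List, Tuple

# Hypothetical pattern for "[text](LABEL key=value ...)"
MD_SPAN = re.compile(r"\[(?P<text>[^\]]*)\]\((?P<label>\w+)(?P<attrs>[^)]*)\)")


def parse_markup(markup: str) -> Tuple[str, List[Dict[str, Any]]]:
    """Return the plain text and the spans found in a marked-up string."""
    parts: List[str] = []
    spans: List[Dict[str, Any]] = []
    cursor = 0  # position in the marked-up string
    offset = 0  # length of the plain text built so far
    for match in MD_SPAN.finditer(markup):
        # Copy the text located before the tag
        before = markup[cursor:match.start()]
        parts.append(before)
        offset += len(before)
        # Collect key=value pairs, kept as raw strings
        attributes = {}
        for pair in match.group("attrs").split():
            key, sep, value = pair.partition("=")
            if sep:
                attributes[key] = value
        inner = match.group("text")
        spans.append(
            {
                "start": offset,
                "end": offset + len(inner),
                "label": match.group("label"),
                "attributes": attributes,
            }
        )
        parts.append(inner)
        offset += len(inner)
        cursor = match.end()
    parts.append(markup[cursor:])
    return "".join(parts), spans


text, spans = parse_markup("This [is](VERB negation=True) not a [test](NOUN).")
print(text)
for span in spans:
    print(text[span["start"]:span["end"]], span["label"], span["attributes"])
```

The offsets are expressed on the plain text rather than on the marked-up string, which is what makes them usable as character spans of the resulting document.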
DocToMarkupConverter
Convert a Doc to a string with inline markup.
This is the inverse of MarkupToDocConverter. It renders selected spans as either Markdown-like tags ([text](LABEL key=val ...)) or XML-like tags (<LABEL key=val ...>text</LABEL>).
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| span_getter | Which spans to render from the document. |
| span_attributes | Mapping from Span extensions (or builtins like |
| default_attributes | When an attribute equals its provided default value, it is omitted from the output (e.g., avoid printing |
| preset | Output syntax. |
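The rendering direction can be sketched in the same spirit: given a text and a list of non-overlapping spans, insert the tags and omit attributes that equal their default value. `render_markup` is a hypothetical function written for this example, spans are plain dictionaries rather than Span objects, and overlapping or nested spans are not handled.

```python
from typing import Any, Dict, List, Optional


def render_markup(
    text: str,
    spans: List[Dict[str, Any]],
    default_attributes: Optional[Dict[str, Any]] = None,
    preset: str = "md",
) -> str:
    """Render non-overlapping spans as inline markup (illustrative sketch)."""
    default_attributes = default_attributes or {}
    parts: List[str] = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        # Copy the text located before the span
        parts.append(text[cursor:span["start"]])
        inner = text[span["start"]:span["end"]]
        # Skip attributes that equal their provided default value
        attrs = "".join(
            f" {key}={value}"
            for key, value in span.get("attributes", {}).items()
            if not (key in default_attributes and default_attributes[key] == value)
        )
        label = span["label"]
        if preset == "md":
            parts.append(f"[{inner}]({label}{attrs})")
        elif preset == "xml":
            parts.append(f"<{label}{attrs}>{inner}</{label}>")
        else:
            raise ValueError(f"Unknown preset: {preset!r}")
        cursor = span["end"]
    parts.append(text[cursor:])
    return "".join(parts)


text = "This is not a test."
spans = [
    {"start": 5, "end": 7, "label": "VERB", "attributes": {"negation": True}},
    {"start": 14, "end": 18, "label": "NOUN", "attributes": {"negation": False}},
]
defaults = {"negation": False}
print(render_markup(text, spans, defaults, preset="md"))
print(render_markup(text, spans, defaults, preset="xml"))
```

Because `negation=False` matches the default, the NOUN span is rendered without attributes, which keeps the output close to what an annotator would have typed.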