Schemas
Data can be read from and written to various sources, such as JSON, BRAT or CSV files or dataframes, and arranged following different schemas. The schema defines the structure of the data and how it should be interpreted. This page details the schemas that are currently supported, and how to configure their conversion from and to Doc objects.
These parameters can be passed to the from_* (or read_* in the case of files) and to_* (or write_* in the case of files) methods, depending on the chosen converter argument.
OMOP
OMOP is a schema that is used in the medical domain. It is based on the OMOP Common Data Model. We are mainly interested in the note table, which contains the clinical notes, and deviate from the original schema by adding an optional entities column that can be computed from the note_nlp table.
Therefore, a complete OMOP-style document would look like this:
{
    "note_id": 0,
    "note_text": "Le patient ...",
    "entities": [
        {
            "note_nlp_id": 0,
            "start_char": 3,
            "end_char": 10,
            "lexical_variant": "patient",
            "note_nlp_source_value": "person",
            # optional fields
            "negated": False,
            "certainty": "probable",
            ...
        },
        ...
    ],
    # optional fields
    "custom_doc_field": "...",
    ...
}
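As an illustration of how the entities column could be computed from the note_nlp table, here is a minimal sketch in plain Python. It assumes that each note_nlp row carries a note_id join key and already uses the field names shown above; the helper names attach_entities and check_offsets are hypothetical and are not part of the library.

```python
from collections import defaultdict


def attach_entities(notes, note_nlp_rows):
    """Return copies of the note rows with an `entities` list built from note_nlp rows."""
    by_note = defaultdict(list)
    for row in note_nlp_rows:
        # Drop the join key (assumed to be `note_id`): it is implied by the parent note.
        entity = {key: value for key, value in row.items() if key != "note_id"}
        by_note[row["note_id"]].append(entity)
    docs = []
    for note in notes:
        entities = sorted(by_note.get(note["note_id"], []), key=lambda e: e["start_char"])
        docs.append({**note, "entities": entities})
    return docs


def check_offsets(doc):
    """Return the entities whose character offsets do not match `lexical_variant`."""
    text = doc["note_text"]
    return [
        entity
        for entity in doc["entities"]
        if text[entity["start_char"]:entity["end_char"]] != entity["lexical_variant"]
    ]


notes = [{"note_id": 0, "note_text": "Le patient ..."}]
note_nlp = [
    {
        "note_nlp_id": 0,
        "note_id": 0,
        "start_char": 3,
        "end_char": 10,
        "lexical_variant": "patient",
        "note_nlp_source_value": "person",
    }
]
docs = attach_entities(notes, note_nlp)
```

Running check_offsets on each document before conversion is a cheap way to catch offsets that were computed against a different version of the text.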
Converting OMOP data to Doc objects
PARAMETER | DESCRIPTION |
---|---|
tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer. |
span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the |
doc_attributes | Mapping from JSON attributes to Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name. |
span_attributes | Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. |
bool_attributes | List of attributes for which missing values should be set to False. |
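To make the interaction between span_attributes and bool_attributes concrete, the following sketch resolves the attributes of one JSON entity in plain Python. It is not the library's implementation: the function import_span_attributes, the RESERVED set, and the choice to apply bool_attributes to extension names are all assumptions made for illustration.

```python
# Structural OMOP fields that are assumed not to be imported as attributes.
RESERVED = {
    "note_nlp_id",
    "start_char",
    "end_char",
    "lexical_variant",
    "note_nlp_source_value",
}


def import_span_attributes(entity, span_attributes=None, bool_attributes=()):
    """Return {extension_name: value} for one JSON entity (hypothetical helper)."""
    if span_attributes is None:
        # Default described above: import every attribute under the same name.
        mapping = {key: key for key in entity if key not in RESERVED}
    elif isinstance(span_attributes, dict):
        # Explicit mapping: JSON attribute name -> extension name.
        mapping = dict(span_attributes)
    else:
        # A list keeps the same name on both sides.
        mapping = {key: key for key in span_attributes}
    values = {ext: entity[key] for key, ext in mapping.items() if key in entity}
    for name in bool_attributes:
        # Missing boolean attributes default to False.
        values.setdefault(name, False)
    return values


entity = {
    "note_nlp_id": 0,
    "start_char": 3,
    "end_char": 10,
    "lexical_variant": "patient",
    "note_nlp_source_value": "person",
    "certainty": "probable",
}
default_import = import_span_attributes(entity, bool_attributes=["negated"])
renamed_import = import_span_attributes(entity, {"certainty": "cert"})
```

In this sketch, default_import contains both certainty and a negated flag set to False, while renamed_import exposes the same value under the name cert.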
Converting Doc objects to OMOP data
PARAMETER | DESCRIPTION |
---|---|
span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the |
doc_attributes | Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except |
span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. |
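The export direction can be sketched the same way. The example below builds an OMOP-style dictionary from a text and a list of spans, exporting only the extensions listed in span_attributes. SketchSpan is a hypothetical stand-in for a real Span, and two choices are assumptions rather than documented behavior: note_nlp_id is taken from the span's rank, and note_nlp_source_value holds the span label.

```python
from dataclasses import dataclass, field


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a Span: offsets, a label and extension values."""

    start_char: int
    end_char: int
    label: str
    extensions: dict = field(default_factory=dict)


def to_omop(note_id, text, spans, span_attributes=None):
    """Build an OMOP-style dict; no span attribute is exported unless requested."""
    if span_attributes is None:
        mapping = {}
    elif isinstance(span_attributes, dict):
        # Explicit mapping: extension name -> JSON attribute name.
        mapping = dict(span_attributes)
    else:
        mapping = {name: name for name in span_attributes}
    entities = []
    for index, span in enumerate(sorted(spans, key=lambda s: s.start_char)):
        entity = {
            "note_nlp_id": index,  # assumption: rank of the span in the document
            "start_char": span.start_char,
            "end_char": span.end_char,
            "lexical_variant": text[span.start_char:span.end_char],
            "note_nlp_source_value": span.label,  # assumption: the span label
        }
        for ext, key in mapping.items():
            entity[key] = span.extensions.get(ext)
        entities.append(entity)
    return {"note_id": note_id, "note_text": text, "entities": entities}


spans = [SketchSpan(3, 10, "person", {"negation": False})]
plain = to_omop(0, "Le patient ...", spans)
with_attrs = to_omop(0, "Le patient ...", spans, span_attributes={"negation": "negated"})
```

Here plain carries only the structural fields of each entity, while with_attrs also exports the negation extension under the JSON name negated.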
Standoff
Standoff refers mostly to the BRAT standoff format, but doesn't indicate how the annotations should be stored in a JSON-like schema. We use the following schema:
{
    "doc_id": 0,
    "text": "Le patient ...",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{
                "start": 0,
                "end": 10
            }],
            "attributes": {
                "negated": True,
                "certainty": "probable"
            }
        },
        ...
    ]
}
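To show how this JSON-like schema relates to the BRAT standoff files it is derived from, here is a minimal sketch that serializes one such document into .ann lines. The function to_brat_ann is hypothetical. It follows the usual BRAT conventions (T lines for text-bound annotations, A lines for attributes, fragments separated by semicolons) and assumes that boolean attributes set to False are simply left out.

```python
def to_brat_ann(doc):
    """Serialize a Standoff-style dict into BRAT .ann content (illustrative only)."""
    text = doc["text"]
    lines = []
    attribute_index = 0
    for entity_index, entity in enumerate(doc["entities"], start=1):
        fragments = entity["fragments"]
        # Discontinuous entities list their fragments separated by ";".
        offsets = ";".join(f"{frag['start']} {frag['end']}" for frag in fragments)
        # The reference text joins the fragment texts with a single space.
        mention = " ".join(text[frag["start"]:frag["end"]] for frag in fragments)
        lines.append(f"T{entity_index}\t{entity['label']} {offsets}\t{mention}")
        for name, value in entity.get("attributes", {}).items():
            if value is False or value is None:
                # Assumption: a false binary attribute is expressed by its absence.
                continue
            attribute_index += 1
            if value is True:
                lines.append(f"A{attribute_index}\t{name} T{entity_index}")
            else:
                lines.append(f"A{attribute_index}\t{name} T{entity_index} {value}")
    return "\n".join(lines)


doc = {
    "doc_id": 0,
    "text": "Le patient ...",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{"start": 0, "end": 10}],
            "attributes": {"negated": True, "certainty": "probable"},
        }
    ],
}
ann = to_brat_ann(doc)
```

For the document above, ann holds one T line for the entity followed by two A lines, one per attribute.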
Converting Standoff data to Doc objects
PARAMETER | DESCRIPTION |
---|---|
tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed since by default it uses the current context tokenizer. |
span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the |
span_attributes | Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. |
keep_raw_attribute_values | Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). |
bool_attributes | List of attributes for which missing values should be set to False. |
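The effect of keep_raw_attribute_values can be pictured with the sketch below, which converts raw string values into Python objects unless asked to keep them. The exact conversion rules applied by the library are not specified here, so the rules in parse_attribute_value (a hypothetical helper handling booleans and integers only) are assumptions.

```python
def parse_attribute_value(raw, keep_raw=False):
    """Convert a raw attribute string to a Python object (assumed rules)."""
    if keep_raw or not isinstance(raw, str):
        return raw
    lowered = raw.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    try:
        return int(raw)
    except ValueError:
        # Anything else stays a string, e.g. "probable".
        return raw


raw_attributes = {"negated": "True", "certainty": "probable", "dose": "3"}
converted = {name: parse_attribute_value(value) for name, value in raw_attributes.items()}
kept = {name: parse_attribute_value(value, keep_raw=True) for name, value in raw_attributes.items()}
```

Keeping the raw strings is mainly useful when the values must round-trip unchanged to the source files.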
Converting Doc objects to Standoff data
PARAMETER | DESCRIPTION |
---|---|
span_getter | The span getter to use when getting the spans from the documents. Defaults to getting the spans in the |
span_attributes | Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported, except |