Schemas

Data can be read from and written to various sources, such as JSON/BRAT/CSV files or dataframes, and arranged following different schemas. A schema defines the structure of the data and how it should be interpreted. We detail here the schemas that are currently supported, and how to configure their conversion to and from Doc objects.

These parameters can be passed to the from_* (or read_* in the case of files) and to_* (or write_* in the case of files) methods, depending on the chosen converter argument.

OMOP

OMOP is a schema that is used in the medical domain. It is based on the OMOP Common Data Model. We are mainly interested in the note table, which contains the clinical notes, and deviate from the original schema by adding an optional entities column that can be computed from the note_nlp table.

Therefore, a complete OMOP-style document would look like this:

{
  "note_id": 0,
  "note_text": "Le patient ...",
  "entities": [
    {
      "note_nlp_id": 0,
      "start_char": 3,
      "end_char": 10,
      "lexical_variant": "patient",
      "note_nlp_source_value": "person",

      # optional fields
      "negated": False,
      "certainty": "probable",
      ...
    },
    ...
  ],

  # optional fields
  "custom_doc_field": "...",
  ...
}
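The `entities` column can be derived by grouping `note_nlp` rows by note. The following is a minimal sketch in plain Python, not the library's implementation: it assumes each `note_nlp` row carries a `note_id` key pointing at its note, and the helper name `attach_entities` is hypothetical.

```python
from collections import defaultdict


def attach_entities(notes, note_nlp_rows):
    """Return OMOP-style documents with an `entities` list per note.

    Hypothetical helper: assumes every note_nlp row has a `note_id` key
    identifying the note it annotates.
    """
    by_note = defaultdict(list)
    for row in note_nlp_rows:
        # Drop the foreign key: it is implied once the entity is nested.
        entity = {k: v for k, v in row.items() if k != "note_id"}
        by_note[row["note_id"]].append(entity)

    docs = []
    for note in notes:
        # Sort entities by position so they follow the order of the text.
        entities = sorted(
            by_note.get(note["note_id"], []),
            key=lambda e: e["start_char"],
        )
        docs.append({**note, "entities": entities})
    return docs


notes = [{"note_id": 0, "note_text": "Le patient ..."}]
note_nlp_rows = [
    {
        "note_nlp_id": 0,
        "note_id": 0,
        "start_char": 3,
        "end_char": 10,
        "lexical_variant": "patient",
        "note_nlp_source_value": "person",
    },
]
docs = attach_entities(notes, note_nlp_rows)
```

Notes without any `note_nlp` row simply get an empty `entities` list.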

Converting OMOP data to Doc objects

PARAMETER	DESCRIPTION
`tokenizer`	The tokenizer instance used to tokenize the documents. Usually not needed: by default, the current context tokenizer is used, i.e. the tokenizer of the next pipeline run by `.map_pipeline` in a LazyCollection, or otherwise the `eds` tokenizer. TYPE: `Optional[PipelineProtocol]` DEFAULT: `None`
`span_setter`	The span setter to use when setting the spans in the documents. Defaults to setting the spans in the `ents` attribute, and creates a new span group for each JSON entity label. TYPE: `SpanSetterArg` DEFAULT: `{'ents': True, '*': True}`
`doc_attributes`	Mapping from JSON attributes to Doc extensions (can be a list too). By default, all attributes are imported as Doc extensions with the same name. TYPE: `AttributesMappingArg` DEFAULT: `{}`
`span_attributes`	Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. TYPE: `Optional[AttributesMappingArg]` DEFAULT: `None`
`bool_attributes`	List of attributes for which missing values should be set to False. TYPE: `SequenceStr` DEFAULT: `[]`
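To illustrate how `span_attributes` and `bool_attributes` can interact, here is a plain-Python sketch that does not call the library. The function names `normalize_mapping` and `import_attributes` are hypothetical, and two behaviours are assumptions on our part: that a list is treated as an identity mapping, and that `bool_attributes` names refer to the target extensions.

```python
def normalize_mapping(mapping):
    """Turn a list of names into an identity mapping; copy dicts as they are."""
    if isinstance(mapping, dict):
        return dict(mapping)
    return {name: name for name in mapping}


def import_attributes(json_attrs, span_attributes=None, bool_attributes=()):
    """Select and rename the JSON attributes of one entity.

    span_attributes=None mimics the documented default: every attribute
    is imported under its own name.
    """
    if span_attributes is None:
        mapping = {name: name for name in json_attrs}
    else:
        mapping = normalize_mapping(span_attributes)

    # Keep only the mapped attributes that are present on this entity.
    values = {ext: json_attrs[key] for key, ext in mapping.items() if key in json_attrs}

    # Missing boolean attributes default to False instead of staying unset.
    for name in bool_attributes:
        values.setdefault(name, False)
    return values


imported = import_attributes(
    {"certainty": "probable"},
    span_attributes={"certainty": "cert"},
    bool_attributes=["negated"],
)
```

Here `imported` holds the renamed `cert` value plus a `negated` entry set to `False`, since the source entity did not define it.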

Converting Doc objects to OMOP data

PARAMETER	DESCRIPTION
`span_getter`	The span getter to use when getting the spans from the documents. Defaults to getting the spans in the `ents` attribute. TYPE: `SpanGetterArg` DEFAULT: `{'ents': True}`
`doc_attributes`	Mapping from Doc extensions to JSON attributes (can be a list too). By default, no doc attribute is exported, except `note_id`. TYPE: `AttributesMappingArg` DEFAULT: `{}`
`span_attributes`	Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported. TYPE: `AttributesMappingArg` DEFAULT: `{}`

Standoff

Standoff refers mostly to the BRAT standoff format, which does not specify how the annotations should be stored in a JSON-like schema. We use the following schema:

{
  "doc_id": 0,
  "text": "Le patient ...",
  "entities": [
    {
      "entity_id": 0,
      "label": "drug",
      "fragments": [{
        "start": 0,
        "end": 10
      }],
      "attributes": {
        "negated": True,
        "certainty": "probable"
      }
    },
    ...
  ]
}
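To make the link with BRAT concrete, the sketch below parses the text-bound (`T`) and attribute (`A`) lines of a `.ann` file into entities shaped like the schema above. It is an illustration in plain Python, not the library's reader: the function name `parse_ann` is hypothetical, other BRAT line types (relations, events, notes) are ignored, and the BRAT identifier (e.g. `T1`) is kept as `entity_id` instead of an integer.

```python
def parse_ann(ann_text):
    """Parse T and A lines of a BRAT .ann file into Standoff-style entities."""
    entities = {}
    attribute_lines = []

    for line in ann_text.splitlines():
        if line.startswith("T"):
            # Format: T1<TAB>label start end[;start end...]<TAB>covered text
            parts = line.split("\t")
            ann_id, span_info = parts[0], parts[1]
            label, offsets = span_info.split(" ", 1)
            fragments = []
            for chunk in offsets.split(";"):  # discontinuous spans use ';'
                start, end = chunk.split()
                fragments.append({"start": int(start), "end": int(end)})
            entities[ann_id] = {
                "entity_id": ann_id,
                "label": label,
                "fragments": fragments,
                "attributes": {},
            }
        elif line.startswith("A"):
            # Defer attributes until all entities are known.
            attribute_lines.append(line)

    for line in attribute_lines:
        # Format: A1<TAB>name target [value]; no value means a binary flag.
        _ann_id, body = line.split("\t", 1)
        name, target, *value = body.split()
        entities[target]["attributes"][name] = value[0] if value else True

    return list(entities.values())


text = "Le patient ..."
ann = "T1\tdrug 0 10\tLe patient\nA1\tnegated T1\nA2\tcertainty T1 probable\n"
doc = {"doc_id": 0, "text": text, "entities": parse_ann(ann)}
```

Storing offsets as a list of `fragments` is what lets a single entity cover several disjoint stretches of text, as BRAT allows.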

Converting Standoff data to Doc objects

PARAMETER	DESCRIPTION
`tokenizer`	The tokenizer instance used to tokenize the documents. Usually not needed: by default, the current context tokenizer is used, i.e. the tokenizer of the next pipeline run by `.map_pipeline` in a LazyCollection, or otherwise the `eds` tokenizer. TYPE: `Optional[Tokenizer]` DEFAULT: `None`
`span_setter`	The span setter to use when setting the spans in the documents. Defaults to setting the spans in the `ents` attribute, and creates a new span group for each JSON entity label. TYPE: `SpanSetterArg` DEFAULT: `{'ents': True, '*': True}`
`span_attributes`	Mapping from JSON attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. TYPE: `Optional[AttributesMappingArg]` DEFAULT: `None`
`keep_raw_attribute_values`	Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). TYPE: `bool` DEFAULT: `False`
`bool_attributes`	List of attributes for which missing values should be set to False. TYPE: `SequenceStr` DEFAULT: `[]`
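As an illustration of what converting raw attribute values to Python objects can mean, the sketch below turns the strings found in annotation files into booleans and numbers. The exact conversion rules applied by the library are not stated here, so the rules in this hypothetical `parse_raw_value` helper are assumptions.

```python
def parse_raw_value(raw):
    """Convert a raw attribute string to a bool, int or float when possible.

    Assumed rules, for illustration only: "true"/"false" (any casing)
    become booleans, numeric strings become numbers, the rest stays a string.
    """
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


raw_attributes = {"negated": "True", "dose": "3", "certainty": "probable"}
converted = {name: parse_raw_value(value) for name, value in raw_attributes.items()}
```

Keeping the raw values instead amounts to skipping this step and storing every attribute as the string read from the file.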

Converting Doc objects to Standoff data

PARAMETER	DESCRIPTION
`span_getter`	The span getter to use when getting the spans from the documents. Defaults to getting the spans in the `ents` attribute. TYPE: `Optional[SpanGetterArg]` DEFAULT: `None`
`span_attributes`	Mapping from Span extensions to JSON attributes (can be a list too). By default, no attribute is exported, except `note_id`. TYPE: `AttributesMappingArg` DEFAULT: `{}`