BRAT and Standoff

TLDR

import edsnlp

stream = edsnlp.data.read_standoff(path)
stream = stream.map_pipeline(nlp)
res = stream.write_standoff(path)
# or equivalently
edsnlp.data.write_standoff(stream, path)

You can easily integrate BRAT into your project by using EDS-NLP's BRAT reader and writer.

BRAT annotations are in the standoff format. Consider the following document:

doc.txt

Le patient est admis pour une pneumopathie au coronavirus.
On lui prescrit du paracétamol.

BRAT annotations are stored in a separate file formatted as follows:

doc.ann

T1	Patient 3 10	patient
T2	Disease 30 57	pneumopathie au coronavirus
T3	Drug 78 89	paracétamol
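As a rough illustration of the format (not EDS-NLP's reader), the sketch below parses the text-bound lines of an `.ann` file with the standard library only. It assumes the usual standoff convention of 0-based character offsets with an exclusive end, so that `text[start:end]` returns the annotated text. The helper name `parse_text_bound` is hypothetical.

```python
def parse_text_bound(ann: str):
    """Parse text-bound ("T") lines into (id, label, start, end, text) tuples.

    Sketch only: discontinuous spans, attributes, relations and notes are ignored.
    """
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue
        # Fields are tab-separated: ID, "LABEL START END", annotated text
        ann_id, label_and_offsets, surface = line.split("\t")
        label, start, end = label_and_offsets.split(" ")
        entities.append((ann_id, label, int(start), int(end), surface))
    return entities


text = (
    "Le patient est admis pour une pneumopathie au coronavirus.\n"
    "On lui prescrit du paracétamol."
)
ann = (
    "T1\tPatient 3 10\tpatient\n"
    "T2\tDisease 30 57\tpneumopathie au coronavirus\n"
    "T3\tDrug 78 89\tparacétamol\n"
)

entities = parse_text_bound(ann)
for ann_id, label, start, end, surface in entities:
    # Slicing the raw text with the offsets should give back the annotated text
    assert text[start:end] == surface
```

A check like this is a cheap way to detect offsets that have drifted, for instance after the `.txt` file was edited without updating the `.ann` file.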

Reading Standoff files

The BratReader (or edsnlp.data.read_standoff) reads a directory of BRAT files and yields documents. At the moment, only entities and attributes are loaded. Relations and events are not supported.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_standoff("path/to/brat/directory")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_standoff returns a Stream. To iterate over the documents multiple times efficiently, or to access them by index, convert it to a list:

docs = list(edsnlp.data.read_standoff("path/to/brat/directory"))

True/False attributes

Boolean values are not supported by the BRAT editor: a true value is stored as a key with an empty value, and a false value is not stored at all. As a result, False is never assigned to an attribute by default, which is a problem when deciding whether an entity is negated: is the entity not negated, or has the negation attribute simply not been annotated?

To avoid this issue, you can use the bool_attributes argument to specify which attributes should be considered as boolean when reading a BRAT dataset. These attributes will be assigned a value of True if they are present, and False otherwise.

doc_iterator = edsnlp.data.read_standoff(
    "path/to/brat/directory",
    span_attributes=["negation", "family"],
    bool_attributes=["negation"],  # Missing values will be set to False
)
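To make the ambiguity concrete, here is a standard-library sketch of the idea behind `bool_attributes`. It is not EDS-NLP's implementation: `read_attributes` is a hypothetical helper, and it assumes binary attributes appear in the `.ann` file as `A1<TAB>negation T2`, with no value after the target.

```python
def read_attributes(ann: str, bool_attributes=()):
    """Collect attribute values per entity id from standoff "A" lines."""
    lines = ann.splitlines()
    # One (initially empty) attribute dict per text-bound entity
    attributes = {line.split("\t")[0]: {} for line in lines if line.startswith("T")}
    for line in lines:
        if not line.startswith("A"):
            continue
        _, body = line.split("\t")
        parts = body.split(" ")
        name, target = parts[0], parts[1]
        # A binary attribute has no value: its mere presence means True
        attributes[target][name] = parts[2] if len(parts) > 2 else True
    for values in attributes.values():
        for name in bool_attributes:
            # Declared boolean attributes default to False when absent
            values.setdefault(name, False)
    return attributes


ann = (
    "T1\tDisease 30 57\tpneumopathie au coronavirus\n"
    "T2\tDrug 78 89\tparacétamol\n"
    "A1\tnegation T2\n"
)

without_defaults = read_attributes(ann)
with_defaults = read_attributes(ann, bool_attributes=["negation"])
```

Without the declaration, `T1` has no `negation` key at all. With it, `T1` gets an explicit `False`.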

Parameters

PARAMETER	DESCRIPTION
`path`	Path to the directory containing the BRAT files (will recursively look for files in subdirectories). TYPE: `Union[str, Path]`
`shuffle`	Whether to shuffle the data. If "dataset", the whole dataset will be shuffled before starting iterating on it (at the start of every epoch if looping). TYPE: `Literal['dataset', False]` DEFAULT: `False`
`seed`	The seed to use for shuffling. TYPE: `Optional[int]` DEFAULT: `None`
`loop`	Whether to loop over the data indefinitely. TYPE: `bool` DEFAULT: `False`
`nlp`	The pipeline object (optional and likely not needed; prefer passing the `tokenizer` argument directly). TYPE: `Optional[PipelineProtocol]`
`tokenizer`	The tokenizer instance used to tokenize the documents. Likely not needed, since by default it uses the current context tokenizer: the tokenizer of the next pipeline run by `.map_pipeline` in a Stream, or the `eds` tokenizer otherwise. TYPE: `Optional[Tokenizer]`
`span_setter`	The span setter to use when setting the spans in the documents. Defaults to setting the spans in the `ents` attribute, and creates a new span group for each BRAT entity label. TYPE: `SpanSetterArg`
`span_attributes`	Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. TYPE: `Optional[AttributesMappingArg]`
`keep_raw_attribute_values`	Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). TYPE: `bool`
`default_attributes`	How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or for frequent attribute values (e.g. "negated" is often False, "temporal" is often "present") that annotators may not want to annotate every time. TYPE: `AttributesMappingArg`
`notes_as_span_attribute`	If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name. TYPE: `Optional[str]`
`split_fragments`	Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span. TYPE: `bool`
`keep_ipynb_checkpoints`	Whether to keep the files that are in the `.ipynb_checkpoints` directory. TYPE: `bool` DEFAULT: `False`
`keep_txt_only_docs`	Whether to keep the `.txt` files that do not have corresponding `.ann` files. TYPE: `bool` DEFAULT: `False`
`converter`	Converter to use to convert the documents to dictionary objects. TYPE: `Optional[AsList[Union[str, Callable]]]` DEFAULT: `['standoff']`
`filesystem`	The filesystem to use to read the files. If None, the filesystem will be inferred from the path (e.g. `s3://` will use S3). TYPE: `Optional[FileSystem]` DEFAULT: `None`

RETURNS	DESCRIPTION
`Stream`
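
To illustrate the `split_fragments` parameter: BRAT stores a discontinuous entity as several `start end` pairs separated by `;`. The sketch below is not EDS-NLP's implementation. `parse_fragments` and `to_spans` are hypothetical helpers, and merging fragments by spanning from the first start to the last end is an assumption about what "concatenated" means here.

```python
def parse_fragments(offsets: str):
    """Parse an offsets field such as "30 42;46 57" into [(30, 42), (46, 57)]."""
    fragments = []
    for fragment in offsets.split(";"):
        start, end = fragment.split(" ")
        fragments.append((int(start), int(end)))
    return fragments


def to_spans(fragments, split_fragments: bool):
    """Return one span per fragment, or a single span covering all of them."""
    if split_fragments:
        return list(fragments)
    # Assumption: the merged span runs from the first start to the last end
    return [(min(s for s, _ in fragments), max(e for _, e in fragments))]


text = "Le patient est admis pour une pneumopathie au coronavirus."
fragments = parse_fragments("30 42;46 57")

split = [text[s:e] for s, e in to_spans(fragments, split_fragments=True)]
merged = [text[s:e] for s, e in to_spans(fragments, split_fragments=False)]
```

Here `split` holds the two fragments as separate strings, while `merged` also covers the text between them.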

Writing Standoff files

edsnlp.data.write_standoff writes a list of documents using the BRAT/Standoff format in a directory. The BRAT files will be named after the note_id attribute of the documents, and subdirectories will be created if the name contains / characters.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_standoff([doc], "path/to/brat/directory")
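
The naming rule described above (files named after `note_id`, with `/` creating subdirectories) can be pictured with the standard-library sketch below. It only mimics the resulting file layout. `write_doc` is a hypothetical helper and the `note_id` value is made up for the example.

```python
import tempfile
from pathlib import Path


def write_doc(directory: Path, note_id: str, text: str, ann: str) -> None:
    """Write one document as <note_id>.txt and <note_id>.ann under directory."""
    txt_path = directory / f"{note_id}.txt"
    # A "/" in note_id becomes a subdirectory, so create the parents first
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(text, encoding="utf-8")
    txt_path.with_suffix(".ann").write_text(ann, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_doc(
        root,
        note_id="hospital-a/doc-1",  # made-up identifier containing a "/"
        text="Le patient est admis.",
        ann="T1\tPatient 3 10\tpatient\n",
    )
    # List the files that were created, relative to the output directory
    written = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

After this runs, `written` lists the `.ann` and `.txt` files inside the `hospital-a` subdirectory.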

Overwriting files

By default, write_standoff will raise an error if the directory already exists and contains files with .a* or .txt suffixes. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
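
A guard of this kind might look like the sketch below. This is an assumption-laden illustration rather than the library's code: `check_writable` is a hypothetical helper, and both the recursive search and the `FileExistsError` exception type are guesses.

```python
import tempfile
from pathlib import Path


def check_writable(directory: Path, overwrite: bool = False) -> None:
    """Refuse to write into a directory that already holds BRAT-like files."""
    if overwrite or not directory.exists():
        return
    for path in directory.rglob("*"):
        # ".a*" covers ".ann" files; ".txt" covers the raw documents
        if path.is_file() and (path.suffix == ".txt" or path.suffix.startswith(".a")):
            raise FileExistsError(f"{directory} already contains {path.name}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "doc.ann").write_text("", encoding="utf-8")

    try:
        check_writable(root)
        blocked = False
    except FileExistsError:
        blocked = True

    # With overwrite=True the same directory is accepted
    check_writable(root, overwrite=True)
    allowed = True
```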

Parameters

PARAMETER	DESCRIPTION
`data`	The data to write (either a list of documents or a Stream). TYPE: `Union[Any, Stream]`
`path`	Path to the directory in which the BRAT files will be written. TYPE: `Union[str, Path]`
`span_getter`	The span getter to use when listing the spans that will be exported as BRAT entities. Defaults to getting the spans in the `ents` attribute.
`span_attributes`	Mapping from BRAT attributes to Span extensions. By default, no attribute will be exported.
`overwrite`	Whether to overwrite existing directories. TYPE: `bool` DEFAULT: `False`
`filesystem`	The filesystem to use to write the files. If None, the filesystem will be inferred from the path (e.g. `s3://` will use S3). TYPE: `Optional[FileSystem]` DEFAULT: `None`
`execute`	Whether to execute the writing operation immediately or to return a stream. TYPE: `bool` DEFAULT: `True`
`converter`	Converter to use to convert the documents to dictionary objects. Defaults to the "standoff" format converter. TYPE: `Optional[Union[str, Callable]]` DEFAULT: `'standoff'`