BRAT and Standoff

TLDR
import edsnlp

doc_iterator = edsnlp.data.read_standoff(path)
res = edsnlp.data.write_standoff(docs, path)

You can easily integrate BRAT into your project by using EDS-NLP's BRAT reader and writer.

BRAT annotations are in the standoff format. Consider the following document:

doc.txt
Le patient est admis pour une pneumopathie au coronavirus.
On lui prescrit du paracétamol.

BRAT annotations are stored in a separate file, one annotation per line, with character offsets that point into doc.txt:

doc.ann
T1  Patient 3 10    patient
T2  Disease 30 57   pneumopathie au coronavirus
T3  Drug 78 89  paracétamol
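
BRAT character offsets are zero-based, and the end offset is the index of the first character after the span. The sketch below is only an illustration of that convention, not part of EDS-NLP: it locates each mention of the example document with `str.find` and formats tab-separated standoff lines. The `mentions` list and the variable names are made up for this example.

```python
# Illustrative only: derive BRAT-style offsets for the example document.
text = (
    "Le patient est admis pour une pneumopathie au coronavirus.\n"
    "On lui prescrit du paracétamol."
)

# (label, surface form) pairs taken from the example annotations.
mentions = [
    ("Patient", "patient"),
    ("Disease", "pneumopathie au coronavirus"),
    ("Drug", "paracétamol"),
]

ann_lines = []
for i, (label, surface) in enumerate(mentions, start=1):
    start = text.find(surface)   # zero-based index of the first character
    end = start + len(surface)   # index of the first character after the span
    assert text[start:end] == surface
    # Standoff layout: ID <TAB> label start end <TAB> covered text
    ann_lines.append(f"T{i}\t{label} {start} {end}\t{surface}")

print("\n".join(ann_lines))
```

Slicing `text[start:end]` with the offsets of any entity line should give back exactly the text in its last column, which is a quick way to check a .ann file by hand.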

Reading Standoff files

The BratReader (or edsnlp.data.read_standoff) reads a directory of BRAT files and yields documents. At the moment, only entities and attributes are loaded. Relations and events are not supported.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_standoff("path/to/brat/directory")
annotated_docs = nlp.pipe(doc_iterator)
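
To make the reading step more concrete, here is a standard-library sketch of what loading a BRAT directory involves. It is not EDS-NLP's implementation: `read_standoff_dir`, the dictionary layout, the `Negated` attribute and the idea of deriving `note_id` from the relative file path are all assumptions made for this example, and it only handles contiguous spans. Like the reader described above, it keeps entities and attributes and skips relations and events.

```python
import tempfile
from pathlib import Path


def read_standoff_dir(root):
    """Yield one dict per .txt file found under root (searched recursively)."""
    root = Path(root)
    for txt_path in sorted(root.rglob("*.txt")):
        text = txt_path.read_text(encoding="utf-8")
        ann_path = txt_path.with_suffix(".ann")
        lines = []
        if ann_path.exists():
            lines = ann_path.read_text(encoding="utf-8").splitlines()

        entities = {}
        for line in lines:
            if line.startswith("T"):
                # T<id> <TAB> <label> <start> <end> <TAB> <covered text>
                ent_id, span, _ = line.split("\t", 2)
                label, start, end = span.split(" ")
                entities[ent_id] = {
                    "label": label,
                    "start": int(start),
                    "end": int(end),
                    "attributes": {},
                }
        for line in lines:
            if line.startswith("A"):
                # A<id> <TAB> <name> <target> [<value>]
                _, body = line.split("\t", 1)
                name, target, *value = body.split(" ")
                if target in entities:
                    entities[target]["attributes"][name] = value[0] if value else True
        # Lines starting with R (relations) or E (events) are never read.

        yield {
            "note_id": txt_path.relative_to(root).with_suffix("").as_posix(),
            "text": text,
            "entities": list(entities.values()),
        }


# Small demonstration on a temporary directory with one nested document.
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "ward"
    sub.mkdir()
    (sub / "doc.txt").write_text("On lui prescrit du paracétamol.", encoding="utf-8")
    (sub / "doc.ann").write_text(
        "T1\tDrug 19 30\tparacétamol\nA1\tNegated T1\n", encoding="utf-8"
    )
    docs = list(read_standoff_dir(tmp))

print(docs[0]["note_id"], docs[0]["entities"])
```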

Generator vs list

edsnlp.data.read_standoff returns a LazyCollection. To iterate over the documents several times efficiently, or to access them by index, convert it to a list:

docs = list(edsnlp.data.read_standoff("path/to/brat/directory"))

Parameters

PARAMETER DESCRIPTION
path

Path to the directory containing the BRAT files (will recursively look for files in subdirectories).

TYPE: Union[str, Path]

nlp

The pipeline instance (defaults to edsnlp.blank("eds")) used to tokenize the documents.

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each BRAT entity label.

span_attributes

Mapping from BRAT attributes to Span extensions.

RETURNS DESCRIPTION
LazyCollection

Writing Standoff files

edsnlp.data.write_standoff writes a list of documents to a directory in the BRAT/Standoff format. The BRAT files are named after the note_id attribute of each document, and subdirectories are created if the name contains / characters.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_standoff([doc], "path/to/brat/directory")
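
The naming rule described above (files named after note_id, with / creating subdirectories) can be pictured with a standard-library sketch. This is not how EDS-NLP writes files internally: `write_standoff_doc`, its arguments and the `ward-a/doc-1` identifier are hypothetical and only illustrate the resulting directory layout.

```python
import tempfile
from pathlib import Path


def write_standoff_doc(root, note_id, text, entities):
    """Write <note_id>.txt and <note_id>.ann under root, creating subdirectories."""
    txt_path = Path(root) / f"{note_id}.txt"
    # A note_id such as "ward-a/doc-1" places the files in root/ward-a/.
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(text, encoding="utf-8")
    ann = "".join(
        f"T{i}\t{label} {start} {end}\t{text[start:end]}\n"
        for i, (label, start, end) in enumerate(entities, start=1)
    )
    txt_path.with_suffix(".ann").write_text(ann, encoding="utf-8")
    return txt_path


with tempfile.TemporaryDirectory() as tmp:
    write_standoff_doc(
        tmp,
        note_id="ward-a/doc-1",
        text="On lui prescrit du paracétamol.",
        entities=[("Drug", 19, 30)],  # (label, start, end) with exclusive end
    )
    created = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
    ann_text = (Path(tmp) / "ward-a" / "doc-1.ann").read_text(encoding="utf-8")

print(created)
```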

Overwriting files

By default, write_standoff will raise an error if the directory already exists and contains files with .a* or .txt suffixes. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
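
The safety check can be sketched as follows. This is a rough illustration under stated assumptions rather than the library's code: `check_output_dir` is a hypothetical helper, and both the exception type (`FileExistsError`) and the recursive search are guesses about behavior the text above does not specify.

```python
import tempfile
from pathlib import Path


def check_output_dir(path, overwrite=False):
    """Raise if path already holds .txt or .a* files, unless overwrite is set."""
    path = Path(path)
    if not path.exists():
        return
    existing = [
        p
        for p in path.rglob("*")
        if p.is_file() and (p.suffix == ".txt" or p.suffix.startswith(".a"))
    ]
    if existing and not overwrite:
        raise FileExistsError(
            f"{path} already contains {len(existing)} BRAT file(s); "
            "pass overwrite=True to replace them"
        )


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "doc.ann").write_text("", encoding="utf-8")
    try:
        check_output_dir(tmp)
        blocked = False
    except FileExistsError:
        blocked = True
    # With overwrite=True the same directory is accepted.
    check_output_dir(tmp, overwrite=True)

print(blocked)
```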

Parameters

PARAMETER DESCRIPTION
data

The data to write (either a list of documents or a LazyCollection).

TYPE: Union[Any, LazyCollection]

path

Path to the directory where the BRAT files will be written (subdirectories are created when a note_id contains / characters).

TYPE: Union[str, Path]

span_getter

The span getter to use when listing the spans that will be exported as BRAT entities. Defaults to getting the spans in the ents attribute.

span_attributes

Mapping from BRAT attributes to Span extensions. By default, no attribute will be exported.

overwrite

Whether to overwrite existing files in the output directory.

TYPE: bool DEFAULT: False

converter

Converter to use to convert the documents to dictionary objects. Defaults to the "standoff" format converter.

TYPE: Optional[Union[str, Callable]] DEFAULT: 'standoff'