BRAT and Standoff

TLDR
import edsnlp

stream = edsnlp.data.read_standoff(path)
stream = stream.map_pipeline(nlp)
res = stream.write_standoff(path)
# or equivalently
edsnlp.data.write_standoff(stream, path)

You can easily integrate BRAT into your project by using EDS-NLP's BRAT reader and writer.

BRAT annotations are in the standoff format. Consider the following document:

doc.txt
Le patient est admis pour une pneumopathie au coronavirus.
On lui prescrit du paracétamol.

BRAT annotations are stored in a separate file, formatted as follows:

doc.ann
T1  Patient 3 10    patient
T2  Disease 30 57   pneumopathie au coronavirus
T3  Drug 78 89  paracétamol
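The offsets in an entity line can be derived from the text itself. The following is a minimal sketch and not part of EDS-NLP: it assumes the standoff convention of zero-based character offsets with an exclusive end, and `to_standoff_line` is a hypothetical helper introduced for illustration.

```python
# Illustrative only: build BRAT entity lines by hand from the example text.
text = (
    "Le patient est admis pour une pneumopathie au coronavirus.\n"
    "On lui prescrit du paracétamol."
)


def to_standoff_line(index, label, surface, text):
    """Format one text-bound annotation as ID<TAB>LABEL START END<TAB>TEXT."""
    start = text.index(surface)  # zero-based offset of the first character
    end = start + len(surface)   # offset of the first character after the span
    return f"T{index}\t{label} {start} {end}\t{surface}"


entities = [
    ("Patient", "patient"),
    ("Disease", "pneumopathie au coronavirus"),
    ("Drug", "paracétamol"),
]
lines = [
    to_standoff_line(i, label, surface, text)
    for i, (label, surface) in enumerate(entities, start=1)
]
print("\n".join(lines))
```

Slicing the text with the printed offsets (`text[start:end]`) should give back the annotated surface form, which is a quick way to sanity-check a hand-written .ann file.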

Reading Standoff files

The BratReader (or edsnlp.data.read_standoff) reads a directory of BRAT files and yields documents. At the moment, only entities and attributes are loaded. Relations and events are not supported.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_standoff("path/to/brat/directory")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_standoff returns a Stream. To iterate over the documents several times efficiently, or to access them by index, convert the stream to a list:

docs = list(edsnlp.data.read_standoff("path/to/brat/directory"))
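The cost of iterating lazily more than once can be illustrated without any BRAT files. In this sketch, `CountingSource` is a hypothetical stand-in for a lazily evaluated stream (it is not an EDS-NLP class), and the `reads` counter stands for the work of reading and parsing one file.

```python
class CountingSource:
    """Hypothetical stand-in for a lazily evaluated stream of documents."""

    def __init__(self, texts):
        self.texts = texts
        self.reads = 0  # number of documents produced so far

    def __iter__(self):
        for text in self.texts:
            self.reads += 1  # stands for reading and parsing one file
            yield text


source = CountingSource(["doc one", "doc two", "doc three"])

# Iterating twice over the lazy source repeats the work each time.
first_pass = [t for t in source]
second_pass = [t for t in source]
reads_when_lazy = source.reads

# Materializing once pays the cost a single time and enables indexing.
docs = list(source)
reads_after_list = source.reads
second_doc = docs[1]
again = [t for t in docs]  # served from memory, no further reads
```

The trade-off is memory: a list holds every document at once, so it suits datasets that fit comfortably in RAM.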

True/False attributes

The BRAT editor does not support Boolean values: a true attribute is stored as a key with an empty value, and a false attribute is not stored at all. As a result, False is never assigned to an attribute by default, which makes a missing negation attribute ambiguous: is the entity not negated, or was negation simply not annotated?

To avoid this issue, you can use the bool_attributes argument to specify which attributes should be considered as boolean when reading a BRAT dataset. These attributes will be assigned a value of True if they are present, and False otherwise.

doc_iterator = edsnlp.data.read_standoff(
    "path/to/brat/directory",
    span_attributes=["negation", "family"],
    bool_attributes=["negation"],  # Missing values will be set to False
)
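To make the presence/absence semantics concrete, here is a rough pure-Python sketch of how attribute lines could be interpreted. It is not EDS-NLP's implementation: `parse_attributes` is a hypothetical helper, and the sample lines assume BRAT's attribute syntax, where a binary attribute is written as `A1<TAB>name target` with no value.

```python
def parse_attributes(ann_lines, bool_attributes=()):
    """Collect attribute values per entity ID from BRAT annotation lines."""
    entity_ids = []
    attributes = {}
    for line in ann_lines:
        ident, body = line.split("\t")[:2]
        parts = body.split(" ")
        if ident.startswith("T"):  # text-bound annotation (entity)
            entity_ids.append(ident)
            attributes.setdefault(ident, {})
        elif ident.startswith("A"):  # attribute attached to an entity
            name, target = parts[0], parts[1]
            # A binary attribute has no value after the target ID.
            value = parts[2] if len(parts) > 2 else True
            attributes.setdefault(target, {})[name] = value
    # Attributes declared as boolean default to False when absent.
    for ident in entity_ids:
        for name in bool_attributes:
            attributes[ident].setdefault(name, False)
    return attributes


# Annotations for the text "fever, no cough": only T2 carries "negation".
ann_lines = [
    "T1\tDisease 0 5\tfever",
    "T2\tDisease 10 15\tcough",
    "A1\tnegation T2",
]
without_bool = parse_attributes(ann_lines)
with_bool = parse_attributes(ann_lines, bool_attributes=["negation"])
```

Without the declaration, T1 simply has no `negation` key; with it, T1 gets an explicit False, which removes the ambiguity described above.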

Parameters

PARAMETER DESCRIPTION
path

Path to the directory containing the BRAT files (will recursively look for files in subdirectories).

TYPE: Union[str, Path]

shuffle

Whether to shuffle the data. If "dataset", the whole dataset is shuffled before iteration starts (at the start of every epoch if looping).

TYPE: Literal['dataset', False] DEFAULT: False

seed

The seed to use for shuffling.

TYPE: Optional[int] DEFAULT: None

loop

Whether to loop over the data indefinitely.

TYPE: bool DEFAULT: False

nlp

The pipeline object (optional and likely not needed; prefer passing the tokenizer argument directly).

TYPE: Optional[PipelineProtocol]

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed, since by default the current context tokenizer is used:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer]

span_setter

The span setter to use when setting the spans in the documents. Defaults to setting the spans in the ents attribute, and creates a new span group for each BRAT entity label.

TYPE: SpanSetterArg

span_attributes

Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name.

TYPE: Optional[AttributesMappingArg]

keep_raw_attribute_values

Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans).

TYPE: bool

default_attributes

How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or for frequent attribute values (e.g. "negated" is often False, "temporal" is often "present") that annotators may not want to annotate every time.

TYPE: AttributesMappingArg

notes_as_span_attribute

If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name.

TYPE: Optional[str]

split_fragments

Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span.

TYPE: bool

keep_ipynb_checkpoints

Whether to keep the files that are in the .ipynb_checkpoints directory.

TYPE: bool DEFAULT: False

keep_txt_only_docs

Whether to keep the .txt files that do not have corresponding .ann files.

TYPE: bool DEFAULT: False

converter

Converter to use to convert the documents to dictionary objects.

TYPE: Optional[AsList[Union[str, Callable]]] DEFAULT: ['standoff']

filesystem

The filesystem to use to read the files. If None, the filesystem will be inferred from the path (e.g. s3:// will use S3).

TYPE: Optional[FileSystem] DEFAULT: None

RETURNS DESCRIPTION
Stream
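The split_fragments parameter concerns BRAT's discontinuous entities, whose offsets are written as several `start end` pairs separated by semicolons. The sketch below shows one plausible reading of the two modes. It is not EDS-NLP's code: `parse_fragments` and `to_spans` are hypothetical helpers, and the merged case assumes the single span runs from the first fragment's start to the last fragment's end.

```python
def parse_fragments(offsets):
    """Parse an offsets field such as "0 4;22 29" into (start, end) pairs."""
    return [
        tuple(int(n) for n in fragment.split(" "))
        for fragment in offsets.split(";")
    ]


def to_spans(fragments, split_fragments=True):
    """Return one span per fragment, or a single covering span."""
    if split_fragments:
        return list(fragments)
    # Assumption: the merged span covers everything between the first start
    # and the last end, including the text between fragments.
    return [(min(s for s, _ in fragments), max(e for _, e in fragments))]


text = "pain in the chest and abdomen"
fragments = parse_fragments("0 4;22 29")  # "pain" ... "abdomen"

split_spans = to_spans(fragments, split_fragments=True)
merged_spans = to_spans(fragments, split_fragments=False)

split_texts = [text[s:e] for s, e in split_spans]
merged_texts = [text[s:e] for s, e in merged_spans]
```

Under this assumption, merging keeps the entity as one contiguous span at the price of including the words in between, while splitting keeps exact boundaries but produces two spans for one annotation.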

Writing Standoff files

edsnlp.data.write_standoff writes a list of documents using the BRAT/Standoff format in a directory. The BRAT files will be named after the note_id attribute of the documents, and subdirectories will be created if the name contains / characters.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_standoff([doc], "path/to/brat/directory")
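The naming rule based on note_id can be pictured with plain pathlib code. This is only a sketch of the described behavior, not the writer's implementation: `standoff_paths` is a hypothetical helper, and the note_id value "cardiology/doc-1" is made up for the example.

```python
import tempfile
from pathlib import Path


def standoff_paths(root, note_id):
    """Return the (.txt, .ann) paths for a document identified by note_id."""
    base = Path(root) / note_id  # "/" in note_id becomes nested directories
    # Append the suffix to the name so dots inside note_id are preserved.
    return (
        base.with_name(base.name + ".txt"),
        base.with_name(base.name + ".ann"),
    )


with tempfile.TemporaryDirectory() as root:
    txt_path, ann_path = standoff_paths(root, "cardiology/doc-1")
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text("Le patient est admis.", encoding="utf-8")
    ann_path.write_text("T1\tPatient 3 10\tpatient\n", encoding="utf-8")
    written = sorted(
        p.relative_to(root).as_posix()
        for p in Path(root).rglob("*")
        if p.is_file()
    )
```

Here `written` lists the two files relative to the output directory, both under a `cardiology` subdirectory.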

Overwriting files

By default, write_standoff will raise an error if the directory already exists and contains files with .a* or .txt suffixes. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
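A guard of this kind could look roughly like the following. This is a sketch under assumptions, not the library's actual check: `check_output_dir` is a hypothetical function, the search is assumed to be recursive, and the exception type is chosen for illustration.

```python
import tempfile
from pathlib import Path


def check_output_dir(path, overwrite=False):
    """Raise if path already holds .txt or .a* files and overwrite is off."""
    root = Path(path)
    if not root.exists():
        return
    existing = [
        p
        for p in root.rglob("*")
        if p.is_file() and (p.suffix == ".txt" or p.suffix.startswith(".a"))
    ]
    if existing and not overwrite:
        raise FileExistsError(
            f"{root} already contains {len(existing)} BRAT file(s)"
        )


with tempfile.TemporaryDirectory() as root:
    check_output_dir(root)  # empty directory: nothing to protect
    (Path(root) / "doc.ann").write_text("", encoding="utf-8")
    try:
        check_output_dir(root)
        refused = False
    except FileExistsError:
        refused = True
    check_output_dir(root, overwrite=True)  # explicit opt-in: no error
```

Failing by default is the conservative choice here, since manual annotations are expensive to recreate.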

Parameters

PARAMETER DESCRIPTION
data

The data to write (either a list of documents or a Stream).

TYPE: Union[Any, Stream]

path

Path to the directory where the BRAT files will be written.

TYPE: Union[str, Path]

span_getter

The span getter to use when listing the spans that will be exported as BRAT entities. Defaults to getting the spans in the ents attribute.

span_attributes

Mapping from BRAT attributes to Span extensions. By default, no attribute will be exported.

overwrite

Whether to overwrite existing directories.

TYPE: bool DEFAULT: False

filesystem

The filesystem to use to write the files. If None, the filesystem will be inferred from the path (e.g. s3:// will use S3).

TYPE: Optional[FileSystem] DEFAULT: None

execute

Whether to execute the writing operation immediately or to return a stream.

TYPE: bool DEFAULT: True

converter

Converter to use to convert the documents to dictionary objects. Defaults to the "standoff" format converter.

TYPE: Optional[Union[str, Callable]] DEFAULT: 'standoff'