BRAT and Standoff
TLDR
import edsnlp
stream = edsnlp.data.read_standoff(path)
stream = stream.map_pipeline(nlp)
res = stream.write_standoff(path)
# or equivalently
edsnlp.data.write_standoff(stream, path)
You can easily integrate BRAT into your project by using EDS-NLP's BRAT reader and writer.
BRAT annotations are in the standoff format. Consider the following document:
Le patient est admis pour une pneumopathie au coronavirus.
On lui prescrit du paracétamol.
BRAT annotations are stored in a separate file formatted as follows:
T1 Patient 4 11 patient
T2 Disease 31 58 pneumopathie au coronavirus
T3 Drug 79 90 paracétamol
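Each line above is a text-bound annotation: an identifier, a label, a start and an end character offset, and the covered text. As a rough illustration of that layout (not EDS-NLP's own parser), the sketch below reads such lines with the standard library only. `TextBound` and `parse_text_bound` are hypothetical names introduced here, and the sketch assumes contiguous spans, ignoring attribute lines and discontinuous fragments.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    """One text-bound ("T") annotation (hypothetical helper type)."""

    id: str
    label: str
    start: int
    end: int
    text: str


def parse_text_bound(line: str) -> TextBound:
    # Split on any whitespace (tabs or spaces). Limiting the split to four
    # cuts keeps the covered text intact even when it contains spaces.
    ann_id, label, start, end, text = line.strip().split(maxsplit=4)
    return TextBound(ann_id, label, int(start), int(end), text)


ann_lines = [
    "T1 Patient 4 11 patient",
    "T2 Disease 31 58 pneumopathie au coronavirus",
    "T3 Drug 79 90 paracétamol",
]
annotations = [parse_text_bound(line) for line in ann_lines]
for ann in annotations:
    print(ann.id, ann.label, ann.start, ann.end, repr(ann.text))
```

In practice `edsnlp.data.read_standoff` does this work for you; the sketch is only meant to make the file layout concrete.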
Reading Standoff files
The BratReader (or edsnlp.data.read_standoff) reads a directory of BRAT files and yields documents. At the moment, only entities and attributes are loaded. Relations and events are not supported.
Example
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_standoff("path/to/brat/directory")
annotated_docs = nlp.pipe(doc_iterator)
Generator vs list
edsnlp.data.read_standoff returns a Stream. To iterate over the documents multiple times efficiently, or to access them by index, you must convert it to a list:
docs = list(edsnlp.data.read_standoff("path/to/brat/directory"))
True/False attributes
Boolean values are not supported by the BRAT editor: a true value is stored as a key with an empty value, and a false value is not stored at all. As a result, False is never assigned to an attribute by default, which is a problem when deciding whether an entity is negated: is the entity not negated, or was the negation attribute simply not annotated?
To avoid this ambiguity, use the bool_attributes argument to specify which attributes should be treated as boolean when reading a BRAT dataset. These attributes are set to True if they are present, and False otherwise.
doc_iterator = edsnlp.data.read_standoff(
    "path/to/brat/directory",
    span_attributes=["negation", "family"],
    bool_attributes=["negation"],  # Missing values will be set to False
)
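To make the presence rule concrete, here is a minimal standalone sketch of the behaviour described above. It is not EDS-NLP code: `resolve_bool_attributes` is a hypothetical helper, and the attributes of one entity are assumed to be given as a plain dictionary in which a key with a `None` value stands for BRAT's "key with empty value".

```python
from typing import Dict, Iterable, Optional


def resolve_bool_attributes(
    raw_attributes: Dict[str, Optional[str]],
    bool_attributes: Iterable[str],
) -> Dict[str, object]:
    """Hypothetical helper mimicking the documented rule."""
    resolved: Dict[str, object] = dict(raw_attributes)
    for name in bool_attributes:
        # Present in the annotation -> True, absent -> False.
        resolved[name] = name in raw_attributes
    return resolved


# An entity annotated as negated: the key is stored with an empty value.
negated = resolve_bool_attributes({"negation": None}, ["negation"])
# An entity with no negation annotation at all.
plain = resolve_bool_attributes({}, ["negation"])
print(negated, plain)
```

Attributes that are not listed as boolean are left untouched by this sketch, so a missing non-boolean attribute stays missing rather than becoming False.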
Parameters
PARAMETER | DESCRIPTION |
---|---|
path | Path to the directory containing the BRAT files (will recursively look for files in subdirectories). TYPE: |
shuffle | Whether to shuffle the data. If "dataset", the whole dataset will be shuffled before starting iterating on it (at the start of every epoch if looping). TYPE: |
seed | The seed to use for shuffling. TYPE: |
loop | Whether to loop over the data indefinitely. TYPE: |
nlp | The pipeline object (optional and likely not needed, prefer to use the TYPE: |
tokenizer | The tokenizer instance used to tokenize the documents. Likely not needed, since by default it uses the current context tokenizer. TYPE: |
span_setter | The span setter to use when setting the spans in the documents. Defaults to setting the spans in the TYPE: |
span_attributes | Mapping from BRAT attributes to Span extensions (can be a list too). By default, all attributes are imported as Span extensions with the same name. TYPE: |
keep_raw_attribute_values | Whether to keep the raw attribute values (as strings) or to convert them to Python objects (e.g. booleans). TYPE: |
default_attributes | How to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attributes values (e.g. "negated" is often False, "temporal" is often "present"), that annotators may not want to annotate every time. TYPE: |
notes_as_span_attribute | If set, the AnnotatorNote annotations will be concatenated and stored in a span attribute with this name. TYPE: |
split_fragments | Whether to split the fragments into separate spans or not. If set to False, the fragments will be concatenated into a single span. TYPE: |
keep_ipynb_checkpoints | Whether to keep the files that are in the TYPE: |
keep_txt_only_docs | Whether to keep the TYPE: |
converter | Converter to use to convert the documents to dictionary objects. TYPE: |
filesystem | The filesystem to use to read the files. If None, the filesystem will be inferred from the path (e.g. TYPE: |
RETURNS | DESCRIPTION |
---|---|
Stream | |
Writing Standoff files
edsnlp.data.write_standoff writes a list of documents using the BRAT/Standoff format in a directory. The BRAT files will be named after the note_id attribute of the documents, and subdirectories will be created if the name contains / characters.
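The naming rule can be pictured with a small path computation. The sketch below is an illustration only, not the library's implementation: `standoff_paths` is a hypothetical helper, and it assumes each document produces a `.txt` file and a `.ann` file (the usual BRAT annotation extension) sharing the same stem.

```python
from pathlib import PurePosixPath
from typing import Tuple


def standoff_paths(root: str, note_id: str) -> Tuple[PurePosixPath, PurePosixPath]:
    """Hypothetical helper: derive the .txt and .ann paths for one document."""
    base = PurePosixPath(root) / note_id
    # Append the suffix instead of using with_suffix(), so that a note_id
    # containing a dot (e.g. "report.v2") is not truncated.
    txt_path = base.with_name(base.name + ".txt")
    ann_path = base.with_name(base.name + ".ann")
    return txt_path, ann_path


txt_path, ann_path = standoff_paths("path/to/brat/directory", "service-a/doc-1")
print(txt_path)
print(ann_path)
```

Here the `/` in the note_id places both files in a `service-a` subdirectory of the output directory.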
Example
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc = nlp("My document with entities")
edsnlp.data.write_standoff([doc], "path/to/brat/directory")
Overwriting files
By default, write_standoff will raise an error if the directory already exists and contains files with .a* or .txt suffixes. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
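If you want to check a target directory yourself before writing, a test along the following lines may help. This is a sketch under assumptions, not the check EDS-NLP performs: `has_standoff_files` is a hypothetical helper, and whether the library inspects subdirectories is not stated here, so the recursive search is a choice made for this example.

```python
import tempfile
from pathlib import Path


def has_standoff_files(directory: Path) -> bool:
    """Hypothetical check: does the directory hold any .txt or .a* files?"""
    if not directory.exists():
        return False
    return any(
        p.is_file() and (p.suffix == ".txt" or p.suffix.startswith(".a"))
        for p in directory.rglob("*")
    )


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "brat"
    missing = has_standoff_files(out_dir)  # directory not created yet
    out_dir.mkdir()
    (out_dir / "notes.md").write_text("unrelated file")
    unrelated_only = has_standoff_files(out_dir)
    (out_dir / "doc-1.ann").write_text("")
    with_annotations = has_standoff_files(out_dir)

print(missing, unrelated_only, with_annotations)
```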
Parameters
PARAMETER | DESCRIPTION |
---|---|
data | The data to write (either a list of documents or a Stream). TYPE: |
path | Path to the directory in which the BRAT files will be written. TYPE: |
span_getter | The span getter to use when listing the spans that will be exported as BRAT entities. Defaults to getting the spans in the |
span_attributes | Mapping from BRAT attributes to Span extensions. By default, no attribute will be exported. |
overwrite | Whether to overwrite existing directories. TYPE: |
filesystem | The filesystem to use to write the files. If None, the filesystem will be inferred from the path (e.g. TYPE: |
execute | Whether to execute the writing operation immediately or to return a stream. TYPE: |
converter | Converter to use to convert the documents to dictionary objects. Defaults to the "standoff" format converter. TYPE: |