JSON
TLDR
import edsnlp
stream = edsnlp.data.read_json(path, converter="omop")
stream = stream.map_pipeline(nlp)
res = stream.write_json(path, converter="omop")
# or equivalently
edsnlp.data.write_json(stream, path, converter="omop")
We provide methods to read and write documents (raw or annotated) from and to JSON files.
As an example, imagine that we have the following documents, which use the OMOP schema:
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...
You could also have multiple .json files in a directory; the reader will read them all.
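To make the two layouts concrete, here is a minimal sketch that uses only the Python standard library to create sample inputs: one file with a JSON object per line, and one directory with a .json file per document. The record values, the file names and the temporary directory are illustrative assumptions and are not part of EDS-NLP.

```python
import json
import tempfile
from pathlib import Path

# Illustrative records using the OMOP-style fields shown above
# (entities are left empty here for brevity).
records = [
    {"note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": []},
    {"note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": []},
]

root = Path(tempfile.mkdtemp())

# Layout 1: a single file with one JSON object per line.
jsonl_path = root / "notes.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Layout 2: a directory holding one .json file per document.
json_dir = root / "notes"
json_dir.mkdir()
for record in records:
    out = json_dir / f"{record['note_id']}.json"
    out.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")

print(sorted(p.name for p in json_dir.iterdir()))  # ['0.json', '1.json']
```

Either location could then be used as the path argument when reading.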
Reading JSON files
The JsonReader (or edsnlp.data.read_json) reads a directory of JSON files and yields documents. At the moment, only entities and attributes are loaded.
Example
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_json("path/to/json/dir", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)
Generator vs list
edsnlp.data.read_json returns a Stream. To iterate over the documents multiple times efficiently or to access them by index, you must convert it to a list:
docs = list(edsnlp.data.read_json("path/to/json/dir", converter="omop"))
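The difference between a lazy collection and a list can be illustrated with plain Python. In this sketch, LazyDocs is a hypothetical stand-in for a stream (it is not EDS-NLP's Stream class): every pass recomputes its items and it does not support indexing, whereas a list is computed once and can be indexed.

```python
class LazyDocs:
    """Hypothetical stand-in for a lazy stream: items are rebuilt on every pass."""

    def __init__(self, texts):
        self.texts = texts
        self.reads = 0  # counts how many items have been produced so far

    def __iter__(self):
        for text in self.texts:
            self.reads += 1  # simulate the cost of reading/converting a document
            yield text.upper()


lazy = LazyDocs(["doc a", "doc b"])

# Two passes over the lazy object do the work twice.
first_pass = list(lazy)
second_pass = list(lazy)
print(lazy.reads)  # 4

# Materialising once gives a list that can be indexed and reused at no extra cost.
docs = list(lazy)
print(docs[0], lazy.reads)  # DOC A 6
```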
Parameters
PARAMETER | DESCRIPTION |
---|---|
path | Path to the directory containing the JSON files (subdirectories are searched recursively). |
keep_ipynb_checkpoints | Whether to keep files that have ".ipynb_checkpoints" in their path. |
filesystem | The filesystem to use to read the files. If None, the filesystem will be inferred from the path. |
shuffle | Whether to shuffle the data. If "dataset", the whole dataset will be shuffled before iteration starts (at the start of every epoch if looping). |
seed | The seed to use for shuffling. |
loop | Whether to loop over the data indefinitely. |
converter | Converter(s) to use to convert the JSON objects to Doc objects. These are documented on the Converters page. |
kwargs | Additional keyword arguments to pass to the converter. These are documented on the Converters page. |
RETURNS | DESCRIPTION |
---|---|
Stream | |
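As a rough illustration of what the path, keep_ipynb_checkpoints, shuffle and seed parameters describe, the following standard-library sketch lists JSON files recursively, skips anything under ".ipynb_checkpoints", and applies a seeded shuffle. The function list_json_files is hypothetical, its shuffle flag is simplified to a boolean, and it is not how EDS-NLP implements its reader.

```python
import random
import tempfile
from pathlib import Path


def list_json_files(path, keep_ipynb_checkpoints=False, shuffle=False, seed=None):
    """Hypothetical helper: recursively collect .json files under `path`."""
    files = sorted(Path(path).rglob("*.json"))
    if not keep_ipynb_checkpoints:
        # Drop files that have ".ipynb_checkpoints" anywhere in their path.
        files = [f for f in files if ".ipynb_checkpoints" not in f.parts]
    if shuffle:
        # A seeded generator makes the shuffled order reproducible.
        random.Random(seed).shuffle(files)
    return files


# Build a small throwaway tree to try the helper on.
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / ".ipynb_checkpoints").mkdir()
for rel in ["a.json", "sub/b.json", ".ipynb_checkpoints/a-checkpoint.json"]:
    (root / rel).write_text("{}", encoding="utf-8")

found = list_json_files(root)
print([f.relative_to(root).as_posix() for f in found])  # ['a.json', 'sub/b.json']
```

Calling the helper twice with the same seed returns the files in the same shuffled order, which is the point of exposing a seed.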
Writing JSON files
edsnlp.data.write_json writes a list of documents using the JSON format in a directory. If lines is false, each document will be stored in its own JSON file, named after the FILENAME field returned by the converter (commonly the note_id attribute of the documents), and subdirectories will be created if the name contains / characters.
Example
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc = nlp("My document with entities")
edsnlp.data.write_json([doc], "path/to/json/file", converter="omop", lines=True)
# or to write a directory of JSON files, ensure that each doc has a doc._.note_id
# attribute, since this will be used as a filename:
edsnlp.data.write_json([doc], "path/to/json/dir", converter="omop", lines=False)
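The two output layouts can also be sketched with the standard library alone. In this illustration, write_records is a hypothetical helper operating on plain dictionaries (standing in for what a converter might produce); it uses the note_id value as the file name and creates subdirectories when the name contains "/". It is a sketch under these assumptions, not EDS-NLP's writer.

```python
import json
import tempfile
from pathlib import Path


def write_records(records, path, lines):
    """Hypothetical helper mirroring the two layouts described above."""
    path = Path(path)
    if lines:
        # One JSONL file: a JSON object per line.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
    else:
        # One JSON file per record, named after its note_id.
        for record in records:
            out = path / f"{record['note_id']}.json"
            # A "/" in the name results in nested subdirectories.
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")


records = [
    {"note_id": "2021/doc-0", "note_text": "Le patient ..."},
    {"note_id": "doc-1", "note_text": "Autre doc ..."},
]
root = Path(tempfile.mkdtemp())
write_records(records, root / "notes.jsonl", lines=True)
write_records(records, root / "notes", lines=False)
print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*.json")))
# ['notes/2021/doc-0.json', 'notes/doc-1.json']
```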
Overwriting files
By default, write_json will raise an error if the directory already exists and contains files with .a* or .txt suffixes. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
Parameters
PARAMETER | DESCRIPTION |
---|---|
data | The data to write (either a list of documents or a Stream). |
path | Path to either a file, if lines is true (the documents are written as a single JSONL file), or a directory, if lines is false (one JSON file is written per document). |
lines | Whether to write the documents as a JSONL file or as a directory of JSON files. By default, this is inferred from the path: if the path is a file, lines is assumed to be true; otherwise it is assumed to be false. |
overwrite | Whether to overwrite existing directories. |
execute | Whether to execute the writing operation immediately or to return a stream. |
converter | Converter to use to convert the documents to dictionary objects before writing them. These are documented on the Converters page. |
filesystem | The filesystem to use to write the files. If None, the filesystem will be inferred from the path. |
kwargs | Additional keyword arguments to pass to the converter. These are documented on the Converters page. |