JSON

TLDR
import edsnlp

# `nlp` is an existing edsnlp pipeline
iterator = edsnlp.data.read_json("path/to/json/dir", converter="omop")
docs = nlp.pipe(iterator)
edsnlp.data.write_json(docs, "path/to/output/dir", converter="omop")

We provide methods to read and write documents (raw or annotated) from and to JSON files.

As an example, imagine that we have the following documents, stored one per line and following the OMOP schema:

data.jsonl
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...

You could also have multiple .json files in a directory; the reader will read them all.
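
To experiment with the reader, a file like data.jsonl can be produced with the standard library alone. The sketch below is only an illustration: the field names come from the example above, while the `write_jsonl` helper and the empty `entities` lists are assumptions made for this example and are not part of edsnlp.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps accented characters readable
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical records; "entities" is left empty for simplicity
records = [
    {"note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": []},
    {"note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": []},
]

out_dir = Path(tempfile.mkdtemp())
jsonl_path = out_dir / "data.jsonl"
write_jsonl(records, jsonl_path)

# Reading it back: each non-empty line parses as one record
lines = jsonl_path.read_text(encoding="utf-8").splitlines()
loaded = [json.loads(line) for line in lines if line]
```

The directory held in `out_dir` could then be passed to the reader in place of "path/to/json/dir".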

Reading JSON files

The JsonReader (or edsnlp.data.read_json) reads a directory of JSON files and yields documents. At the moment, only entities and attributes are loaded.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_json("path/to/json/dir", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_json returns a LazyCollection. To iterate over the documents multiple times efficiently or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.read_json("path/to/json/dir", converter="omop"))
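
The difference can be illustrated without edsnlp. In the sketch below, `LazyRecords` is a hypothetical stand-in for a lazy collection (it is not the edsnlp LazyCollection class): every pass over it reads the source file again and it offers no indexing, whereas a list reads the file once and keeps the records in memory.

```python
import json
import tempfile
from pathlib import Path


class LazyRecords:
    """Hypothetical lazy reader: parses the file again on every iteration."""

    def __init__(self, path):
        self.path = Path(path)
        self.passes = 0  # number of times the file has been read

    def __iter__(self):
        self.passes += 1
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


path = Path(tempfile.mkdtemp()) / "data.jsonl"
path.write_text('{"note_id": 0}\n{"note_id": 1}\n', encoding="utf-8")

lazy = LazyRecords(path)
first_ids = [r["note_id"] for r in lazy]   # first pass reads the file
second_ids = [r["note_id"] for r in lazy]  # second pass reads it again

records = list(LazyRecords(path))  # read once, then reuse freely
second_record = records[1]         # lists support indexing
```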

Parameters

PARAMETER DESCRIPTION
path

Path to the directory containing the JSON files (will recursively look for files in subdirectories).

TYPE: Union[str, Path]

converter

Converter to use to convert the JSON rows of the data source to Doc objects.

TYPE: Union[str, Callable]

keep_ipynb_checkpoints

Whether to keep files that have ".ipynb_checkpoints" in their path.

TYPE: bool DEFAULT: False

read_in_worker

Whether to read the files in the worker or in the main process.

TYPE: bool DEFAULT: False

kwargs

Additional keyword arguments to pass to the converter. These are documented on the Data schemas page.

DEFAULT: {}

RETURNS DESCRIPTION
LazyCollection
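
The path and keep_ipynb_checkpoints parameters together describe a recursive file search. A rough sketch of that behavior using only pathlib follows. `find_json_files` is a hypothetical helper written for illustration; the set of accepted suffixes is an assumption, and the actual reader may select files differently.

```python
import tempfile
from pathlib import Path


def find_json_files(root, keep_ipynb_checkpoints=False):
    """Recursively list .json and .jsonl files under root."""
    files = []
    for p in sorted(Path(root).rglob("*")):
        # Assumption: only these two suffixes count as JSON files
        if not p.is_file() or p.suffix not in (".json", ".jsonl"):
            continue
        # Skip Jupyter autosave copies unless explicitly requested
        if not keep_ipynb_checkpoints and ".ipynb_checkpoints" in p.parts:
            continue
        files.append(p)
    return files


# Small directory tree to exercise the helper
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / ".ipynb_checkpoints").mkdir()
(root / "a.json").write_text("{}", encoding="utf-8")
(root / "sub" / "b.json").write_text("{}", encoding="utf-8")
(root / ".ipynb_checkpoints" / "a-checkpoint.json").write_text("{}", encoding="utf-8")

found = [p.relative_to(root).as_posix() for p in find_json_files(root)]
found_all = find_json_files(root, keep_ipynb_checkpoints=True)
```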

Writing JSON files

edsnlp.data.write_json writes a list of documents using the JSON format in a directory. If lines is false, each document will be stored in its own JSON file, named after the FILENAME field returned by the converter (commonly the note_id attribute of the documents), and subdirectories will be created if the name contains / characters.
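
As an illustration of the one-file-per-document layout described above, the sketch below writes each record to its own file with the standard library. It is not the edsnlp implementation: `write_one_file_per_doc` is a hypothetical helper, and it assumes each record is a plain dict carrying its file name under a "FILENAME" key, which is removed before writing.

```python
import json
import tempfile
from pathlib import Path


def write_one_file_per_doc(records, out_dir):
    """Write each record to <out_dir>/<FILENAME>.json, creating subdirectories."""
    out_dir = Path(out_dir)
    written = []
    for record in records:
        record = dict(record)  # copy so the caller's dict is not modified
        name = str(record.pop("FILENAME"))
        target = out_dir / f"{name}.json"
        # A "/" in the name becomes a subdirectory
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
        written.append(target)
    return written


out_dir = Path(tempfile.mkdtemp())
paths = write_one_file_per_doc(
    [
        {"FILENAME": "0", "note_id": 0, "note_text": "Le patient ..."},
        {"FILENAME": "ward-a/1", "note_id": 1, "note_text": "Autre doc ..."},
    ],
    out_dir,
)
relative = [p.relative_to(out_dir).as_posix() for p in paths]
```

Here the second record lands in a `ward-a` subdirectory because its name contains a / character.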

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_json([doc], "path/to/json/dir", converter="omop")

Overwriting files

By default, write_json will raise an error if the directory already exists and contains files with .a* or .txt suffixes. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.

Parameters

PARAMETER DESCRIPTION
data

The data to write (either a list of documents or a LazyCollection).

TYPE: Union[Any, LazyCollection]

path

Path to the directory in which the JSON files will be written.

TYPE: Union[str, Path]

lines

Whether to write the documents as a JSONL file (default).

TYPE: bool DEFAULT: True

overwrite

Whether to overwrite existing directories.

TYPE: bool DEFAULT: False

converter

Converter to use to convert the documents to dictionary objects before writing them.

TYPE: Optional[Union[str, Callable]]

kwargs

Additional keyword arguments to pass to the converter. These are documented on the Data schemas page.

DEFAULT: {}