JSON

TLDR
import edsnlp

stream = edsnlp.data.read_json(path, converter="omop")
stream = stream.map_pipeline(nlp)
res = stream.write_json(path, converter="omop")
# or equivalently
edsnlp.data.write_json(stream, path, converter="omop")

We provide methods to read and write documents (raw or annotated) from and to JSON files.

As an example, imagine that we have the following documents, which use the OMOP schema:

data.jsonl
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...

You could also have multiple .json files in a directory; the reader will read them all.
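
To try the reader on a small sample, a file shaped like data.jsonl can be produced with the standard library alone. This is only a minimal sketch: the helper name write_sample_jsonl and the temporary directory are illustrative, and the records reuse the fields shown above with empty entities lists.

```python
import json
import tempfile
from pathlib import Path


def write_sample_jsonl(path):
    """Write two OMOP-style notes, one JSON object per line (hypothetical helper)."""
    notes = [
        {"note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": []},
        {"note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": []},
    ]
    with open(path, "w", encoding="utf-8") as f:
        for note in notes:
            # ensure_ascii=False keeps accented characters readable in the file
            f.write(json.dumps(note, ensure_ascii=False) + "\n")
    return notes


sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / "data.jsonl"
written = write_sample_jsonl(sample_path)

# Read the file back line by line to check that it is valid JSONL
with open(sample_path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f if line.strip()]
```

The resulting directory (sample_dir here) or file can then be passed as the path argument of edsnlp.data.read_json.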

Reading JSON files

The JsonReader (or edsnlp.data.read_json) reads a directory of JSON files and yields documents. At the moment, only entities and attributes are loaded.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_json("path/to/json/dir", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_json returns a Stream. To iterate over the documents multiple times efficiently, or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.read_json("path/to/json/dir", converter="omop"))
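
The same trade-off exists for any lazily evaluated source. The sketch below uses a plain Python generator over in-memory JSONL lines as a stand-in (it is not edsnlp's Stream, and iter_notes is a hypothetical helper): a lazy source cannot be indexed, whereas a list can be indexed and traversed repeatedly without re-parsing.

```python
import json

JSONL_LINES = [
    '{"note_id": 0, "note_text": "Le patient ..."}',
    '{"note_id": 1, "note_text": "Autre doc ..."}',
]


def iter_notes(lines):
    """Lazily parse one JSON object per line (hypothetical stand-in for a stream)."""
    for line in lines:
        yield json.loads(line)


lazy = iter_notes(JSONL_LINES)
try:
    lazy[0]  # generators do not support indexing
    indexable = True
except TypeError:
    indexable = False

# Materialising once gives random access and cheap repeated iteration
notes = list(iter_notes(JSONL_LINES))
first_id = notes[0]["note_id"]
ids_pass_1 = [n["note_id"] for n in notes]
ids_pass_2 = [n["note_id"] for n in notes]
```

The cost of materialising is memory: every item is held at once, so this suits datasets that fit in RAM.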

Parameters

PARAMETER DESCRIPTION
path

Path to the directory containing the JSON files (will recursively look for files in subdirectories).

TYPE: Union[str, Path]

keep_ipynb_checkpoints

Whether to keep files that have ".ipynb_checkpoints" in their path.

TYPE: bool DEFAULT: False

filesystem

The filesystem to use to read the files. If None, the filesystem will be inferred from the path (e.g. s3:// will use S3).

TYPE: Optional[FileSystem] DEFAULT: None

shuffle

Whether to shuffle the data. If "dataset", the whole dataset will be shuffled before starting iterating on it (at the start of every epoch if looping).

TYPE: Literal['dataset', False] DEFAULT: False

seed

The seed to use for shuffling.

TYPE: int DEFAULT: 42

loop

Whether to loop over the data indefinitely.

TYPE: bool DEFAULT: False

converter

Converters to use to convert the JSON objects to Doc objects. These are documented on the Converters page.

TYPE: Optional[AsList[Union[str, Callable]]] DEFAULT: None

kwargs

Additional keyword arguments to pass to the converter. These are documented on the Converters page.

DEFAULT: {}

RETURNS DESCRIPTION
Stream
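
As an illustration of the path and keep_ipynb_checkpoints parameters, here is a rough sketch of recursive file discovery written with pathlib only. It is not edsnlp's implementation: find_json_files is a hypothetical helper, and matching on the .json and .jsonl suffixes is an assumption.

```python
import tempfile
from pathlib import Path


def find_json_files(root, keep_ipynb_checkpoints=False):
    """Recursively list JSON/JSONL files under root (hypothetical helper)."""
    files = []
    for p in sorted(Path(root).rglob("*")):
        # Assumption: only these two suffixes count as JSON inputs
        if not p.is_file() or p.suffix not in {".json", ".jsonl"}:
            continue
        # Jupyter stores autosaved copies in ".ipynb_checkpoints" directories
        if not keep_ipynb_checkpoints and ".ipynb_checkpoints" in p.parts:
            continue
        files.append(p)
    return files


# Build a small directory tree to exercise the helper
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / ".ipynb_checkpoints").mkdir()
(root / "a.json").write_text("{}")
(root / "sub" / "b.json").write_text("{}")
(root / ".ipynb_checkpoints" / "a-checkpoint.json").write_text("{}")

default_names = [p.name for p in find_json_files(root)]
all_names = [p.name for p in find_json_files(root, keep_ipynb_checkpoints=True)]
```

With the default setting, the checkpoint copy is skipped while the file in the subdirectory is still found.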

Writing JSON files

edsnlp.data.write_json writes a list of documents in JSON format. If lines is false, each document is stored in its own JSON file within a directory; the file is named after the FILENAME field returned by the converter (commonly the note_id attribute of the document), and subdirectories are created if the name contains / characters.
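
To make the one-file-per-document layout concrete, the sketch below mimics it with the standard library. It is an illustration rather than edsnlp's writer: write_one_file_per_doc is a hypothetical helper, and using note_id as the file name and appending a .json extension are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_one_file_per_doc(records, out_dir):
    """Write each record to <out_dir>/<note_id>.json (hypothetical helper)."""
    out_dir = Path(out_dir)
    written = []
    for record in records:
        # A "/" in the name maps to a subdirectory, created on demand
        target = out_dir / f"{record['note_id']}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
        written.append(target.relative_to(out_dir).as_posix())
    return written


out_dir = Path(tempfile.mkdtemp())
records = [
    {"note_id": "doc-0", "note_text": "Le patient ..."},
    {"note_id": "ward-a/doc-1", "note_text": "Autre doc ..."},
]
paths = write_one_file_per_doc(records, out_dir)
```

Here the second record lands in a ward-a subdirectory because its identifier contains a / character.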

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_json([doc], "path/to/json/file", converter="omop", lines=True)
# or to write a directory of JSON files, ensure that each doc has a doc._.note_id
# attribute, since this will be used as a filename:
edsnlp.data.write_json([doc], "path/to/json/dir", converter="omop", lines=False)

Overwriting files

By default, write_json will raise an error if the directory already exists and contains files with .a* or .txt suffixes. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.

Parameters

PARAMETER DESCRIPTION
data

The data to write (either a list of documents or a Stream).

TYPE: Union[Any, Stream]

path

Path to either:
- a file, if lines is true: this will write the documents as a JSONL file;
- a directory, if lines is false: this will write one JSON file per document, using the FILENAME field returned by the converter (commonly the note_id attribute of the documents) as the filename.

TYPE: Union[str, Path]

lines

Whether to write the documents as a JSONL file or as a directory of JSON files. By default, this is inferred from the path: if the path is a file, lines is assumed to be true, otherwise it is assumed to be false.

TYPE: bool DEFAULT: None

overwrite

Whether to overwrite existing directories.

TYPE: bool DEFAULT: False

execute

Whether to execute the writing operation immediately or to return a stream.

TYPE: bool DEFAULT: True

converter

Converter to use to convert the documents to dictionary objects before writing them. These are documented on the Converters page.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

filesystem

The filesystem to use to write the files. If None, the filesystem will be inferred from the path (e.g. s3:// will use S3).

TYPE: Optional[FileSystem] DEFAULT: None

kwargs

Additional keyword arguments to pass to the converter. These are documented on the Converters page.

DEFAULT: {}