JSON

TLDR

import edsnlp

stream = edsnlp.data.read_json(path, converter="omop")
stream = stream.map_pipeline(nlp)
res = stream.write_json(path, converter="omop")
# or equivalently
edsnlp.data.write_json(stream, path, converter="omop")

We provide methods to read and write documents (raw or annotated) from and to JSON files.

As an example, imagine that we have the following documents, which use the OMOP schema:

data.jsonl

{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...

You could also have multiple .json files in a directory; the reader will read them all.
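To try the reader on a file of this shape, the data.jsonl sample above can be generated with the standard library alone. This is a minimal sketch: the `write_jsonl` and `read_jsonl` helpers and the temporary path are hypothetical, and the `entities` lists are left empty because the sample elides their contents.

```python
import json
import tempfile
from pathlib import Path

# Two OMOP-style notes with the same fields as the sample above.
# Entities are left empty: their structure is not shown in the sample.
notes = [
    {"note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": []},
    {"note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": []},
]


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper, not part of edsnlp)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_path = write_jsonl(notes, Path(tempfile.mkdtemp()) / "data.jsonl")
```

The resulting file, or the directory that contains it, can then be passed as the `path` argument of the reader.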

Reading JSON files

The JsonReader (or edsnlp.data.read_json) reads a directory of JSON files and yields documents. At the moment, only entities and attributes are loaded.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_json("path/to/json/dir", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_json returns a Stream. To iterate over the documents multiple times efficiently or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.read_json("path/to/json/dir", converter="omop"))
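Why a list helps can be illustrated without edsnlp. The sketch below uses a hypothetical `LazyJsonl` class as a stand-in for a lazy source that re-reads its file on every pass. It illustrates the general pattern under that assumption and does not describe how `Stream` is implemented.

```python
import json
import tempfile
from pathlib import Path


class LazyJsonl:
    """Hypothetical lazy source: re-reads the file on every pass."""

    def __init__(self, path):
        self.path = Path(path)
        self.passes = 0  # number of times the file has been read

    def __iter__(self):
        self.passes += 1
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


# Build a small JSONL file to iterate over.
path = Path(tempfile.mkdtemp()) / "data.jsonl"
path.write_text(
    "\n".join(json.dumps({"note_id": i, "note_text": f"doc {i}"}) for i in range(3)) + "\n",
    encoding="utf-8",
)

lazy = LazyJsonl(path)
ids_first = [d["note_id"] for d in lazy]   # first pass reads the file
ids_second = [d["note_id"] for d in lazy]  # second pass reads it again

docs = list(lazy)     # materialize once (a third read)
second_doc = docs[1]  # indexing now works
ids_from_list = [d["note_id"] for d in docs]  # no further reads
```

The trade-off is memory: the list holds every document at once, so it only suits collections that fit in RAM.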

Parameters

PARAMETER	DESCRIPTION
`path`	Path to the directory containing the JSON files (will recursively look for files in subdirectories). TYPE: `Union[str, Path]`
`keep_ipynb_checkpoints`	Whether to keep files that have ".ipynb_checkpoints" in their path. TYPE: `bool` DEFAULT: `False`
`filesystem`	The filesystem to use to read the files. If None, the filesystem will be inferred from the path (e.g. `s3://` will use S3). TYPE: `Optional[FileSystem]` DEFAULT: `None`
`shuffle`	Whether to shuffle the data. If "dataset", the whole dataset will be shuffled before starting iterating on it (at the start of every epoch if looping). TYPE: `Literal['dataset', False]` DEFAULT: `False`
`seed`	The seed to use for shuffling. TYPE: `int` DEFAULT: `42`
`loop`	Whether to loop over the data indefinitely. TYPE: `bool` DEFAULT: `False`
`converter`	Converters to use to convert the JSON objects to Doc objects. These are documented on the Converters page. TYPE: `Optional[AsList[Union[str, Callable]]]` DEFAULT: `None`
`kwargs`	Additional keyword arguments to pass to the converter. These are documented on the Converters page. DEFAULT: `{}`

RETURNS	DESCRIPTION
`Stream`
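The interaction between `shuffle`, `seed` and `loop` can be sketched in plain Python. The `iterate` function below is hypothetical and is not edsnlp's implementation. It assumes that a single seeded random generator is reused across epochs, so the same seed reproduces the same sequence of orders.

```python
import itertools
import random


def iterate(records, shuffle=False, seed=42, loop=False):
    """Yield records, reshuffling the whole dataset at the start of each epoch."""
    rng = random.Random(seed)
    while True:
        epoch = list(records)
        if shuffle == "dataset":
            rng.shuffle(epoch)  # whole-dataset shuffle before iterating
        yield from epoch
        if not loop:
            break


records = list(range(5))
unshuffled = list(iterate(records))
one_epoch = list(iterate(records, shuffle="dataset", seed=42))
same_seed = list(iterate(records, shuffle="dataset", seed=42))

# With loop=True the generator never ends, so take a bounded slice.
two_epochs = list(itertools.islice(iterate(records, shuffle="dataset", loop=True), 10))
```

Each epoch in `two_epochs` is a full permutation of the dataset, which is what distinguishes a dataset-level shuffle from sampling with replacement.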

Writing JSON files

edsnlp.data.write_json writes a list of documents in JSON format, either to a single JSONL file or to a directory. If lines is false, each document will be stored in its own JSON file, named after the FILENAME field returned by the converter (commonly the note_id attribute of the documents), and subdirectories will be created if the name contains / characters.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_json([doc], "path/to/json/file", converter="omop", lines=True)
# or to write a directory of JSON files, ensure that each doc has a doc._.note_id
# attribute, since this will be used as a filename:
edsnlp.data.write_json([doc], "path/to/json/dir", converter="omop", lines=False)
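The one-file-per-document layout can be pictured with a standard-library sketch. The `write_json_dir` helper is hypothetical and is not edsnlp's writer. It assumes that files are named `<filename>.json` and that the filename comes from a `note_id` key.

```python
import json
import tempfile
from pathlib import Path


def write_json_dir(records, directory, filename_key="note_id"):
    """Write one JSON file per record (hypothetical helper, not edsnlp's writer).

    Assumes each file is named "<filename>.json"; a "/" in the name
    creates the corresponding subdirectories.
    """
    directory = Path(directory)
    written = []
    for record in records:
        target = directory / f"{record[filename_key]}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
        written.append(target.relative_to(directory).as_posix())
    return written


records = [
    {"note_id": "doc-0", "note_text": "Le patient ..."},
    {"note_id": "ward-a/doc-1", "note_text": "Autre doc ..."},  # "/" in the name
]
out_dir = Path(tempfile.mkdtemp())
written = write_json_dir(records, out_dir)
```

Because the name doubles as a path, identifiers that contain a `/` end up grouped under a subdirectory rather than in the top-level directory.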

Overwriting files

By default, write_json will raise an error if the directory already exists and contains files with a .json suffix. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.

Parameters

PARAMETER	DESCRIPTION
`data`	The data to write (either a list of documents or a Stream). TYPE: `Union[Any, Stream]`
`path`	Path to either a file, if `lines` is true (the documents are written as a single JSONL file), or a directory, if `lines` is false (one JSON file is written per document, using the FILENAME field returned by the converter, commonly the `note_id` attribute of the documents, as the filename). TYPE: `Union[str, Path]`
`lines`	Whether to write the documents as a JSONL file or as a directory of JSON files. By default, this is inferred from the path: if the path is a file, lines is assumed to be true, otherwise it is assumed to be false. TYPE: `bool` DEFAULT: `None`
`overwrite`	Whether to overwrite existing directories. TYPE: `bool` DEFAULT: `False`
`execute`	Whether to execute the writing operation immediately or to return a stream. TYPE: `bool` DEFAULT: `True`
`converter`	Converter to use to convert the documents to dictionary objects before writing them. These are documented on the Converters page. TYPE: `Optional[Union[str, Callable]]` DEFAULT: `None`
`filesystem`	The filesystem to use to write the files. If None, the filesystem will be inferred from the path (e.g. `s3://` will use S3). TYPE: `Optional[FileSystem]` DEFAULT: `None`
`kwargs`	Additional keyword arguments to pass to the converter. These are documented on the Converters page. DEFAULT: `{}`