Parquet

TLDR

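import edsnlp

# `nlp` is an existing EDS-NLP pipeline; `source_path` and `dest_path` are placeholders
iterator = edsnlp.data.read_parquet(source_path, converter="omop")
docs = nlp.pipe(iterator)
# The data comes first and the destination path second
res = edsnlp.data.write_parquet(docs, dest_path, converter="omop")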

We provide methods to read and write documents (raw or annotated) from and to parquet files.

As an example, imagine that we have the following documents, which use the OMOP schema (parquet files are not actually stored as human-readable text; the rows are shown as text for the sake of the example):

data.pq

{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...

You could also have multiple parquet files in a directory; the reader will read them all.

Reading Parquet files

The ParquetReader (or edsnlp.data.read_parquet) reads a directory of parquet files (or a single file) and yields documents.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_parquet("path/to/parquet", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_parquet returns a LazyCollection. To iterate over the documents multiple times efficiently, or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.read_parquet("path/to/parquet", converter="omop"))
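The difference between a one-pass stream and a list can be sketched with plain Python. This is only an analogy: `fake_reader` is a hypothetical stand-in, not an EDS-NLP function, and a LazyCollection may well support repeated iteration (the text above only says it is not efficient). A plain generator is the extreme case, where a second pass yields nothing at all.

```python
from typing import Iterator, List


def fake_reader() -> Iterator[str]:
    """Hypothetical stand-in for a lazy document stream (not an EDS-NLP API)."""
    for text in ["Le patient ...", "Autre doc ..."]:
        yield text


# A generator can only be consumed once
stream = fake_reader()
first_pass = [text for text in stream]
second_pass = [text for text in stream]  # already exhausted, so this is empty

# Materializing the stream as a list allows indexing and repeated iteration
docs: List[str] = list(fake_reader())
first_doc = docs[0]
docs_again = [text for text in docs]
```

Keeping every document in a list costs memory, so this is best reserved for collections that fit comfortably in RAM.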

Parameters

PARAMETER	DESCRIPTION
`path`	Path to the directory containing the parquet files (will recursively look for files in subdirectories). Supports any filesystem supported by pyarrow. TYPE: `Union[str, Path]`
`converter`	Converter used to turn the parquet rows of the data source into Doc objects. TYPE: `Union[str, Callable]`
`read_in_worker`	Whether to read the files in the workers or in the main process. TYPE: `bool` DEFAULT: `False`
`kwargs`	Additional keyword arguments to pass to the converter. These are documented on the Data schemas page. DEFAULT: `{}`

RETURNS	DESCRIPTION
`LazyCollection`
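The `path` description above says that a single file is accepted and that directories are searched recursively. A minimal sketch of that lookup on a local filesystem could look as follows. `find_parquet_files` is a hypothetical helper and the `.parquet` suffix is an assumption; the actual reader may discover files differently (for instance through pyarrow).

```python
import tempfile
from pathlib import Path
from typing import List


def find_parquet_files(path: Path, suffix: str = ".parquet") -> List[Path]:
    """Hypothetical helper: return a single file as-is, search a directory recursively."""
    if path.is_file():
        return [path]
    return sorted(p for p in path.rglob(f"*{suffix}") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2021").mkdir()
    (root / "2021" / "part-0.parquet").touch()  # nested file, found by rglob
    (root / "part-1.parquet").touch()           # top-level file
    (root / "notes.txt").touch()                # ignored: wrong suffix
    found = [p.relative_to(root).as_posix() for p in find_parquet_files(root)]
    single_names = [p.name for p in find_parquet_files(root / "part-1.parquet")]
```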

Writing Parquet files

edsnlp.data.write_parquet writes a list of documents (or a LazyCollection) as a parquet dataset.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_parquet([doc], "path/to/parquet")

Overwriting files

By default, write_parquet will raise an error if the directory already exists and contains parquet files. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
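The guard described above can be pictured with a small standard-library sketch. `ensure_writable` is a hypothetical helper, and both the `FileExistsError` exception and the `.parquet` suffix are assumptions made for illustration; this is not the code used by write_parquet.

```python
import tempfile
from pathlib import Path


def ensure_writable(path: Path, overwrite: bool = False) -> None:
    """Hypothetical guard: refuse a directory that already holds parquet files."""
    existing = sorted(path.glob("*.parquet")) if path.is_dir() else []
    if existing and not overwrite:
        raise FileExistsError(
            f"{path} already contains {len(existing)} parquet file(s); "
            "pass overwrite=True to replace them"
        )
    for file in existing:
        file.unlink()  # only reached when overwrite=True
    path.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "annotated"
    ensure_writable(target)  # directory does not exist yet, so it is created
    (target / "part-0.parquet").touch()
    try:
        ensure_writable(target)  # default overwrite=False
        refused = False
    except FileExistsError:
        refused = True
    ensure_writable(target, overwrite=True)  # old files are removed
    remaining = list(target.glob("*.parquet"))
```

Failing by default is the safer choice here, since annotations can be expensive to recompute.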

Parameters

PARAMETER	DESCRIPTION
`data`	The data to write (either a list of documents or a LazyCollection). TYPE: `Union[Any, LazyCollection]`
`path`	Path to the directory in which the parquet files will be written. Supports any filesystem supported by pyarrow. TYPE: `Union[str, Path]`
`num_rows_per_file`	The maximum number of documents to write in each parquet file. TYPE: `int` DEFAULT: `1024`
`overwrite`	Whether to overwrite existing directories. TYPE: `bool` DEFAULT: `False`
`write_in_worker`	Whether to write the files in the workers or in the main process. TYPE: `bool` DEFAULT: `False`
`accumulate`	Whether to accumulate the results sent to the writer by workers until the batch is full or the writer is finalized. If False, each file will not be larger than the size of the batches it receives. This option requires that the writer is finalized before the end of the processing, which may not be compatible with some backends, such as `spark`. If `write_in_worker` is True, documents will be accumulated in each worker but not across workers, therefore leading to a larger number of files. TYPE: `bool` DEFAULT: `True`
`converter`	Converter to use to convert the documents to dictionary objects before writing them. TYPE: `Optional[Union[str, Callable]]`
`kwargs`	Additional keyword arguments to pass to the converter. These are documented on the Data schemas page. DEFAULT: `{}`
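The effect of `num_rows_per_file` when results are accumulated in a single process can be sketched in plain Python: rows are grouped into batches of at most that size, and each batch would become one file. `chunk_rows` is a hypothetical helper written for this illustration, not part of EDS-NLP, and the rows are made-up placeholders.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_rows(rows: Iterable[dict], num_rows_per_file: int = 1024) -> Iterator[List[dict]]:
    """Hypothetical helper: group rows into batches of at most num_rows_per_file."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, num_rows_per_file))
        if not batch:
            return
        yield batch


# 2500 placeholder rows, streamed lazily
rows = ({"note_id": i, "note_text": f"doc {i}"} for i in range(2500))
batch_sizes = [len(batch) for batch in chunk_rows(rows, num_rows_per_file=1024)]
```

With 2500 rows and the default of 1024, this yields two full batches and a final, smaller one. When accumulation happens separately in each worker, every worker would end with its own partial batch, which is consistent with the larger number of files mentioned for `write_in_worker`.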