Parquet

TLDR
import edsnlp

iterator = edsnlp.data.read_parquet(source_path, converter="omop")
docs = nlp.pipe(iterator)
res = edsnlp.data.write_parquet(docs, dest_path, converter="omop")

We provide methods to read and write documents (raw or annotated) from and to parquet files.

As an example, imagine that we have the following documents, which use the OMOP schema (parquet files are not actually stored as human-readable text; the rows are shown this way for the sake of the example):

data.pq
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...

You could also have multiple parquet files in a directory; the reader will read them all.
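
Once read with the omop converter, each row becomes a Doc object. The sketch below is a minimal illustration; it assumes that the omop converter exposes note_id and note_datetime as Doc extensions and fills doc.ents from the entities column (see the Data schemas page for the exact behaviour):

import edsnlp

# Read one file or a whole directory of parquet files; each row becomes a Doc
docs = edsnlp.data.read_parquet("path/to/parquet", converter="omop")

for doc in docs:
    # note_id and note_datetime are assumed to be set as Doc extensions
    # by the omop converter; entities are assumed to land in doc.ents
    print(doc._.note_id, doc._.note_datetime, len(doc.ents))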

Reading Parquet files

The ParquetReader (or edsnlp.data.read_parquet) reads a directory of parquet files (or a single file) and yields documents.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_parquet("path/to/parquet", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_parquet returns a LazyCollection. To efficiently iterate over the documents multiple times or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.read_parquet("path/to/parquet", converter="omop"))
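
Once materialized, the collection can be indexed and traversed any number of times:

print(len(docs))     # the number of documents is now known
first_doc = docs[0]  # random access is possible
for doc in docs:     # ... and so is iterating more than once
    ...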

Parameters

path (Union[str, Path])
    Path to the directory containing the parquet files (the reader will recursively look for files in subdirectories). Supports any filesystem supported by pyarrow.

converter (Union[str, Callable])
    Converter to use to convert the parquet rows of the data source to Doc objects.

read_in_worker (bool, default: False)
    Whether to read the files in the worker processes or in the main process.

kwargs (default: {})
    Additional keyword arguments to pass to the converter. These are documented on the Data schemas page.

Returns
    LazyCollection
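
The read_in_worker flag and the converter keyword arguments compose naturally. A minimal sketch (the converter options themselves depend on the chosen converter and are documented on the Data schemas page):

import edsnlp

# read_in_worker=True moves file reading from the main process to the workers,
# which can help when parsing the parquet files is the bottleneck
docs = edsnlp.data.read_parquet(
    "path/to/parquet",
    converter="omop",
    read_in_worker=True,
)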

Writing Parquet files

edsnlp.data.write_parquet writes a list of documents as a parquet dataset.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_parquet([doc], "path/to/parquet", converter="omop")

Overwriting files

By default, write_parquet will raise an error if the directory already exists and contains parquet files. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
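
For example, a sketch that replaces the output of a previous run (paths are placeholders):

import edsnlp

docs = edsnlp.data.read_parquet("path/to/input", converter="omop")

# overwrite=True replaces any parquet files already present in the output
# directory instead of raising an error
edsnlp.data.write_parquet(docs, "path/to/output", converter="omop", overwrite=True)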

Parameters

data (Union[Any, LazyCollection])
    The data to write (either a list of documents or a LazyCollection).

path (Union[str, Path])
    Path to the directory where the parquet files will be written. Supports any filesystem supported by pyarrow.

num_rows_per_file (int, default: 1024)
    The maximum number of documents to write in each parquet file.

overwrite (bool, default: False)
    Whether to overwrite existing directories.

write_in_worker (bool, default: False)
    Whether to write the files in the worker processes or in the main process.

accumulate (bool, default: True)
    Whether to accumulate the results sent to the writer by workers until the batch is full or the writer is finalized. If False, each file will not be larger than the batches it receives. This option requires that the writer is finalized before the end of the processing, which may not be compatible with some backends, such as Spark.

    If write_in_worker is True, documents will be accumulated in each worker but not across workers, leading to a larger number of files.

converter (Optional[Union[str, Callable]])
    Converter to use to convert the documents to dictionary objects before writing them.

kwargs (default: {})
    Additional keyword arguments to pass to the converter. These are documented on the Data schemas page.
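
Putting it together, a sketch of a read, annotate and write pipeline using the parameters above (paths and pipeline components are placeholders):

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)  # add your components here

# Lazily read the input dataset, annotate it, then write it back as parquet
docs = edsnlp.data.read_parquet("path/to/input", converter="omop")
docs = nlp.pipe(docs)
edsnlp.data.write_parquet(
    docs,
    "path/to/output",
    converter="omop",
    num_rows_per_file=1024,  # cap the number of documents per parquet file
    overwrite=True,          # replace parquet files already in the output directory
)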