Parquet
TLDR
import edsnlp
iterator = edsnlp.data.read_parquet(source_path, converter="omop")
docs = nlp.pipe(iterator)
res = edsnlp.data.write_parquet(docs, dest_path, converter="omop")
We provide methods to read and write documents (raw or annotated) from and to parquet files.
As an example, imagine that we have the following documents, which use the OMOP schema (parquet files are not actually stored as human-readable text; they are shown this way for the sake of the example):
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...
You can also store multiple parquet files in a directory; the reader will read them all.
Reading Parquet files
The ParquetReader (or edsnlp.data.read_parquet) reads a directory of parquet files (or a single file) and yields documents.
Example
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_parquet("path/to/parquet", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)
Generator vs list
edsnlp.data.read_parquet returns a LazyCollection. To iterate over the documents multiple times efficiently or to access them by index, you must convert it to a list:
docs = list(edsnlp.data.read_parquet("path/to/parquet", converter="omop"))
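The difference between a one-pass iterator and a list can be illustrated without any parquet file. The sketch below uses a plain Python generator as a stand-in for a lazy reader; `read_rows` is a hypothetical function, not part of edsnlp, and a real LazyCollection may behave differently (for instance by re-reading the files on each iteration).

```python
def read_rows():
    """Hypothetical stand-in for a lazy reader: yields rows one at a time."""
    for note_id, text in enumerate(["Le patient ...", "Autre doc ..."]):
        yield {"note_id": note_id, "note_text": text}


# A generator can only be consumed once.
lazy = read_rows()
first_pass = [row["note_id"] for row in lazy]
second_pass = [row["note_id"] for row in lazy]  # nothing left to read

# Materializing the rows in a list allows repeated iteration and indexing.
docs = list(read_rows())
ids_again = [row["note_id"] for row in docs]
second_doc = docs[1]
```

Materializing everything in a list keeps all documents in memory, so this is best reserved for datasets that fit comfortably in RAM.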
Parameters
PARAMETER | DESCRIPTION |
---|---|
path | Path to the directory containing the parquet files (subdirectories are searched recursively). Supports any filesystem supported by pyarrow. |
converter | Converter used to turn the parquet rows of the data source into Doc objects. |
read_in_worker | Whether to read the files in the worker or in the main process. |
kwargs | Additional keyword arguments to pass to the converter. These are documented on the Data schemas page. |
RETURNS | DESCRIPTION |
---|---|
LazyCollection | |
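To make the `path` behavior concrete, here is a minimal sketch of how a directory (or a single file) could be resolved to a list of parquet files on a local filesystem. `find_parquet_files` is a hypothetical helper written with `pathlib`; it is not the reader's actual implementation, which relies on pyarrow filesystems.

```python
import tempfile
from pathlib import Path


def find_parquet_files(root):
    """Hypothetical helper: return root if it is a file, else every *.parquet below it."""
    root = Path(root)
    if root.is_file():
        return [root]
    return list(root.rglob("*.parquet"))


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "2021").mkdir()
    (base / "part-0.parquet").touch()
    (base / "2021" / "part-1.parquet").touch()
    (base / "notes.txt").touch()  # ignored: not a parquet file

    # Directory input: files in subdirectories are found as well.
    found = sorted(p.relative_to(base).as_posix() for p in find_parquet_files(base))
    # Single-file input: the file itself is returned.
    single = [p.name for p in find_parquet_files(base / "part-0.parquet")]
```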
Writing Parquet files
edsnlp.data.write_parquet writes a list of documents as a parquet dataset.
Example
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc = nlp("My document with entities")
edsnlp.data.write_parquet([doc], "path/to/parquet")
Overwriting files
By default, write_parquet will raise an error if the directory already exists and contains parquet files. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.
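The overwrite rule described here can be sketched as a small guard that runs before any file is written. `check_destination` and the choice of `FileExistsError` are assumptions made for illustration; the exception actually raised by write_parquet is not specified on this page.

```python
import tempfile
from pathlib import Path


def check_destination(path, overwrite=False):
    """Hypothetical guard: refuse a directory that already holds parquet files."""
    path = Path(path)
    if path.is_dir() and any(path.rglob("*.parquet")) and not overwrite:
        raise FileExistsError(
            f"{path} already contains parquet files; pass overwrite=True to replace them"
        )
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "annotated"
    check_destination(dest)  # first call: the directory does not exist yet
    (dest / "part-0.parquet").touch()  # simulate a previous export

    try:
        check_destination(dest)
        refused = False
    except FileExistsError:
        refused = True

    # Explicitly opting in lets the write proceed.
    allowed = check_destination(dest, overwrite=True) == dest
```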
Parameters
PARAMETER | DESCRIPTION |
---|---|
data | The data to write (either a list of documents or a LazyCollection). |
path | Path to the directory in which the parquet files will be written. Supports any filesystem supported by pyarrow. |
num_rows_per_file | The maximum number of documents to write in each parquet file. |
overwrite | Whether to overwrite existing directories. |
write_in_worker | Whether to write the files in the workers or in the main process. |
accumulate | Whether to accumulate the results sent to the writer by workers until the batch is full or the writer is finalized. If False, each file will not be larger than the size of the batches it receives. This option requires that the writer is finalized before the end of the processing, which may not be compatible with some backends. |
converter | Converter used to turn the documents into dictionary objects before writing them. |
kwargs | Additional keyword arguments to pass to the converter. These are documented on the Data schemas page. |
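The effect of `num_rows_per_file` can be pictured as grouping the converted rows into batches, each of which would become one parquet file. The sketch below only shows the batching step; `split_into_files` is a hypothetical helper and does not reflect how the writer is actually implemented.

```python
from itertools import islice


def split_into_files(rows, num_rows_per_file):
    """Hypothetical helper: yield batches of at most num_rows_per_file rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, num_rows_per_file))
        if not batch:
            return
        yield batch


rows = [{"note_id": i} for i in range(5)]
files = list(split_into_files(rows, num_rows_per_file=2))
sizes = [len(batch) for batch in files]  # the last batch holds the remainder
```

Because the helper pulls from an iterator, it works with lazily produced rows as well as with lists.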