Parquet
TLDR
import edsnlp
# `nlp` is an existing EDS-NLP pipeline; `source_path` and `dest_path` are defined elsewhere
iterator = edsnlp.data.read_parquet(source_path, converter="omop")
docs = nlp.pipe(iterator)
res = edsnlp.data.write_parquet(dest_path, docs, converter="omop")
We provide methods to read and write documents (raw or annotated) from and to Parquet files.
As an example, imagine that we have the following documents that use the OMOP schema (Parquet is a binary format, so the files are not actually stored as human-readable text; the rows are shown as text here for the sake of the example):
data.pq
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...
The data can also be split across multiple Parquet files in a directory; the reader will read them all.