Data connectors
We provide connectors to read data from, and write data to, several formats.
Reading from a given path or object takes the following form:
import edsnlp
docs = edsnlp.data.read_{format}(  # or .from_{format} for objects
    # Path to the file or directory
    "path/to/file",
    # How to convert JSON-like samples to Doc objects
    converter=predefined schema or function,
)
Writing to a given path or object takes the following form:
import edsnlp
edsnlp.data.write_{format}(  # or .to_{format} for objects
    # Path to the file or directory
    "path/to/file",
    # Iterable of Doc objects
    docs,
    # How to convert Doc objects to JSON-like samples
    converter=predefined schema or function,
)
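To make the role of the converter concrete, here is a minimal, library-independent sketch of the read, convert and write cycle for a `.jsonl` file. The names `Note`, `read_jsonl`, `write_jsonl`, `dict_to_note` and `note_to_dict` are hypothetical stand-ins and not part of the EDS-NLP API. `Note` only plays the part of a Doc object, the `note_id` and `note_text` keys are chosen for illustration, and the real connectors may work differently internally.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Iterator, List


@dataclass
class Note:
    """Hypothetical stand-in for a Doc object."""

    note_id: str
    text: str


def dict_to_note(sample: Dict[str, Any]) -> Note:
    # "Read" converter: JSON-like sample -> document object
    return Note(note_id=sample["note_id"], text=sample["note_text"])


def note_to_dict(note: Note) -> Dict[str, Any]:
    # "Write" converter: document object -> JSON-like sample
    return {"note_id": note.note_id, "note_text": note.text}


def read_jsonl(path: Path, converter: Callable[[Dict[str, Any]], Any]) -> Iterator[Any]:
    # Lazily parse one JSON object per non-empty line and convert it
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield converter(json.loads(line))


def write_jsonl(path: Path, docs: Iterable[Any], converter: Callable[[Any], Dict[str, Any]]) -> None:
    # Convert each document and serialize it as one JSON line
    with Path(path).open("w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(converter(doc), ensure_ascii=False) + "\n")


def round_trip(notes: List[Note]) -> List[Note]:
    # Write the notes to a temporary file, then read them back
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "notes.jsonl"
        write_jsonl(path=path, docs=notes, converter=note_to_dict)
        return list(read_jsonl(path=path, converter=dict_to_note))


notes = [
    Note("1", "Patient admitted for pneumonia."),
    Note("2", "No fever reported."),
]
restored = round_trip(notes)
```

In this sketch the reader and writer know nothing about the document type: everything format-specific lives in the two converter functions, which is the division of labour that the `converter` argument expresses.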
The overall process is illustrated in the following diagram:
At the moment, we support the following data sources:
Source | Description |
---|---|
JSON | .json and .jsonl files |
Standoff & BRAT | .ann and .txt files |
Pandas | Pandas DataFrame objects |
Polars | Polars DataFrame objects |
Spark | Spark DataFrame objects |
and the following schemas:
Schema | Snippet |
---|---|
Custom | converter=custom_fn |
OMOP | converter="omop" |
Standoff | converter="standoff" |
Ents | converter="ents" |
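As an illustration of the `converter=custom_fn` row, a custom converter for the writing direction can be an ordinary function that receives a Doc and returns a JSON-like dict. The sketch below relies only on the standard spaCy attributes `text`, `ents`, `start_char`, `end_char` and `label_`. The function name `doc_to_record`, the output keys, and the `SimpleNamespace` objects standing in for a real Doc are assumptions made so that the example runs without any third-party package.

```python
from types import SimpleNamespace
from typing import Any, Dict


def doc_to_record(doc) -> Dict[str, Any]:
    """Hypothetical custom converter: Doc-like object -> JSON-like dict."""
    return {
        "text": doc.text,
        "entities": [
            {
                "start": ent.start_char,  # character offset where the span begins
                "end": ent.end_char,  # character offset just past the span
                "label": ent.label_,
                "text": ent.text,
            }
            for ent in doc.ents
        ],
    }


# Stand-ins that mimic the attributes of a spaCy Doc and Span (assumption)
text = "Patient admitted for pneumonia."
start = text.index("pneumonia")
fake_ent = SimpleNamespace(
    start_char=start,
    end_char=start + len("pneumonia"),
    label_="disease",
    text="pneumonia",
)
fake_doc = SimpleNamespace(text=text, ents=[fake_ent])

record = doc_to_record(fake_doc)
```

With a real pipeline, a function of this shape would presumably be passed as `converter=doc_to_record` to one of the `write_{format}` or `to_{format}` functions in place of a predefined schema name.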