Data connectors
We provide connectors to read data from, and write data to, several formats.
Reading from a given path or object takes the following form:
import edsnlp

docs = edsnlp.data.read_{format}(  # or .from_{format} for objects
    # Path to the file or directory
    "path/to/file",
    # How to convert JSON-like samples to Doc objects
    converter="schema",
)
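As a concrete instance of the pattern above, here is a minimal sketch that prepares a small `.jsonl` file and reads it with the JSON connector and the `omop` schema. The sample field names (`note_id`, `note_text`), the file name and the `load_docs` helper are assumptions made for illustration; check them against the schema expected by your installed version.

```python
import json
import tempfile
from pathlib import Path

# Two OMOP-style notes. The field names are assumptions about the "omop" schema.
samples = [
    {"note_id": 1, "note_text": "Le patient est diabétique."},
    {"note_id": 2, "note_text": "Pas de fièvre ce jour."},
]

# Write one JSON object per line (.jsonl layout)
tmp_dir = Path(tempfile.mkdtemp())
jsonl_path = tmp_dir / "notes.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")


def load_docs(path):
    """Hypothetical helper: read the file with the JSON connector and the "omop" schema."""
    import edsnlp  # imported lazily so the rest of the sketch runs without edsnlp

    return edsnlp.data.read_json(path, converter="omop")
```

With edsnlp installed, calling `load_docs(jsonl_path)` should give an iterable of Doc objects, one per line of the file.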
Writing to a given path or object takes the following form:
import edsnlp

edsnlp.data.write_{format}(  # or .to_{format} for objects
    # Path to the file or directory
    "path/to/file",
    # Iterable of Doc objects
    docs,
    # How to convert Doc objects to JSON-like samples
    converter="schema",
)
The overall process is illustrated in the following diagram:
At the moment, we support the following data sources:
Source | Description |
---|---|
JSON | .json and .jsonl files |
Standoff & BRAT | .ann and .txt files |
Pandas | Pandas DataFrame objects |
Spark | Spark DataFrame objects |
and the following schemas:
Schema | Shorthand |
---|---|
OMOP | omop |
Standoff | standoff |
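To make the `{format}` placeholder concrete, the sketch below spells out the function names obtained by substituting each supported source into the `read_`/`write_` (files) and `from_`/`to_` (objects) patterns. The `CONNECTORS` table and the `connector_name` helper are hypothetical and exist only for illustration; the resulting names should be checked against the `edsnlp.data` module of your installed version.

```python
# Names obtained by substituting {format} into the patterns described above.
# File-based sources use read_/write_, in-memory objects use from_/to_.
CONNECTORS = {
    "json": ("read_json", "write_json"),
    "standoff": ("read_standoff", "write_standoff"),
    "pandas": ("from_pandas", "to_pandas"),
    "spark": ("from_spark", "to_spark"),
}


def connector_name(source: str, mode: str) -> str:
    """Return the edsnlp.data function name for a source and a mode ("read" or "write")."""
    reader, writer = CONNECTORS[source.lower()]
    if mode == "read":
        return reader
    if mode == "write":
        return writer
    raise ValueError(f"mode must be 'read' or 'write', got {mode!r}")


print(connector_name("JSON", "read"))     # read_json
print(connector_name("Pandas", "write"))  # to_pandas
```

Either schema shorthand (`omop` or `standoff`) can then be passed as the `converter` argument of the chosen function.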