Data connectors

We provide connectors to read data from, and write data to, several file formats and in-memory objects.

Reading from a given path or object takes the following form, where {format} is a placeholder for one of the supported sources listed below:

import edsnlp

docs = edsnlp.data.read_{format}(  # or .from_{format} for objects
    # Path to the file or directory
    "path/to/file",
    # How to convert JSON-like samples to Doc objects:
    # a predefined schema name or a custom function
    converter=...,
)
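To make the role of the converter argument on the reading side concrete, here is a minimal, library-independent sketch using only the Python standard library. It is not edsnlp's implementation: `SimpleDoc`, `dict_to_doc` and `read_jsonl_sketch` are hypothetical names introduced for illustration, and the `note_id` / `note_text` keys are assumed sample fields.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterator


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a Doc object (not an edsnlp class)."""

    doc_id: str
    text: str


def dict_to_doc(sample: dict) -> SimpleDoc:
    """Custom converter: turn one JSON-like sample into a SimpleDoc."""
    # The key names below are assumptions made for this example.
    return SimpleDoc(doc_id=str(sample["note_id"]), text=sample["note_text"])


def read_jsonl_sketch(
    path: Path, converter: Callable[[dict], SimpleDoc]
) -> Iterator[SimpleDoc]:
    """Yield one converted document per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield converter(json.loads(line))


# Small demo on a temporary file holding two samples.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.jsonl"
    path.write_text(
        '{"note_id": 1, "note_text": "Patient admitted for asthma."}\n'
        '{"note_id": 2, "note_text": "No known allergies."}\n',
        encoding="utf-8",
    )
    docs = list(read_jsonl_sketch(path, converter=dict_to_doc))

print([doc.doc_id for doc in docs])
```

The reader only knows how to produce JSON-like samples; everything specific to the data layout lives in the converter, which is why the same reader can serve several schemas.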

Writing to a given path or object takes the following form:

import edsnlp

edsnlp.data.write_{format}(  # or .to_{format} for objects
    # Path to the file or directory
    "path/to/file",
    # Iterable of Doc objects
    docs,
    # How to convert Doc objects to JSON-like samples:
    # a predefined schema name or a custom function
    converter=...,
)
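The writing side mirrors this: the converter maps each document back to a JSON-like sample before it is serialized. The following standard-library sketch illustrates the idea under stated assumptions. `SimpleDoc`, `doc_to_dict` and `write_jsonl_sketch` are hypothetical names, not edsnlp functions, and the output keys are chosen for the example only.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a Doc object (not an edsnlp class)."""

    doc_id: str
    text: str


def doc_to_dict(doc: SimpleDoc) -> dict:
    """Custom converter: turn one document into a JSON-like sample."""
    # The key names below are assumptions made for this example.
    return {"note_id": doc.doc_id, "note_text": doc.text}


def write_jsonl_sketch(
    path: Path,
    docs: Iterable[SimpleDoc],
    converter: Callable[[SimpleDoc], dict],
) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(converter(doc), ensure_ascii=False) + "\n")
            count += 1
    return count


docs = [
    SimpleDoc("1", "Patient admitted for asthma."),
    SimpleDoc("2", "No known allergies."),
]

# Small demo: write to a temporary file, then read the raw lines back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "out.jsonl"
    n_written = write_jsonl_sketch(out_path, docs, converter=doc_to_dict)
    written_lines = out_path.read_text(encoding="utf-8").splitlines()

print(n_written)
```

Passing `ensure_ascii=False` keeps accented characters readable in the output file, which matters for non-English clinical text.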

The overall process is illustrated in the following diagram:

[Figure: Data connectors overview]

At the moment, we support the following data sources:

| Source          | Description              |
|-----------------|--------------------------|
| JSON            | .json and .jsonl files   |
| Standoff & BRAT | .ann and .txt files      |
| Pandas          | Pandas DataFrame objects |
| Polars          | Polars DataFrame objects |
| Spark           | Spark DataFrame objects  |
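For readers unfamiliar with the Standoff & BRAT source, each document is a .txt file holding the raw text plus an .ann file whose text-bound lines have the form `ID<TAB>LABEL START END<TAB>TEXT`. The sketch below parses such lines with the standard library only. It is not the parser used by edsnlp: `TextBound` and `parse_ann` are hypothetical names, and the sketch deliberately ignores discontinuous spans and non-entity annotations.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    """One text-bound (entity) annotation from a BRAT .ann file."""

    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_content: str) -> List[TextBound]:
    """Extract contiguous text-bound annotations from .ann content."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip attributes, relations, notes, blank lines
        ann_id, label_and_span, surface = line.split("\t", 2)
        if ";" in label_and_span:
            continue  # discontinuous span: out of scope for this sketch
        label, start, end = label_and_span.split(" ")
        entities.append(TextBound(ann_id, label, int(start), int(end), surface))
    return entities


# Example content of a .txt / .ann pair (made up for illustration).
txt = "Patient admitted for asthma."
ann = "T1\tdisease 21 27\tasthma\nA1\tNegation T1\n"

entities = parse_ann(ann)
for ent in entities:
    # Character offsets in the .ann file index into the .txt content.
    print(ent.label, repr(txt[ent.start:ent.end]))
```

The offsets are character positions in the .txt file, with the end offset exclusive, so slicing the text with them recovers the annotated span.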

and the following schemas:

| Schema   | Snippet              |
|----------|----------------------|
| Custom   | converter=custom_fn  |
| OMOP     | converter="omop"     |
| Standoff | converter="standoff" |