HuggingFace datasets

TLDR
import edsnlp

nlp = edsnlp.blank("eds")

# Read from the Hub (streaming) and convert to Docs
stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
    tag_order=[
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ],
    nlp=nlp,
    load_kwargs={"streaming": True},
)

# Optionally process
stream = stream.map_pipeline(nlp)

# Export back to a HF IterableDataset of dicts
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
    words_column="tokens",
    ner_tags_column="ner_tags",
)

Use the Hugging Face Datasets ecosystem as a data source or sink for EDS-NLP pipelines. You can read datasets from the Hub or reuse already loaded datasets.Dataset / datasets.IterableDataset objects. The resulting stream can optionally be shuffled deterministically, looped over, and mapped through any pipeline before being written back as an IterableDataset.

We rely on the datasets package. Install it with pip install datasets or pip install "edsnlp[ml]".

Typical converters:

  • hf_ner: expects token and tag columns (defaults: tokens, ner_tags) and produces Docs with entities. Compatible with BILOU/IOB schemes through tag_order or tag_map.
  • hf_text: expects a single text column (default: text) and produces plain Docs; an optional id_column is used when present.
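To make these schemas concrete, here is a plain-Python sketch (with hypothetical row values) of the row shapes the two converters expect, and how integer ner_tags are resolved against tag_order:

```python
# A CoNLL-style row, as the hf_ner converter expects it
# (column names shown are the defaults: "tokens" and "ner_tags").
ner_row = {
    "tokens": ["EU", "rejects", "German", "call"],
    "ner_tags": [3, 0, 7, 0],  # integer-encoded BIO labels
}

# tag_order maps those integers back to string tags.
tag_order = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
             "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tags = [tag_order[i] for i in ner_row["ner_tags"]]
print(tags)  # ['B-ORG', 'O', 'B-MISC', 'O']

# A plain-text row, as the hf_text converter expects it
# (default text column is "text"; "id" is optional).
text_row = {"text": "Le patient présente une fièvre.", "id": "doc-1"}
```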

When loading a dataset dictionary with multiple splits, pass an explicit split (e.g. "train"). You can also select a configuration/subset via name and forward any datasets.load_dataset arguments through load_kwargs (e.g. {"streaming": True}).
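As an illustration of why an explicit split is needed: without one, datasets.load_dataset returns a DatasetDict keyed by split name. The selection logic can be sketched like this (plain dicts standing in for Dataset objects; an illustration of the behaviour, not EDS-NLP's actual implementation):

```python
# Stand-in for a loaded DatasetDict: split name -> rows.
dataset_dict = {"train": ["row0", "row1"], "validation": ["row2"]}

def select_split(ds, split=None):
    """Pick one split out of a dict of splits; pass non-dicts through."""
    if isinstance(ds, dict):
        if split is None:
            raise ValueError(f"Please pass a split among {sorted(ds)}")
        return ds[split]
    return ds

train_rows = select_split(dataset_dict, split="train")
print(train_rows)  # ['row0', 'row1']
```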

Reading Hugging Face datasets

Load a dataset from the HuggingFace Hub as a Stream.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    tag_order=[
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ],
    converter="hf_ner",
)
annotated_docs = nlp.pipe(doc_iterator)

Parameters

PARAMETER DESCRIPTION
dataset

Either a dataset identifier (e.g. "conll2003") or an already loaded datasets.Dataset / datasets.IterableDataset object.

TYPE: Union[str, Any]

split

Which split to load (e.g. "train"). If None, the default dataset split returned by datasets.load_dataset is used.

TYPE: Optional[str] DEFAULT: None

name

Configuration name for datasets with multiple configs (e.g. "en" for a multilingual dataset). Also known as the subset name.

TYPE: Optional[str] DEFAULT: None

converter

Converter(s) to transform dataset dicts to Doc objects. Recommended converters are "hf_ner" and "hf_text". More information is available in the Converters page.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

shuffle

Whether to shuffle the dataset before yielding. If True or 'dataset', the whole dataset will be materialized and shuffled (may be expensive).

TYPE: Union[Literal['dataset'], bool] DEFAULT: False

seed

Random seed for shuffling.

TYPE: Optional[int] DEFAULT: None

loop

Whether to loop over the dataset indefinitely.

TYPE: bool DEFAULT: False

load_kwargs

Dictionary of additional kwargs that will be passed to the datasets.load_dataset() method.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

kwargs

Additional keyword arguments passed to the converter, these are documented in the Converters page.

DEFAULT: {}

RETURNS DESCRIPTION
Stream

A lazy Stream over the dataset rows, converted to Doc objects when a converter is given.
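The shuffle, seed and loop parameters above can be illustrated in plain Python (a sketch of the documented behaviour, not EDS-NLP internals): shuffle materializes the rows before shuffling, seed makes the order reproducible, and loop cycles over the data indefinitely:

```python
import itertools
import random

def iter_rows(rows, shuffle=False, seed=None, loop=False):
    if shuffle:  # True or "dataset": materialize everything, then shuffle
        rows = list(rows)
        random.Random(seed).shuffle(rows)
    return itertools.cycle(rows) if loop else iter(rows)

rows = ["a", "b", "c"]

# Same seed => same deterministic order on every pass.
first = list(iter_rows(rows, shuffle=True, seed=42))
second = list(iter_rows(rows, shuffle=True, seed=42))
assert first == second

# loop=True yields the dataset over and over.
looped = list(itertools.islice(iter_rows(rows, loop=True), 7))
print(looped)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```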

Writing Hugging Face datasets

Convert a collection/Stream of Doc objects (or already-converted dicts) into a datasets.IterableDataset.

Examples

1) Convert a Stream of HuggingFace NER examples into Doc objects (reader), process them and create an IterableDataset of dictionaries using the hf_ner writer converter:

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
)

# Apply a pipeline or other processing
stream = stream.map_pipeline(nlp)

# Export as HF IterableDataset of dicts (no push)
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
)

2) Convert plain text Docs to HF text-format dicts:

# docs_stream is a Stream of plain-text Docs
edsnlp.data.to_huggingface_dataset(
    docs_stream,
    converter="hf_text",
    execute=True,
    # converter kwargs such as `text_column` and `id_column`
    # are validated and forwarded to the converter.
)

Parameters

PARAMETER DESCRIPTION
data

Iterable of Doc objects or a Stream. If converter is provided the stream items are expected to be Doc objects. Otherwise items should already be mapping-like dicts.

TYPE: Union[Any, Stream]

converter

Converter name or callable used to transform Doc -> dict before creating the dataset. Typical values: "hf_ner" and "hf_text" (as used in the examples above). Converter kwargs may be passed via **kwargs.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

execute

If False, return the lazily transformed Stream without executing it. If True (the default), execute the stream and return a datasets.IterableDataset.

TYPE: bool DEFAULT: True

**kwargs

Extra kwargs forwarded to the converter factory.

DEFAULT: {}

RETURNS DESCRIPTION
Union[Stream, IterableDataset]

An IterableDataset containing the converted data, or the transformed Stream when execute=False.
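To show the direction the writer converter works in, here is a minimal BIO-encoding sketch in plain Python (a hypothetical helper, with (start, end, label) token spans standing in for Doc entities; the real converter reads spaCy Doc objects):

```python
def encode_bio(tokens, entities, tag_order):
    """Encode (start, end, label) token spans as integer BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    index = {tag: i for i, tag in enumerate(tag_order)}
    return [index[tag] for tag in tags]

tag_order = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
             "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tokens = ["Angela", "Merkel", "visited", "Paris"]
row = {
    "tokens": tokens,
    "ner_tags": encode_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")], tag_order),
}
print(row["ner_tags"])  # [1, 2, 0, 5]
```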