HuggingFace datasets

TLDR
import edsnlp

nlp = edsnlp.blank("eds")

# Read from the Hub (streaming) and convert to Docs
stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
    tag_order=[
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ],
    nlp=nlp,
    load_kwargs={"streaming": True},
)

# Optionally process
stream = stream.map_pipeline(nlp)

# Export back to a HF IterableDataset of dicts
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
    words_column="tokens",
    ner_tags_column="ner_tags",
)

Use the Hugging Face Datasets ecosystem as a data source or sink for EDS-NLP pipelines. You can read datasets from the Hub or reuse already loaded datasets.Dataset / datasets.IterableDataset objects. The resulting stream can optionally be shuffled deterministically, looped over, and mapped through any pipeline before being written back as an IterableDataset.

We rely on the datasets package. Install it with pip install datasets or pip install "edsnlp[ml]".

Typical converters:

  • hf_ner: expects token and tag columns (defaults: tokens, ner_tags) and produces Docs with entities. Compatible with BILOU/IOB schemes through tag_order or tag_map.
  • hf_text: expects a single text column (default: text) and produces plain Docs; an optional id_column is used when present.
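To make these schemas concrete, here is a plain-Python sketch (with hypothetical row values) of the row shapes the two converters expect, and how integer ner_tags are resolved against tag_order:

```python
# A CoNLL-style row, as the hf_ner converter expects it
# (column names shown are the defaults: "tokens" and "ner_tags").
ner_row = {
    "tokens": ["EU", "rejects", "German", "call"],
    "ner_tags": [3, 0, 7, 0],  # integer-encoded BIO labels
}

# tag_order maps those integers back to string tags.
tag_order = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
             "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tags = [tag_order[i] for i in ner_row["ner_tags"]]
print(tags)  # ['B-ORG', 'O', 'B-MISC', 'O']

# A plain-text row, as the hf_text converter expects it
# (default text column is "text"; "id" is optional).
text_row = {"text": "Le patient présente une fièvre.", "id": "doc-1"}
```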

When loading a dataset dictionary with multiple splits, pass an explicit split (e.g. "train"). You can also select a configuration/subset via name and forward any datasets.load_dataset arguments through load_kwargs (e.g. {"streaming": True}).
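As an illustration of why an explicit split is needed: without one, datasets.load_dataset returns a DatasetDict keyed by split name. The selection logic can be sketched like this (plain dicts standing in for Dataset objects; an illustration of the behaviour, not EDS-NLP's actual implementation):

```python
# Stand-in for a loaded DatasetDict: split name -> rows.
dataset_dict = {"train": ["row0", "row1"], "validation": ["row2"]}

def select_split(ds, split=None):
    """Pick one split out of a dict of splits; pass non-dicts through."""
    if isinstance(ds, dict):
        if split is None:
            raise ValueError(f"Please pass a split among {sorted(ds)}")
        return ds[split]
    return ds

train_rows = select_split(dataset_dict, split="train")
print(train_rows)  # ['row0', 'row1']
```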

Reading Hugging Face datasets

Load a dataset from the HuggingFace Hub as a Stream.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    tag_order=[
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ],
    converter="hf_ner",
)
annotated_docs = nlp.pipe(doc_iterator)

Parameters

PARAMETER DESCRIPTION
dataset

Either a dataset identifier (e.g. "conll2003") or an already loaded datasets.Dataset / datasets.IterableDataset object.

TYPE: Union[str, Any]

split

Which split to load (e.g. "train"). If None, the default dataset split returned by datasets.load_dataset is used.

TYPE: Optional[str] DEFAULT: None

name

Configuration name for datasets with multiple configs (e.g. "en" for a multilingual dataset). Also known as the subset name.

TYPE: Optional[str] DEFAULT: None

converter

Converter(s) to transform dataset dicts to Doc objects. Recommended converters are "hf_ner" and "hf_text". More information is available in the Converters page.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

shuffle

Whether to shuffle the dataset before yielding. If True or 'dataset', the whole dataset will be materialized and shuffled (may be expensive).

TYPE: Union[Literal['dataset'], bool] DEFAULT: False

seed

Random seed for shuffling.

TYPE: Optional[int] DEFAULT: None

loop

Whether to loop over the dataset indefinitely.

TYPE: bool DEFAULT: False

load_kwargs

Dictionary of additional kwargs that will be passed to the datasets.load_dataset() method.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

kwargs

Additional keyword arguments passed to the converter, these are documented in the Converters page.

DEFAULT: {}

RETURNS DESCRIPTION
Stream

A lazy Stream over the dataset rows, converted to Doc objects when a converter is given.
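The shuffle, seed and loop parameters above can be illustrated in plain Python (a sketch of the documented behaviour, not EDS-NLP internals): shuffle materializes the rows before shuffling, seed makes the order reproducible, and loop cycles over the data indefinitely:

```python
import itertools
import random

def iter_rows(rows, shuffle=False, seed=None, loop=False):
    if shuffle:  # True or "dataset": materialize everything, then shuffle
        rows = list(rows)
        random.Random(seed).shuffle(rows)
    return itertools.cycle(rows) if loop else iter(rows)

rows = ["a", "b", "c"]

# Same seed => same deterministic order on every pass.
first = list(iter_rows(rows, shuffle=True, seed=42))
second = list(iter_rows(rows, shuffle=True, seed=42))
assert first == second

# loop=True yields the dataset over and over.
looped = list(itertools.islice(iter_rows(rows, loop=True), 7))
print(looped)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```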

Writing Hugging Face datasets

Convert a collection/Stream of Doc objects (or already-converted dicts) into a datasets.IterableDataset.

Examples

1) Convert a Stream of HuggingFace NER examples into Doc objects (reader), process them and create an IterableDataset of dictionaries using the hf_ner writer converter:

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
)

# Apply a pipeline or other processing
stream = stream.map_pipeline(nlp)

# Export as HF IterableDataset of dicts (no push)
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
)

2) Convert plain text Docs to HF text-format dicts:

# docs_stream is a Stream of plain-text Docs
edsnlp.data.to_huggingface_dataset(
    docs_stream,
    converter="hf_text",
    execute=True,
    # converter kwargs such as `text_column` and `id_column`
    # are validated and forwarded to the converter.
)

Parameters

PARAMETER DESCRIPTION
data

Iterable of Doc objects or a Stream. If converter is provided the stream items are expected to be Doc objects. Otherwise items should already be mapping-like dicts.

TYPE: Union[Any, Stream]

converter

Converter name or callable used to transform Doc -> dict before creating the dataset. Typical values: "hf_ner" and "hf_text" (as used in the examples above). Converter kwargs may be passed via **kwargs.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

execute

If False, return the lazily transformed Stream without executing it. If True (the default), execute the stream and return a datasets.IterableDataset.

TYPE: bool DEFAULT: True

**kwargs

Extra kwargs forwarded to the converter factory.

DEFAULT: {}

RETURNS DESCRIPTION
Union[Stream, IterableDataset]

An IterableDataset containing the converted data, or the transformed Stream when execute=False.
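To show the direction the writer converter works in, here is a minimal BIO-encoding sketch in plain Python (a hypothetical helper, with (start, end, label) token spans standing in for Doc entities; the real converter reads spaCy Doc objects):

```python
def encode_bio(tokens, entities, tag_order):
    """Encode (start, end, label) token spans as integer BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    index = {tag: i for i, tag in enumerate(tag_order)}
    return [index[tag] for tag in tags]

tag_order = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
             "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tokens = ["Angela", "Merkel", "visited", "Paris"]
row = {
    "tokens": tokens,
    "ner_tags": encode_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")], tag_order),
}
print(row["ner_tags"])  # [1, 2, 0, 5]
```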