HuggingFace datasets
TLDR
```python
import edsnlp

nlp = edsnlp.blank("eds")

# Read from the Hub (streaming) and convert to Docs
stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
    tag_order=[
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ],
    nlp=nlp,
    load_kwargs={"streaming": True},
)

# Optionally process
stream = stream.map_pipeline(nlp)

# Export back to a HF IterableDataset of dicts
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
    words_column="tokens",
    ner_tags_column="ner_tags",
)
```
Use the Hugging Face Datasets ecosystem as a data source or sink for EDS-NLP pipelines. You can read datasets from the Hub or reuse already loaded `datasets.Dataset` / `datasets.IterableDataset` objects, optionally shuffle them deterministically, loop over them, and map them through any pipeline before writing them back as an `IterableDataset`.
We rely on the `datasets` package. Install it with `pip install datasets` or `pip install "edsnlp[ml]"`.
Typical converters:

- `hf_ner`: expects token and tag columns (defaults: `tokens`, `ner_tags`) and produces Docs with entities. Compatible with BILOU/IOB schemes through `tag_order` or `tag_map`.
- `hf_text`: expects a single text column (default: `text`) and produces plain Docs; an optional `id_column` is inferred when present.
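To make the `tag_order` semantics concrete, here is a minimal, self-contained sketch (plain Python, independent of EDS-NLP) of the general idea behind IOB decoding: each integer tag is mapped to its label via `tag_order`, then contiguous `B-`/`I-` tokens are grouped into entity spans. The `iob_to_spans` helper is illustrative, not part of the EDS-NLP API.

```python
tag_order = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def iob_to_spans(tag_ids, tag_order):
    """Group integer IOB tags into (start, end, label) token spans."""
    spans = []
    start, label = None, None
    for i, tag_id in enumerate(tag_ids):
        tag = tag_order[tag_id]
        # Close the current span on "O", on a new "B-", or on a label change
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.append((start, i, label))
            start, label = None, None
        # Open a new span (also tolerates a dangling "I-" with no "B-")
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tag_ids), label))
    return spans

# "EU rejects German call" with CoNLL-2003-style tags: B-ORG O B-MISC O
print(iob_to_spans([3, 0, 7, 0], tag_order))  # → [(0, 1, 'ORG'), (2, 3, 'MISC')]
```

A custom `tag_order` simply changes which integer maps to which label, which is why it must match the ordering used by the dataset's tag feature.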
When loading a dataset dictionary with multiple splits, pass an explicit `split` (e.g. `"train"`). You can also select a configuration/subset via `name` and forward any `datasets.load_dataset` arguments through `load_kwargs` (e.g. `{"streaming": True}`).
Reading Hugging Face datasets
Load a dataset from the HuggingFace Hub as a Stream.
Example
```python
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc_iterator = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    tag_order=[
        "O",
        "B-PER",
        "I-PER",
        "B-ORG",
        "I-ORG",
        "B-LOC",
        "I-LOC",
        "B-MISC",
        "I-MISC",
    ],
    converter="hf_ner",
)
annotated_docs = nlp.pipe(doc_iterator)
```
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| `dataset` | Either a dataset identifier (e.g. `"conll2003"`) or an already loaded `datasets.Dataset` / `datasets.IterableDataset`. |
| `split` | Which split to load (e.g. `"train"`). If `None`, the default split returned by `datasets.load_dataset` is used. |
| `name` | Configuration name for datasets with multiple configs (e.g. `"en"` for a multilingual dataset). Also known as the subset name. |
| `converter` | Converter(s) to transform dataset dicts to `Doc` objects. Recommended converters are `hf_ner` and `hf_text`. |
| `shuffle` | Whether to shuffle the dataset before yielding. If `True` or `"dataset"`, the whole dataset will be materialized and shuffled (may be expensive). |
| `seed` | Random seed for shuffling. |
| `loop` | Whether to loop over the dataset indefinitely. |
| `load_kwargs` | Dictionary of additional kwargs passed to `datasets.load_dataset` (e.g. `{"streaming": True}`). |
| `kwargs` | Additional keyword arguments passed to the converter; these are documented in the Converters page. |

| RETURNS | DESCRIPTION |
|---|---|
| `Stream` | A `Stream` of `Doc` objects. |
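The interplay of `shuffle`, `seed` and `loop` can be pictured with a small stdlib-only sketch (the `iterate` helper below is illustrative, not EDS-NLP code): a fixed seed makes each reshuffle deterministic, and looping replays the data indefinitely, reshuffling on each pass.

```python
import random
from itertools import islice

def iterate(examples, shuffle=False, seed=None, loop=False):
    """Yield examples; reshuffle deterministically each pass; cycle if loop."""
    rng = random.Random(seed)
    while True:
        batch = list(examples)
        if shuffle:
            rng.shuffle(batch)  # deterministic given the seed
        yield from batch
        if not loop:
            break

# Without loop: one pass. With loop=True: an endless stream, bounded here via islice.
print(list(islice(iterate([1, 2, 3], loop=True), 7)))  # → [1, 2, 3, 1, 2, 3, 1]
```

Because looping yields an infinite stream, remember to bound consumption (e.g. a step budget during training) when `loop=True`.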
Writing Hugging Face datasets
Convert a collection/Stream of Doc objects (or already-converted dicts) into a datasets.IterableDataset.
Examples
1) Convert a Stream of HuggingFace NER examples into Doc objects (reader), process them and create an IterableDataset of dictionaries using the hf_ner writer converter:
```python
import edsnlp

nlp = edsnlp.blank("eds")

stream = edsnlp.data.from_huggingface_dataset(
    "lhoestq/conll2003",
    split="train",
    converter="hf_ner",
)

# Apply a pipeline or other processing
stream = stream.map_pipeline(nlp)

# Export as HF IterableDataset of dicts (no push)
hf_iter = edsnlp.data.to_huggingface_dataset(
    stream,
    converter="hf_ner",
)
```
2) Convert plain text Docs to HF text-format dicts:
```python
edsnlp.data.to_huggingface_dataset(
    docs_stream,
    converter="hf_text",
    execute=True,
    # converter kwargs are validated and forwarded by
    # `get_doc2dict_converter` (e.g. `text_column`, `id_column`).
)
```
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| `data` | Iterable of `Doc` objects (or already-converted dicts). |
| `converter` | Converter name or callable used to transform `Doc` objects into dicts. |
| `execute` | If `False`, return a transformed `Stream` without executing it; if `True`, execute it and return the resulting dataset. |
| `**kwargs` | Extra kwargs forwarded to the converter factory. |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[IterableDataset, Dataset]` | An `IterableDataset` (or `Dataset`) of converted dicts. |
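For reference, the `hf_ner` writer emits one dict per document, with the word and tag columns named by `words_column` and `ner_tags_column`. A stdlib-only sketch of that record shape (the `spans_to_record` helper and its signature are illustrative, not part of the EDS-NLP API):

```python
tag_order = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def spans_to_record(words, spans, tag_order, words_column="tokens", ner_tags_column="ner_tags"):
    """Encode (start, end, label) token spans back into an IOB record dict."""
    tags = ["O"] * len(words)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    # Integer-encode the tags in the order given by tag_order
    return {words_column: words, ner_tags_column: [tag_order.index(t) for t in tags]}

record = spans_to_record(
    ["EU", "rejects", "German", "call"],
    [(0, 1, "ORG"), (2, 3, "MISC")],
    tag_order,
)
print(record)  # → {'tokens': ['EU', 'rejects', 'German', 'call'], 'ner_tags': [3, 0, 7, 0]}
```

Dicts of this shape are what the resulting `IterableDataset` yields, which makes it directly consumable by the usual `datasets` tooling.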