edsnlp.data.base

from_iterable

The IterableReader (or edsnlp.data.from_iterable) reads a list of Python objects (texts, dictionaries, ...) and yields documents by passing them through the converter if one is given; otherwise it yields the objects as is.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.from_iterable([{...}], nlp=nlp, converter=...)
annotated_docs = nlp.pipe(doc_iterator)
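
The "convert if a converter is given, otherwise pass through" behaviour described above can be illustrated with a minimal pure-Python sketch. It does not use edsnlp: `read_objects` and `to_upper_text` are hypothetical names introduced only for illustration, and the real reader may work differently internally.

```python
from typing import Any, Callable, Iterable, Iterator, Optional


def read_objects(
    data: Iterable[Any],
    converter: Optional[Callable[[Any], Any]] = None,
) -> Iterator[Any]:
    """Hypothetical stand-in for the reader: convert each object if asked to."""
    for obj in data:
        # With a converter, each raw object is transformed before being yielded;
        # without one, the object is passed through unchanged.
        yield converter(obj) if converter is not None else obj


def to_upper_text(row: dict) -> str:
    """Toy converter (an assumption): extract and normalise the text field."""
    return row["text"].upper()


rows = [{"text": "first note"}, {"text": "second note"}]

converted = list(read_objects(rows, converter=to_upper_text))
untouched = list(read_objects(rows))
print(converted)
print(untouched)
```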

Generator vs list

edsnlp.data.from_iterable returns a Stream. To iterate over the documents multiple times efficiently, or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.from_iterable([{...}], converter=...))
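
To see why materialising into a list helps, here is a small sketch using a hypothetical `LazyStream` class as a stand-in for a lazy stream. It is an assumption made for illustration, not the library's Stream class: it redoes the conversion on every pass and offers no index access, whereas a list pays the conversion cost once.

```python
from typing import Any, Callable, Iterable, Iterator


class LazyStream:
    """Hypothetical stand-in for a lazy stream: work happens on each iteration."""

    def __init__(self, data: Iterable[Any], converter: Callable[[Any], Any]):
        self.data = data
        self.converter = converter
        self.conversions = 0  # counts how many times the converter ran

    def __iter__(self) -> Iterator[Any]:
        for obj in self.data:
            self.conversions += 1
            yield self.converter(obj)


stream = LazyStream(["a", "b", "c"], converter=str.upper)

# Iterating twice over the lazy object redoes the conversion each time.
first_pass = [doc for doc in stream]
second_pass = [doc for doc in stream]
conversions_when_lazy = stream.conversions

# Materialising once gives cheap re-iteration and index access.
docs = list(stream)
conversions_after_list = stream.conversions
second_doc = docs[1]
replayed = [doc for doc in docs]  # no further conversions happen here

# The stand-in defines no __getitem__, so indexing it directly fails.
try:
    stream[1]
    supports_indexing = True
except TypeError:
    supports_indexing = False
```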

Parameters

PARAMETER DESCRIPTION
data

The data to read

TYPE: Any

converter

Converter, or list of converters, used to turn the JSON rows of the data source into Doc objects.

TYPE: Optional[AsList[Union[str, Callable]]] DEFAULT: None

read_in_worker

In multiprocessing mode, whether to read the data in the worker processes. If True, the data is read in the worker processes, which requires pickling the input iterable; this is mostly useful when the pickled iterable is smaller than the data itself (e.g., an infinite generator of synthetic data). If False, the data is read in the main process and distributed to the workers.

TYPE: bool DEFAULT: False

kwargs

Additional keyword arguments to pass to the converter. These are documented on the Converters page.

DEFAULT: {}

shuffle

Whether to shuffle the data. If "dataset", the whole dataset is shuffled before iteration starts (at the start of every epoch when looping).

TYPE: Literal['dataset', False] DEFAULT: False

seed

The seed to use for shuffling.

TYPE: Optional[int] DEFAULT: None

loop

Whether to loop over the data indefinitely.

TYPE: bool DEFAULT: False
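
The interplay of shuffle, seed and loop can be sketched in plain Python. The helper `iterate_epochs` below is hypothetical and only mirrors the parameter descriptions above (whole-dataset shuffle at the start of each epoch, a seed for reproducibility, optional endless looping); it is not the library's implementation.

```python
import itertools
import random
from typing import Any, Iterator, Literal, Optional, Sequence


def iterate_epochs(
    data: Sequence[Any],
    shuffle: Literal["dataset", False] = False,
    seed: Optional[int] = None,
    loop: bool = False,
) -> Iterator[Any]:
    """Hypothetical sketch: shuffle the whole dataset at the start of each epoch."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    while True:
        items = list(data)
        if shuffle == "dataset":
            rng.shuffle(items)  # whole-dataset shuffle, before iterating
        yield from items
        if not loop:
            break  # single pass unless looping was requested


data = list(range(10))

run_a = list(iterate_epochs(data, shuffle="dataset", seed=42))
run_b = list(iterate_epochs(data, shuffle="dataset", seed=42))

# With loop=True the iterator never ends, so take a bounded slice of two epochs.
endless = iterate_epochs(data, shuffle="dataset", seed=42, loop=True)
two_epochs = list(itertools.islice(endless, 20))
```

In this sketch, two runs with the same seed produce the same order, and every epoch of the looping iterator is a permutation of the full dataset.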

RETURNS DESCRIPTION
Stream
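
The trade-off behind read_in_worker can be illustrated by comparing what would have to be pickled in each mode. This sketch uses the standard library's pickle and a `range` as a stand-in for a compact, lazily evaluated source; both choices are assumptions for illustration, and the serializer actually used by the library is not stated here.

```python
import pickle

# Stand-in for a compact, lazily evaluated source (an assumption for this sketch):
# a range describes many items with just three integers.
lazy_source = range(100_000)

# Reading in the workers: only the iterable itself is pickled.
pickled_source_size = len(pickle.dumps(lazy_source))

# Reading in the main process: the items themselves are pickled and sent.
pickled_items_size = len(pickle.dumps(list(lazy_source)))

print(pickled_source_size, pickled_items_size)

# With the standard pickle module, plain generator objects cannot be pickled,
# so a generator-based source would need a picklable wrapper in this setting.
try:
    pickle.dumps(i for i in range(3))
    generator_is_picklable = True
except TypeError:
    generator_is_picklable = False
```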

to_iterable

edsnlp.data.to_iterable returns an iterator of documents, as converted by the converter. Compared with iterating over a Stream directly, it also applies the converter to the documents, which can lower the data transfer overhead when using multiprocessing.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.to_iterable([doc], converter="omop")
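
The point about data transfer overhead can be made concrete by comparing serialised sizes. The sketch below uses plain dictionaries as a hypothetical stand-in for an annotated document and a toy converter `to_row`; the field names and the use of the standard pickle module are assumptions for illustration, not the library's actual data model.

```python
import pickle

# Hypothetical stand-in for an annotated document: full text, per-token data
# and entity annotations (the field names are assumptions for this sketch).
text = "Patient presents with fever and cough. " * 200
rich_doc = {
    "note_id": 1,
    "text": text,
    "tokens": [{"i": i, "text": tok} for i, tok in enumerate(text.split())],
    "entities": [{"label": "symptom", "start": 22, "end": 27}],
}


def to_row(doc: dict) -> dict:
    """Toy converter: keep only what the consumer needs."""
    return {"note_id": doc["note_id"], "entities": doc["entities"]}


# Converting before the object leaves the worker means less data to serialise.
size_without_converter = len(pickle.dumps(rich_doc))
size_with_converter = len(pickle.dumps(to_row(rich_doc)))
print(size_without_converter, size_with_converter)
```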

Parameters

PARAMETER DESCRIPTION
data

The data to write (either a list of documents or a Stream).

TYPE: Union[Any, Stream]

converter

Converter to use to convert the documents to dictionary objects.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

kwargs

Additional keyword arguments passed to the converter. These are documented on the Converters page.

DEFAULT: {}