edsnlp.core.pipeline

Pipeline

New pipeline to use as a drop-in replacement for spaCy's pipeline. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.

See the documentation for more details.

Parameters

PARAMETER DESCRIPTION
lang

Language code

TYPE: str

create_tokenizer

Function that creates a tokenizer for the pipeline

TYPE: Callable[[Pipeline], Optional[Tokenizer]] DEFAULT: None

vocab

Whether to create a new vocab or use an existing one

TYPE: Union[bool, Vocab] DEFAULT: True

batch_size

Batch size to use in the .pipe() method

TYPE: Optional[int] DEFAULT: 128

vocab_config

Configuration for the vocab

TYPE: Type[BaseDefaults] DEFAULT: None

meta

Meta information about the pipeline

TYPE: Dict[str, Any] DEFAULT: None

disabled property

The names of the disabled components

cfg: Config property

Returns the config of the pipeline, including the config of all components. Adapted from spaCy to allow references between components.

get_pipe

Get a component by its name.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get.

TYPE: str

RETURNS DESCRIPTION
Pipe

has_pipe

Check if a component exists in the pipeline.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to check.

TYPE: str

RETURNS DESCRIPTION
bool

create_pipe

Create a component from a factory name.

Parameters

PARAMETER DESCRIPTION
factory

The name of the factory to use

TYPE: str

name

The name of the component

TYPE: str

config

The config to pass to the factory

TYPE: Dict[str, Any] DEFAULT: None

RETURNS DESCRIPTION
Pipe

add_pipe

Add a component to the pipeline.

Parameters

PARAMETER DESCRIPTION
factory

The name of the component to add or the component itself

TYPE: Union[str, Pipe]

name

The name of the component. If not provided, the component's own name (its .name attribute) is used if it has one; otherwise the factory name is used.

TYPE: Optional[str] DEFAULT: None

first

Whether to add the component to the beginning of the pipeline. This argument is mutually exclusive with before and after.

TYPE: bool DEFAULT: False

before

The name of the component to add the new component before. This argument is mutually exclusive with after and first.

TYPE: Optional[str] DEFAULT: None

after

The name of the component to add the new component after. This argument is mutually exclusive with before and first.

TYPE: Optional[str] DEFAULT: None

config

The arguments to pass to the component factory.

Note that instead of replacing arguments with the same keys, the config will be merged with the default config of the component. This means that you can override specific nested arguments without having to specify the entire config.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

RETURNS DESCRIPTION
Pipe

The component that was added to the pipeline.

get_pipe_meta

Get the meta information for a component.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get the meta for.

TYPE: str

RETURNS DESCRIPTION
Dict[str, Any]

make_doc

Create a Doc from text.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from.

TYPE: str

RETURNS DESCRIPTION
Doc

__call__

Apply each component successively on a document.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from, or a Doc.

TYPE: Union[str, Doc]

RETURNS DESCRIPTION
Doc

pipe

Process a stream of documents by applying each component successively on batches of documents.

Parameters

PARAMETER DESCRIPTION
inputs

The inputs to create the Docs from, or Docs directly.

TYPE: Any

batch_size

The batch size to use. If not provided, the batch size of the pipeline object will be used.

TYPE: Optional[int] DEFAULT: None

n_process

The number of parallel workers to use. If 0, the operations will be executed sequentially. Deprecated: use the ".set(num_cpu_workers=n_process)" method on the returned lazy collection instead.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
LazyCollection

cache

Enable caching for all (trainable) components in the pipeline

torch_components

Yields components that are PyTorch modules.

Parameters

PARAMETER DESCRIPTION
disable

The names of disabled components, which will be skipped.

TYPE: Container[str] DEFAULT: ()

RETURNS DESCRIPTION
Iterable[Tuple[str, TorchComponent]]

post_init

Completes the initialization of the pipeline by calling the post_init method of all components that have one. This is useful for components that need to see some data to build their vocabulary, for instance.

Parameters

PARAMETER DESCRIPTION
data

The documents to use for initialization. Each component will not necessarily see all the data.

TYPE: Iterable[Doc]

exclude

Components to exclude from post initialization on data

TYPE: Optional[Set] DEFAULT: None

from_config classmethod

Create a pipeline from a config object

Parameters

PARAMETER DESCRIPTION
config

The config to use

TYPE: Dict[str, Any] DEFAULT: {}

vocab

The spaCy vocab to use. If True, a new vocab will be created

TYPE: Union[Vocab, bool] DEFAULT: True

disable

Components to disable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

enable

Components to enable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

exclude

Components to exclude

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

meta

Metadata to add to the pipeline

TYPE: Dict[str, Any] DEFAULT: FrozenDict()

RETURNS DESCRIPTION
Pipeline

__get_validators__ classmethod

Pydantic validators generator

validate classmethod

Pydantic validator, used in the validate_arguments decorated functions

preprocess

Runs the preprocessing methods of each component in the pipeline on a document and returns a dictionary containing the results, with the component names as keys.

Parameters

PARAMETER DESCRIPTION
doc

The document to preprocess

TYPE: Doc

supervision

Whether to include supervision information in the preprocessing

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Dict[str, Any]

preprocess_many

Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.

Parameters

PARAMETER DESCRIPTION
docs

TYPE: Iterable[Doc]

compress

Whether to deduplicate identical preprocessing outputs when multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets such as pyarrow tables, which do not store referential equality information.

DEFAULT: True

supervision

Whether to include supervision information in the preprocessing

DEFAULT: True

RETURNS DESCRIPTION
Iterable[OutputT]

collate

Collates a batch of preprocessed samples into a single (possibly nested) dictionary of tensors by calling the collate method of each component.

Parameters

PARAMETER DESCRIPTION
batch

The batch of preprocessed samples

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Dict[str, Any]

The collated batch

parameters

Returns an iterator over the PyTorch parameters of the components in the pipeline

named_parameters

Returns an iterator over the named PyTorch parameters of the components in the pipeline

to

Moves the pipeline to a given device

train

Enables training mode on PyTorch modules

Parameters

PARAMETER DESCRIPTION
mode

Whether to enable training or not

DEFAULT: True

to_disk

Save the pipeline to a directory.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to save the pipeline to. Every component will be saved to a separate subdirectory of this directory, except for tensors, which will be saved to shared files depending on the references between components.

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the saving process.

TYPE: Optional[Set[str]] DEFAULT: None

from_disk

Load the pipeline from a directory. Components will be updated in-place.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to load the pipeline from

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the loading process.

TYPE: Optional[Union[str, Sequence[str]]] DEFAULT: None

device

Device to use when loading the tensors

TYPE: Optional[Union[str, device]] DEFAULT: 'cpu'

select_pipes

Temporarily disable and enable components in the pipeline.

Parameters

PARAMETER DESCRIPTION
disable

The name of the component to disable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

enable

The name of the component to enable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

blank

Loads an empty EDS-NLP Pipeline, similarly to spacy.blank. In addition to standard components, this pipeline supports EDS-NLP trainable torch components.

Examples

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.covid")

Parameters

PARAMETER DESCRIPTION
lang

Language ID, e.g. "en", "fr", "eds", etc.

TYPE: str

config

The config to use for the pipeline

TYPE: Union[Dict[str, Any], Config] DEFAULT: {}

RETURNS DESCRIPTION
Pipeline

The new empty pipeline instance.

load

Load a pipeline from a config file or a directory.

Examples

import edsnlp

nlp = edsnlp.load(
    "path/to/config.cfg",
    overrides={"components": {"my_component": {"arg": "value"}}},
)

Parameters

PARAMETER DESCRIPTION
model

The config to use for the pipeline, or the path to a config file or a directory.

TYPE: Union[Path, str, Config]

overrides

Overrides to apply to the config when loading the pipeline. These are the same parameters as the ones used when initializing the pipeline.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

exclude

The names of the components, or attributes to exclude from the loading process. ⚠ The exclude argument will be mutated in place.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

RETURNS DESCRIPTION
Pipeline