edsnlp.core.pipeline

Pipeline

New pipeline to use as a drop-in replacement for spaCy's pipeline. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.

See the documentation for more details.

Parameters

PARAMETER DESCRIPTION
lang

Language code

TYPE: str

create_tokenizer

Function that creates a tokenizer for the pipeline

TYPE: Callable[[Pipeline], Optional[Tokenizer]] DEFAULT: None

vocab

Whether to create a new vocab or use an existing one

TYPE: Union[bool, Vocab] DEFAULT: True

batch_size

Batch size to use in the .pipe() method

TYPE: Optional[int] DEFAULT: 128

vocab_config

Configuration for the vocab

TYPE: Type[BaseDefaults] DEFAULT: None

meta

Meta information about the pipeline

TYPE: Dict[str, Any] DEFAULT: None

disabled property

The names of the disabled components

cfg: Config property

Returns the config of the pipeline, including the config of all components. Adapted from spaCy to allow references between components.

get_pipe

Get a component by its name.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get.

TYPE: str

RETURNS DESCRIPTION
Pipe

has_pipe

Check if a component exists in the pipeline.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to check.

TYPE: str

RETURNS DESCRIPTION
bool

create_pipe

Create a component from a factory name.

Parameters

PARAMETER DESCRIPTION
factory

The name of the factory to use

TYPE: str

name

The name of the component

TYPE: str

config

The config to pass to the factory

TYPE: Dict[str, Any] DEFAULT: None

RETURNS DESCRIPTION
Pipe

add_pipe

Add a component to the pipeline.

Parameters

PARAMETER DESCRIPTION
factory

The name of the component to add or the component itself

TYPE: Union[str, Pipe]

name

The name of the component. If not provided, the component's own name (its .name attribute) is used if it has one; otherwise the factory name is used.

TYPE: Optional[str] DEFAULT: None

first

Whether to add the component to the beginning of the pipeline. This argument is mutually exclusive with before and after.

TYPE: bool DEFAULT: False

before

The name of the component to add the new component before. This argument is mutually exclusive with after and first.

TYPE: Optional[str] DEFAULT: None

after

The name of the component to add the new component after. This argument is mutually exclusive with before and first.

TYPE: Optional[str] DEFAULT: None

config

The arguments to pass to the component factory.

Note that instead of replacing arguments with the same keys, the config will be merged with the default config of the component. This means that you can override specific nested arguments without having to specify the entire config.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

RETURNS DESCRIPTION
Pipe

The component that was added to the pipeline.

get_pipe_meta

Get the meta information for a component.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get the meta for.

TYPE: str

RETURNS DESCRIPTION
Dict[str, Any]

make_doc

Create a Doc from text.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from.

TYPE: str

RETURNS DESCRIPTION
Doc

__call__

Apply each component successively on a document.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from, or a Doc.

TYPE: Union[str, Doc]

RETURNS DESCRIPTION
Doc

pipe

Process a stream of documents by applying each component successively on batches of documents.

Parameters

PARAMETER DESCRIPTION
inputs

The inputs to create the Docs from, or Docs directly.

TYPE: Any

batch_size

The batch size to use. If not provided, the batch size of the pipeline object will be used.

TYPE: Optional[int] DEFAULT: None

n_process

The number of parallel workers to use. If 0, the operations will be executed sequentially. Deprecated: use the ".set(num_cpu_workers=n_process)" method on the returned lazy collection instead.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
LazyCollection

cache

Enable caching for all (trainable) components in the pipeline

torch_components

Yields components that are PyTorch modules.

Parameters

PARAMETER DESCRIPTION
disable

The names of disabled components, which will be skipped.

TYPE: Container[str] DEFAULT: ()

RETURNS DESCRIPTION
Iterable[Tuple[str, TorchComponent]]

post_init

Completes the initialization of the pipeline by calling the post_init method of all components that have one. This is useful for components that need to see some data to build their vocabulary, for instance.

Parameters

PARAMETER DESCRIPTION
data

The documents to use for initialization. Each component will not necessarily see all the data.

TYPE: Iterable[Doc]

exclude

Components to exclude from post initialization on data

TYPE: Optional[Set] DEFAULT: None

from_config classmethod

Create a pipeline from a config object

Parameters

PARAMETER DESCRIPTION
config

The config to use

TYPE: Dict[str, Any] DEFAULT: {}

vocab

The spaCy vocab to use. If True, a new vocab will be created

TYPE: Union[Vocab, bool] DEFAULT: True

disable

Components to disable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

enable

Components to enable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

exclude

Components to exclude

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

meta

Metadata to add to the pipeline

TYPE: Dict[str, Any] DEFAULT: FrozenDict()

RETURNS DESCRIPTION
Pipeline

__get_validators__ classmethod

Pydantic validators generator

validate classmethod

Pydantic validator, used in the validate_arguments decorated functions

preprocess

Runs the preprocessing methods of each component in the pipeline on a document and returns a dictionary containing the results, with the component names as keys.

Parameters

PARAMETER DESCRIPTION
doc

The document to preprocess

TYPE: Doc

supervision

Whether to include supervision information in the preprocessing

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Dict[str, Any]

preprocess_many

Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.

Parameters

PARAMETER DESCRIPTION
docs

TYPE: Iterable[Doc]

compress

Whether to deduplicate identical preprocessing outputs when multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets such as pyarrow tables, which do not store referential equality information.

DEFAULT: True

supervision

Whether to include supervision information in the preprocessing

DEFAULT: True

RETURNS DESCRIPTION
Iterable[OutputT]

collate

Collates a batch of preprocessed samples into a single (possibly nested) dictionary of tensors by calling the collate method of each component.

Parameters

PARAMETER DESCRIPTION
batch

The batch of preprocessed samples

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
Dict[str, Any]

The collated batch

parameters

Returns an iterator over the PyTorch parameters of the components in the pipeline

named_parameters

Returns an iterator over the named PyTorch parameters of the components in the pipeline

to

Moves the pipeline to a given device

train

Enables training mode on PyTorch modules

Parameters

PARAMETER DESCRIPTION
mode

Whether to enable training or not

DEFAULT: True

to_disk

Save the pipeline to a directory.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to save the pipeline to. Every component will be saved to a separate subdirectory of this directory, except for tensors, which will be saved to shared files depending on the references between components.

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the saving process.

TYPE: Optional[Set[str]] DEFAULT: None

from_disk

Load the pipeline from a directory. Components will be updated in-place.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to load the pipeline from

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the loading process.

TYPE: Optional[Union[str, Sequence[str]]] DEFAULT: None

device

Device to use when loading the tensors

TYPE: Optional[Union[str, device]] DEFAULT: 'cpu'

select_pipes

Temporarily disable and enable components in the pipeline.

Parameters

PARAMETER DESCRIPTION
disable

The name of the component to disable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

enable

The name of the component to enable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

blank

Loads an empty EDS-NLP Pipeline, similarly to spacy.blank. In addition to standard components, this pipeline supports EDS-NLP trainable torch components.

Examples

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.covid")

Parameters

PARAMETER DESCRIPTION
lang

Language ID, e.g. "en", "fr", "eds", etc.

TYPE: str

config

The config to use for the pipeline

TYPE: Union[Dict[str, Any], Config] DEFAULT: {}

RETURNS DESCRIPTION
Pipeline

The new empty pipeline instance.

load

Load a pipeline from a config file or a directory.

Examples

import edsnlp

nlp = edsnlp.load(
    "path/to/config.cfg",
    overrides={"components": {"my_component": {"arg": "value"}}},
)

Parameters

PARAMETER DESCRIPTION
model

The config to use for the pipeline, or the path to a config file or a directory.

TYPE: Union[Path, str, Config]

overrides

Overrides to apply to the config when loading the pipeline. These are the same parameters as the ones used when initializing the pipeline.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

exclude

The names of the components, or attributes to exclude from the loading process. ⚠ The exclude argument will be mutated in place.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

RETURNS DESCRIPTION
Pipeline