edsnlp.core.pipeline
Pipeline
New pipeline to use as a drop-in replacement for spaCy's pipeline. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.
See the documentation for more details.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| lang | Language code |
| create_tokenizer | Function that creates a tokenizer for the pipeline |
| vocab | Whether to create a new vocab or use an existing one |
| batch_size | Batch size to use by default, e.g. in the pipe method when none is provided |
| vocab_config | Configuration for the vocab |
| meta | Meta information about the pipeline |
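For instance, a pipeline can be instantiated directly; a minimal sketch, assuming the class is exposed at the top level of the edsnlp package (most users will rather use edsnlp.blank, documented below):

```python
import edsnlp

# Build a pipeline for the "eds" language with an explicit default batch size
nlp = edsnlp.Pipeline(lang="eds", batch_size=32)
```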
disabled
property
The names of the disabled components
cfg: Config
property
Returns the config of the pipeline, including the config of all components. Updated from spaCy to allow references between components.
get_pipe
Get a component by its name.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| name | The name of the component to get. |

| RETURNS | DESCRIPTION |
|---|---|
| Pipe | The component with the given name. |
has_pipe
Check if a component exists in the pipeline.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| name | The name of the component to check. |

| RETURNS | DESCRIPTION |
|---|---|
| bool | Whether a component with this name exists in the pipeline. |
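A short usage sketch of these lookup helpers (the eds.sentences factory is used purely as an illustration):

```python
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")

# Check that a component exists before retrieving it by name
if nlp.has_pipe("eds.sentences"):
    sentencizer = nlp.get_pipe("eds.sentences")
```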
create_pipe
Create a component from a factory name.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| factory | The name of the factory to use |
| name | The name of the component |
| config | The config to pass to the factory |

| RETURNS | DESCRIPTION |
|---|---|
| Pipe | The created component. |
add_pipe
Add a component to the pipeline.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| factory | The name of the component to add, or the component itself |
| name | The name of the component. If not provided, the name of the component will be used if it has one (.name), otherwise the factory name will be used. |
| first | Whether to add the component to the beginning of the pipeline. This argument is mutually exclusive with before and after. |
| before | The name of the component to add the new component before. This argument is mutually exclusive with first and after. |
| after | The name of the component to add the new component after. This argument is mutually exclusive with first and before. |
| config | The arguments to pass to the component factory. Note that instead of replacing arguments with the same keys, the config will be merged with the default config of the component. This means that you can override specific nested arguments without having to specify the entire config. |

| RETURNS | DESCRIPTION |
|---|---|
| Pipe | The component that was added to the pipeline. |
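An illustrative sketch of the different ways to add a component; the factory names and the config keys below are examples rather than a guaranteed API:

```python
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer")

# Position the new component relative to an existing one
nlp.add_pipe("eds.sentences", after="eds.normalizer")

# Override a nested argument: the dict is merged into the component's default config
nlp.add_pipe(
    "eds.matcher",
    name="drugs",
    config={"terms": {"paracetamol": ["paracetamol", "doliprane"]}},
)
```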
get_pipe_meta
Get the meta information for a component.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| name | The name of the component to get the meta for. |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | The meta information of the component. |
make_doc
Create a Doc from text.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| text | The text to create the Doc from. |

| RETURNS | DESCRIPTION |
|---|---|
| Doc | The created Doc. |
__call__
Apply each component successively on a document.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| text | The text to create the Doc from, or a Doc. |

| RETURNS | DESCRIPTION |
|---|---|
| Doc | The processed Doc. |
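For example (the eds.covid component and the sample sentence are only illustrative):

```python
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.covid")

# Calling the pipeline applies each component in order to the document
doc = nlp("Le patient est atteint de covid.")
print([(ent.text, ent.label_) for ent in doc.ents])
```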
pipe
Process a stream of documents by applying each component successively on batches of documents.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| inputs | The inputs to create the Docs from, or Docs directly. |
| batch_size | The batch size to use. If not provided, the batch size of the pipeline object will be used. |
| n_process | Deprecated. Use the .set(num_cpu_workers=n_process) method on the returned lazy collection instead. The number of parallel workers to use. If 0, the operations will be executed sequentially. |

| RETURNS | DESCRIPTION |
|---|---|
| LazyCollection | A lazy collection of the processed documents. |
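A sketch of lazy, batched processing, assuming nlp is the pipeline built above; the .set(num_cpu_workers=...) call follows the deprecation note, and texts stands for any iterable of strings:

```python
texts = ["Patient admis pour covid.", "Pas d'antécédent notable."]

docs = nlp.pipe(texts, batch_size=2)
# Parallelize through the lazy collection rather than the deprecated n_process argument
docs = docs.set(num_cpu_workers=2)

for doc in docs:  # documents are actually processed here, lazily
    print(len(doc.ents))
```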
cache
Enable caching for all (trainable) components in the pipeline
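A hedged sketch, assuming cache is used as a context manager so that subcomponents shared between components (e.g. a common embedding) are only computed once per document:

```python
# nlp: a pipeline whose trainable components share a subcomponent (assumption)
with nlp.cache():
    doc = nlp("Texte traité avec mise en cache des sous-composants partagés.")
```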
torch_components
Yields components that are PyTorch modules.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| disable | The names of disabled components, which will be skipped. |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable[Tuple[str, TorchComponent]] | Pairs of component names and trainable components. |
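For instance, to report the number of trainable parameters of each PyTorch component (assuming nlp contains trainable components):

```python
for name, component in nlp.torch_components():
    n_params = sum(p.numel() for p in component.parameters())
    print(f"{name}: {n_params} parameters")
```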
post_init
Completes the initialization of the pipeline by calling the post_init method of all components that have one. This is useful for components that need to see some data to build their vocabulary, for instance.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| data | The documents to use for initialization. Each component will not necessarily see all the data. |
| exclude | Components to exclude from post initialization on data |
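A minimal sketch, assuming train_docs is an iterable of annotated Doc objects from which components can build their label vocabularies:

```python
# train_docs: Iterable[Doc] with gold annotations (assumption for this sketch)
nlp.post_init(train_docs)
```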
from_config
classmethod
Create a pipeline from a config object
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| config | The config to use |
| vocab | The spaCy vocab to use. If True, a new vocab will be created |
| disable | Components to disable |
| enable | Components to enable |
| exclude | Components to exclude |
| meta | Metadata to add to the pipeline |

| RETURNS | DESCRIPTION |
|---|---|
| Pipeline | The created pipeline. |
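A hedged sketch, assuming the config is stored in a confit-style .cfg file; the path and the exact section layout are assumptions, and edsnlp.load (documented below) covers the common cases more directly:

```python
from confit import Config

import edsnlp

config = Config.from_disk("path/to/config.cfg")  # hypothetical path
# Depending on how the file is structured, the relevant section may be config["nlp"]
nlp = edsnlp.Pipeline.from_config(config)
```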
__get_validators__
classmethod
Pydantic validators generator
validate
classmethod
Pydantic validator, used in functions decorated with validate_arguments
preprocess
Runs the preprocessing methods of each component in the pipeline on a document and returns a dictionary containing the results, with the component names as keys.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| doc | The document to preprocess |
| supervision | Whether to include supervision information in the preprocessing |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | The preprocessing results, with the component names as keys. |
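For example (a sketch; the returned keys depend on the trainable components present in nlp):

```python
doc = nlp.make_doc("Patient admis pour suspicion de covid.")
features = nlp.preprocess(doc, supervision=False)
print(list(features.keys()))  # one entry per component
```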
preprocess_many
Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| docs | The documents to preprocess |
| compress | Whether to deduplicate identical preprocessing outputs when multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets such as pyarrow tables that do not store referential equality information. |
| supervision | Whether to include supervision information in the preprocessing |

| RETURNS | DESCRIPTION |
|---|---|
| Iterable[OutputT] | The preprocessing results for each document. |
collate
Collates a batch of preprocessed samples into a single (maybe nested) dictionary of tensors by calling the collate method of each component.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| batch | The batch of preprocessed samples |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | The collated batch |
parameters
Returns an iterator over the PyTorch parameters of the components in the pipeline
named_parameters
Returns an iterator over the named PyTorch parameters (name, parameter pairs) of the components in the pipeline
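For instance, the parameters can be handed directly to a standard PyTorch optimizer (the learning rate is arbitrary):

```python
import torch

optimizer = torch.optim.AdamW(nlp.parameters(), lr=5e-5)
```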
to
Moves the pipeline to a given device
train
Enables training mode on PyTorch modules
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| mode | Whether to enable training or not |
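A short sketch of a typical setup around a training loop (the device selection is only an example):

```python
import torch

nlp.to("cuda" if torch.cuda.is_available() else "cpu")
nlp.train(True)   # training mode for all PyTorch components
# ... training loop ...
nlp.train(False)  # back to evaluation mode
```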
to_disk
Save the pipeline to a directory.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| path | The path to the directory to save the pipeline to. Every component will be saved to separate subdirectories of this directory, except for tensors, which will be saved to shared files depending on the references between the components. |
| exclude | The names of the components, or attributes, to exclude from the saving process. |
from_disk
Load the pipeline from a directory. Components will be updated in-place.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| path | The path to the directory to load the pipeline from |
| exclude | The names of the components, or attributes, to exclude from the loading process. |
| device | Device to use when loading the tensors |
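A save / reload sketch (the directory path is arbitrary):

```python
# Save the pipeline, one subdirectory per component
nlp.to_disk("artifacts/model")

# Later, reload the saved components and tensors in-place
nlp.from_disk("artifacts/model", device="cpu")
```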
select_pipes
Temporarily disable and enable components in the pipeline.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| disable | The name of the component to disable, or a list of names. |
| enable | The name of the component to enable, or a list of names. |
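A hedged sketch, assuming select_pipes is used as a context manager as in spaCy (the component name is only an example):

```python
with nlp.select_pipes(disable=["eds.covid"]):
    doc = nlp("Document traité sans le composant covid.")
```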
blank
Loads an empty EDS-NLP Pipeline, similarly to spacy.blank. In addition to standard components, this pipeline supports EDS-NLP trainable torch components.
Examples
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.covid")
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| lang | Language ID, e.g. "en", "fr", "eds", etc. |
| config | The config to use for the pipeline |

| RETURNS | DESCRIPTION |
|---|---|
| Pipeline | The new empty pipeline instance. |
load
Load a pipeline from a config file or a directory.
Examples
import edsnlp

nlp = edsnlp.load(
    "path/to/config.cfg",
    overrides={"components": {"my_component": {"arg": "value"}}},
)
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| model | The config to use for the pipeline, or the path to a config file or a directory. |
| overrides | Overrides to apply to the config when loading the pipeline. These are the same parameters as the ones used when initializing the pipeline. |
| exclude | The names of the components, or attributes, to exclude from the loading process. |

| RETURNS | DESCRIPTION |
|---|---|
| Pipeline | The loaded pipeline. |