Inference
Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from disk, you can use it to make predictions on documents. This is referred to as inference. This page answers the following questions:

- How do we leverage computational resources to run a model on many documents?
- How do we connect to various data sources to retrieve documents?
Inference on a single document
In EDS-NLP, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
- a text string
- or a Doc object
nlp = ...
text = "... my text ..."
doc = nlp(text)
If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the multiprocessing accelerator description below.
nlp.to("cuda") # same semantics as pytorch
doc = nlp(text)
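If you are unsure whether a GPU will be available at runtime, you can guard the device move. This is a minimal sketch; it only assumes that nlp.to follows the PyTorch device semantics shown above.

import torch

# move the model to the GPU only when one is actually available
if torch.cuda.is_available():
    nlp.to("cuda")

doc = nlp(text)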
Inference on multiple documents
When processing multiple documents, it is usually more efficient to use the nlp.pipe(...) method, especially when using deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-NLP comes with various "accelerators" to speed up inference (see the Accelerators section below for more details). By default, the .pipe() method uses the simple accelerator, but you can switch to a different one by passing the accelerator argument.
nlp = ...
docs = nlp.pipe(
    [text1, text2, ...],
    batch_size=16,  # optional, defaults to the one defined in the pipeline
    accelerator=my_accelerator,
)
The pipe method supports the following arguments:
Parameters

PARAMETER | DESCRIPTION
---|---
inputs | The inputs to create the Docs from, or Docs directly.
batch_size | The batch size to use. If not provided, the batch size of the pipeline object will be used.
accelerator | The accelerator to use for processing the documents. If not provided, the default accelerator will be used.
to_doc | The function to use to convert the inputs to Doc objects.
from_doc | The function to use to convert the Doc objects to outputs. By default, the Doc objects will be returned directly.
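As an illustration, from_doc can be used to return a lighter output than full Doc objects. The sketch below is hedged: the lambda converter is hypothetical, and only the from_doc parameter itself comes from the table above.

nlp = ...

docs_ents = list(
    nlp.pipe(
        [text1, text2, ...],
        # hypothetical converter: keep only the entities instead of the full Doc
        from_doc=lambda doc: [(e.text, e.label_) for e in doc.ents],
    )
)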
Accelerators
Simple accelerator
This is the simplest accelerator: it batches the documents and processes each batch on the main process (the one calling .pipe()).
Examples
docs = list(nlp.pipe([content1, content2, ...]))
or, if you want to override the batch size defined in the model
docs = list(nlp.pipe([content1, content2, ...], batch_size=8))
which is equivalent to passing a confit dict
docs = list(
    nlp.pipe(
        [text1, text2, ...],
        accelerator={
            "@accelerator": "simple",
            "batch_size": 8,
        },
    )
)
or the instantiated accelerator directly
from edsnlp.accelerators.simple import SimpleAccelerator
accelerator = SimpleAccelerator(batch_size=8)
docs = list(nlp.pipe([content1, content2, ...], accelerator=accelerator))
If you have a GPU, make sure to move the model to the appropriate device before calling .pipe(). If you have multiple GPUs, use the multiprocessing accelerator instead.
nlp.to("cuda")
docs = list(nlp.pipe([content1, content2, ...]))
Parameters

PARAMETER | DESCRIPTION
---|---
batch_size | The number of documents to process in each batch.
Multiprocessing (GPU) accelerator
If you have multiple CPU cores, and optionally multiple GPUs, we provide a multiprocessing accelerator that runs inference across multiple processes.
This accelerator dispatches the batches between multiple workers (data-parallelism), and distributes the computation of a given batch on one or two workers (model-parallelism). This is done by creating two types of workers:

- a CPUWorker, which handles the non-deep-learning components and the preprocessing, collating and postprocessing of deep-learning components
- a GPUWorker, which handles the forward call of the deep-learning components

The advantage of dedicating a worker to the deep-learning components is that it allows multiple batches to be prepared in parallel by several CPUWorkers, and ensures that the GPUWorkers never wait for a batch to be ready.
The overall architecture is described in the following figure, for 3 CPU workers and 2 GPU workers.

Here is how a small pipeline with rule-based components and deep-learning components is distributed between the workers:

Examples
docs = list(
    nlp.pipe(
        [text1, text2, ...],
        accelerator={
            "@accelerator": "multiprocessing",
            "num_cpu_workers": 3,
            "num_gpu_workers": 2,
            "batch_size": 8,
        },
    )
)
Parameters

PARAMETER | DESCRIPTION
---|---
batch_size | Number of documents to process at a time in a CPU/GPU worker.
num_cpu_workers | Number of CPU workers. A CPU worker handles the non-deep-learning components and the preprocessing, collating and postprocessing of deep-learning components.
num_gpu_workers | Number of GPU workers. A GPU worker handles the forward call of the deep-learning components.
gpu_pipe_names | List of pipe names to accelerate on a GPUWorker; defaults to all pipes that inherit from TorchComponent.
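For example, GPU inference can be restricted to specific components with gpu_pipe_names. The sketch below reuses the confit dict shown above; the "ner" pipe name is hypothetical and stands for whichever deep-learning component your pipeline actually contains.

docs = list(
    nlp.pipe(
        [text1, text2, ...],
        accelerator={
            "@accelerator": "multiprocessing",
            "num_cpu_workers": 3,
            "num_gpu_workers": 1,
            "batch_size": 8,
            # hypothetical pipe name: only this component runs on the GPU workers
            "gpu_pipe_names": ["ner"],
        },
    )
)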