
Inference

Once you have obtained a pipeline, either by composing rule-based components, training a model or loading one from disk, you can use it to make predictions on documents. This is referred to as inference. This page answers the following questions:

How do we leverage computational resources to run a model on many documents?

How do we connect to various data sources to retrieve documents?

Be sure to check out the Processing multiple texts tutorial for a practical example of how to use EDS-NLP to process large datasets.

Inference on a single document

In EDS-NLP, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

  • a text string
  • or a Doc object
nlp = ...  # a pipeline obtained as described above
text = "... my text ..."
doc = nlp(text)

If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline.

nlp.to("cuda")  # same semantics as pytorch
doc = nlp(text)

To leverage multiple GPUs when processing multiple documents, refer to the multiprocessing backend description below.

Inference on multiple documents

When processing multiple documents, we can optimize inference by parallelizing the computation on a single core, on multiple cores and GPUs, or even on multiple machines.

Lazy collection

These optimizations are enabled by performing lazy inference: the operations (e.g., reading a document, converting it to a Doc, running the different pipes of a model or writing the result somewhere) are not executed immediately but are instead scheduled in a LazyCollection object. The collection can then be executed by calling its execute method, iterating over it or calling a writing method (e.g., to_pandas). In fact, data connectors like edsnlp.data.read_json, as well as the nlp.pipe method, return a lazy collection.

A lazy collection contains:

  • a reader: the source of the data (e.g., a file, a database, a list of strings, etc.)
  • the list of operations to perform, stored in a pipeline attribute; each operation records a name (if any), a function or pipe, keyword arguments and a context
  • an optional writer: the destination of the data (e.g., a file, a database, a list of strings, etc.)
  • the execution config, containing the backend to use and its configuration such as the number of workers, the batch size, etc.

All methods (.map, .map_batches, .map_gpu, .map_pipeline, .set_processing) of the lazy collection are chainable, meaning that they return a new lazy collection object (no in-place modification).
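
A small sketch of this behaviour (the path and converter below are placeholders, like in the fuller example that follows):

import edsnlp

nlp = edsnlp.load("path/to/model")
data = edsnlp.data.read_json("path/to/json_folder", converter="...")
annotated = data.map_pipeline(nlp)  # returns a new lazy collection
# `data` itself is left unchanged and could be reused in another chain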

For instance, the following code will load a model, read a folder of JSON files, apply the model to each document and write the result in a Parquet folder, using 4 CPUs and 2 GPUs.

import edsnlp

# Load or create a model
nlp = edsnlp.load("path/to/model")

# Read some data (this is lazy, no data will be read until the end of this snippet)
data = edsnlp.data.read_json("path/to/json_folder", converter="...")

# Apply each pipe of the model to our documents
data = data.map_pipeline(nlp)
# or equivalently: data = nlp.pipe(data)

# Configure the execution
data = data.set_processing(
    # 4 CPUs to parallelize rule-based pipes, IO and preprocessing
    num_cpu_workers=4,
    # 2 GPUs to accelerate deep-learning pipes
    num_gpu_workers=2,

    # Below are further options to finetune the inference throughput:
    # Read chunks of 1024 documents before splitting them into batches
    chunk_size=1024,
    # Sort the documents by length before splitting them into batches
    sort_chunks=True,
    # Split batches such that each contains at most 100 000 padded words
    # (padded_words = max doc size in batch * batch size)
    batch_size=100_000,
    batch_by="padded_words",
)

# Write the result, this will execute the lazy collection
data.write_parquet("path/to/output_folder", converter="...", write_in_worker=True)

Applying operations to a lazy collection

To apply an operation to a lazy collection, you can use the .map method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.
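
For instance, a minimal sketch (the normalize_text helper is purely illustrative, and edsnlp.data.from_iterable is assumed here to build a lazy collection from an in-memory list of strings):

import edsnlp

def normalize_text(text, lowercase=True):
    # Illustrative element-wise operation: strip and optionally lowercase the raw text
    text = text.strip()
    return text.lower() if lowercase else text

data = edsnlp.data.from_iterable(["  Some TEXT  ", " Other text "])
data = data.map(normalize_text, kwargs={"lowercase": True})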

To apply an operation to a lazy collection in batches, you can use the .map_batches method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each batch of the collection (as a list of elements), and should return a list of results, which will be concatenated at the end.
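
Similarly, a sketch of a batched operation, continuing the snippet above (the drop_duplicates helper is illustrative):

def drop_duplicates(texts):
    # Receives a batch as a list of elements and returns a list of results,
    # which will be concatenated with the results of the other batches
    return list(dict.fromkeys(texts))

data = data.map_batches(drop_duplicates)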

To apply a model, you can use the .map_pipeline method. It takes a model as input and will add every pipe of the model to the scheduled operations.

To run a specific function on a GPU (for advanced users; otherwise map_pipeline should accommodate most use cases), you can use the .map_gpu method. It takes two or three callables as input: the first one (prepare_batches) takes a batch of inputs and should return some tensors that will be sent to the GPU and passed to the second callable (forward), which applies the deep learning ops and returns the results. The third, optional callable (postprocess) gets the batch of inputs as well as the forward results and should return the final results (for instance, the input documents annotated with the predictions).
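
Below is a minimal sketch following the description above; the helper functions, the toy scoring logic and the toy_score extension are all hypothetical and not part of EDS-NLP:

import edsnlp
import torch
from spacy.tokens import Doc

# Hypothetical custom attribute used to store the prediction
if not Doc.has_extension("toy_score"):
    Doc.set_extension("toy_score", default=None)

def prepare_batches(docs):
    # Turn a batch of documents into tensors; the lazy collection will
    # send the returned tensors to the GPU before calling `forward`
    return {"lengths": torch.tensor([[float(len(doc))] for doc in docs])}

def forward(batch):
    # Deep-learning ops, executed on the GPU
    with torch.no_grad():
        return {"scores": torch.sigmoid(batch["lengths"] / 100.0)}

def postprocess(docs, outputs):
    # Attach the predictions back onto the input documents
    for doc, score in zip(docs, outputs["scores"].cpu().tolist()):
        doc._.toy_score = score[0]
    return docs

nlp = edsnlp.load("path/to/model")
data = edsnlp.data.read_json("path/to/json_folder", converter="...")
data = data.map_pipeline(nlp)
data = data.map_gpu(prepare_batches, forward, postprocess)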

In each case, the operations will not be executed immediately but will be scheduled to run when iterating over the collection, or when calling the .execute, .to_* or .write_* methods.
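
For instance, given the data collection built in the snippets above, any of the following lines triggers the execution (converters are placeholders, as before):

docs = list(data)                     # iterating over the collection
df = data.to_pandas(converter="...")  # converting the results to a pandas DataFrame
data.write_parquet("path/to/output_folder", converter="...")  # writing the results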

Execution of a lazy collection

You can configure how the operations scheduled in the lazy collection are executed by calling its set_processing(...) method. The following options are available:

batch_size

Number of documents to process at a time in a GPU worker (or in the main process if no workers are used).

TYPE: int DEFAULT: INFER

batch_by

How to compute the batch size. Can be "docs", "words" or "padded_words":

  • "docs" (default) is the number of documents.
  • "words" is the total number of words in the documents.
  • "padded_words" is the total number of words in the documents, including padding, assuming the documents are padded to the same length.

TYPE: Literal['docs', 'words', 'padded_words'] DEFAULT: 'docs'

chunk_size

Number of documents to accumulate before splitting them into batches. Only used with "simple" and "multiprocessing" backends. This is also the number of documents that will be passed through the first components of the pipeline until a GPU worker is used (then the chunks will be split according to the batch_size and batch_by arguments).

By default, the chunk size is equal to the batch size, or 128 if the batch size is not set.

TYPE: int DEFAULT: INFER

sort_chunks

Whether to sort the documents by size before splitting into batches.

TYPE: bool DEFAULT: False

split_into_batches_after

The name of the component after which to split the documents into batches. Only used with "simple" and "multiprocessing" backends. By default, the documents are split into batches as soon as the inputs are converted into Doc objects.

TYPE: str DEFAULT: INFER

num_cpu_workers

Number of CPU workers. A CPU worker handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components. If no GPU workers are used, the CPU workers also handle the forward call of the deep-learning components.

TYPE: Optional[int] DEFAULT: INFER

num_gpu_workers

Number of GPU workers. A GPU worker handles the forward call of the deep-learning components. Only used with "multiprocessing" backend.

TYPE: Optional[int] DEFAULT: INFER

disable_implicit_parallelism

Whether to disable OpenMP and Hugging Face tokenizers' implicit parallelism in multiprocessing mode. Defaults to True.

TYPE: bool DEFAULT: True

backend

The backend to use for parallel processing. If not set, the backend is automatically selected based on the input data and the number of workers.

  • "simple" is the default backend and is used when num_cpu_workers is 1 and num_gpu_workers is 0.
  • "multiprocessing" is used when num_cpu_workers is greater than 1 or num_gpu_workers is greater than 0.
  • "spark" is used when the input data is a Spark dataframe and the output writer is a Spark writer.

TYPE: Optional[Literal['simple', 'multiprocessing', 'mp', 'spark']] DEFAULT: INFER

autocast

Whether to use automatic mixed precision (AMP) for the forward pass of the deep-learning components. If True (by default), AMP will be used with the default settings. If False, AMP will not be used. If a dtype is provided, it will be passed to the torch.autocast context manager.

TYPE: Union[bool, Any] DEFAULT: INFER

show_progress

Whether to show progress bars (only applicable with "simple" and "multiprocessing" backends).

TYPE: bool DEFAULT: False

gpu_pipe_names

List of pipe names to accelerate on a GPUWorker, defaults to all pipes that inherit from TorchComponent. Only used with "multiprocessing" backend. Inferred from the pipeline if not set.

TYPE: Optional[List[str]] DEFAULT: INFER

process_start_method

Whether to use "fork" or "spawn" as the start method for the multiprocessing backend. The default is "fork" on Unix systems and "spawn" on Windows.

  • "fork" is the default start method on Unix systems and is the fastest start method, but it is not available on Windows, can cause issues with CUDA and is not safe when using multiple threads.
  • "spawn" is the default start method on Windows and is the safest start method, but it is not available on Unix systems and is slower than "fork".

TYPE: Optional[Literal['fork', 'spawn']] DEFAULT: INFER

gpu_worker_devices

List of GPU devices to use for the GPU workers. Defaults to all available devices, one worker per device. Only used with "multiprocessing" backend.

TYPE: Optional[List[str]] DEFAULT: INFER

cpu_worker_devices

List of GPU devices to use for the CPU workers. Used for debugging purposes.

TYPE: Optional[List[str]] DEFAULT: INFER

Backends

Simple backend

This is the default execution mode which batches the documents and processes each batch on the current process in a sequential manner.
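
For instance, a minimal configuration sketch for this backend:

data = data.set_processing(backend="simple", show_progress=True, batch_size=32)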

Multiprocessing backend

If you have multiple CPU cores, and optionally multiple GPUs, we provide the multiprocessing backend, which allows running the inference in multiple processes.

This accelerator dispatches the batches between multiple workers (data-parallelism), and distributes the computation of a given batch on one or two workers (model-parallelism):

  • a CPUWorker which handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components
  • a GPUWorker which handles the forward call of the deep-learning components

If no GPU is available, no GPUWorker is started, and the CPUWorkers handle the forward call of the deep-learning components as well.

The advantage of dedicating a worker to the deep-learning components is that it allows multiple batches to be prepared in parallel by the CPUWorkers, and ensures that the GPUWorkers never wait for a batch to be ready.
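
For instance, a sketch of such a configuration (the backend would also be inferred automatically here, since GPU workers are requested):

data = data.set_processing(
    backend="multiprocessing",
    # rule-based pipes, preprocessing, collating and postprocessing
    num_cpu_workers=3,
    # forward pass of the deep-learning components
    num_gpu_workers=2,
)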

The overall architecture is described in the following figure, for 3 CPU workers and 2 GPU workers.

Here is how a small pipeline with rule-based components and deep-learning components is distributed between the workers:

Caveat

Since workers can produce their results in any order, the order of the results may not be the same as the order of the input tasks.

Spark backend

This execution mode uses Spark to parallelize the processing of the documents. The documents are first stored in a Spark DataFrame (if it was not already the case) and then processed in parallel using Spark.

Beware: if the original reader was not a SparkReader (edsnlp.data.from_spark), the conversion from local docs to a Spark dataframe might take some time, and the whole process might be slower than using the multiprocessing backend.
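
A sketch of this setup (df is assumed to be an existing Spark dataframe of documents, the to_spark writer is assumed to mirror the from_spark reader, and converters are placeholders):

import edsnlp

nlp = edsnlp.load("path/to/model")

data = edsnlp.data.from_spark(df, converter="...")
data = data.map_pipeline(nlp)
data = data.set_processing(backend="spark")
result = data.to_spark(converter="...")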