# Deploying on Spark
We provide a simple connector to run a pipeline on a Spark cluster: a Spark UDF (user-defined function) factory that handles the nitty-gritty of distributing the computation over a cluster of Spark-enabled machines.
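To give a sense of what such a factory abstracts away, here is a minimal sketch of the underlying pattern in plain PySpark. Everything in it (`make_nlp_udf`, the output schema, the entity fields) is an illustrative assumption, not the connector's actual API.

```python
from pyspark.sql import functions as F
from pyspark.sql import types as T

# One struct per detected entity (illustrative schema).
ENTITY_SCHEMA = T.ArrayType(
    T.StructType(
        [
            T.StructField("label", T.StringType()),
            T.StructField("start", T.IntegerType()),
            T.StructField("end", T.IntegerType()),
        ]
    )
)


def make_nlp_udf(nlp_factory):
    """Wrap a picklable pipeline *factory* into a Spark UDF.

    The pipeline itself is built lazily on each executor, rather than
    pickled on the driver and shipped over the wire.
    """
    nlp = None

    def extract(text):
        nonlocal nlp
        if nlp is None:
            nlp = nlp_factory()
        doc = nlp(text)
        return [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

    return F.udf(extract, ENTITY_SCHEMA)
```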
## Distributing a pipeline
Because of the way Spark distributes Python objects, custom extensions need to be re-declared on the executors. To make this step as smooth as possible, EDS-NLP provides a `BaseComponent` class that implements a `set_extensions` method. When the pipeline is distributed, every component that extends `BaseComponent` re-runs its `set_extensions` method.
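As a sketch, a custom component extending `BaseComponent` could declare its extensions as follows. The import path and the `classmethod` form are assumptions that may vary across EDS-NLP versions; only the `BaseComponent` and `set_extensions` names come from the documentation above.

```python
from spacy.tokens import Doc

# Assumed import path; it may differ between EDS-NLP versions.
from edsnlp.pipelines.base import BaseComponent


class CustomComponent(BaseComponent):
    @classmethod
    def set_extensions(cls) -> None:
        # Declaring extensions here (rather than at import time) lets the
        # connector re-declare them on every Spark executor.
        if not Doc.has_extension("custom_attr"):
            Doc.set_extension("custom_attr", default=None)

    def __call__(self, doc: Doc) -> Doc:
        doc._.custom_attr = "processed"
        return doc
```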
Since spaCy `Doc` objects cannot easily be serialised, the UDF we provide returns a list of detected entities along with selected qualifiers.
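For illustration, applying the hypothetical `make_nlp_udf` sketched above to a Spark DataFrame could look like this; the column names, the `build_pipeline` helper and the `entity_ruler` pattern are made up for the example.

```python
import spacy
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def build_pipeline():
    # Hypothetical, picklable factory: rebuilds the pipeline from scratch
    # on each executor. A plain spaCy entity_ruler stands in for EDS-NLP
    # components here.
    nlp = spacy.blank("fr")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([{"label": "DISEASE", "pattern": "pneumopathie"}])
    return nlp


spark = SparkSession.builder.getOrCreate()

notes = spark.createDataFrame(
    [(1, "Patient admis pour une pneumopathie.")],
    ["note_id", "note_text"],
)

entities = notes.withColumn(
    "entities", make_nlp_udf(build_pipeline)(F.col("note_text"))
)
entities.select("note_id", F.explode("entities").alias("entity")).show()
```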
## Example
See the dedicated tutorial for a step-by-step presentation.
## Authors and citation
The Spark connector was developed by AP-HP's Data Science team.