Transformer
The eds.transformer component is a wrapper around Hugging Face's transformers library. If you are not familiar with transformers, a good way to start is the Illustrated Transformer tutorial.
Compared to using the raw Hugging Face model, this component adds a simple mechanism to split long documents into strided windows before feeding them to the model.
Windowing
EDS-NLP's Transformer component splits long documents into smaller windows before feeding them to the model. This is done to avoid hitting the maximum number of tokens that can be processed by the model on a single device. The window size and stride can be configured using the window and stride parameters. The default values are 512 and 256 respectively, which means that the model will process windows of 512 tokens, with consecutive windows starting 256 tokens apart (so neighboring windows overlap by 256 tokens). Whenever a token appears in multiple windows, the embedding of the "most contextualized" occurrence is used, i.e. the occurrence that is closest to the center of its window.
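The windowing rule described above can be sketched in plain Python. This is a minimal illustration, not EDS-NLP's actual implementation: the helper names make_windows and pick_window are hypothetical, windows are expressed as (start, end) token offsets, and the handling of the last (shorter) window and of ties is an assumption.

```python
def make_windows(n_tokens, window=512, stride=256):
    """Return (start, end) token offsets of strided windows covering the sequence.

    Hypothetical helper: consecutive windows start `stride` tokens apart,
    and the last window is truncated at the end of the sequence.
    """
    windows = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return windows


def pick_window(token_index, windows):
    """Return the index of the window in which the token is closest to the center.

    Assumption: "most contextualized" is measured as the distance between
    the token and the midpoint of each window that contains it.
    """
    def distance_to_center(bounds):
        start, end = bounds
        return abs(token_index - (start + end - 1) / 2)

    candidates = [
        i for i, (start, end) in enumerate(windows) if start <= token_index < end
    ]
    return min(candidates, key=lambda i: distance_to_center(windows[i]))


# A 1000-token document with the values quoted above (window=512, stride=256).
windows = make_windows(1000)
chosen = [pick_window(t, windows) for t in (300, 400, 700)]
```

In this sketch, token 300 sits in the first two windows and is taken from the first one, where it is nearer the center, while token 400 is taken from the second.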
Here is an overview of how this works to produce embeddings (shown in red) for each word of the document:
Examples
Here is an example of how to define a pipeline with a Transformer component:
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.transformer(
        model="prajjwal1/bert-tiny",
        window=128,
        stride=96,
    ),
)
You can then compose this embedding with a task-specific component such as eds.ner_crf.
Parameters
PARAMETER | DESCRIPTION |
---|---|
nlp | The pipeline instance |
name | The component name |
model | The Hugging Face model name or path |
window | The window size to use when splitting long documents into smaller windows before feeding them to the Transformer model (default: 512 = 512 - 2) |
stride | The stride (distance between windows) to use when splitting long documents into smaller windows (default: 96) |
training_stride | If False, the stride will be set to the window size during training, meaning that there will be no overlap between windows. If True, the stride will be set to the |
max_tokens_per_device | The maximum number of tokens that can be processed by the model on a single device. This does not affect the results but can be used to reduce the memory usage of the model, at the cost of a longer processing time. If "auto", the component will try to estimate the maximum number of tokens that can be processed by the model on the current device at a given time. |
span_getter | Which spans of the document should be embedded. Defaults to the full document if None. |
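To illustrate why max_tokens_per_device trades memory for processing time without changing the results, here is a minimal sketch of grouping windows into successive forward passes under a token budget. It is an assumption about the general idea, not the library's code: the function batch_windows is hypothetical, and the real component may count padded tokens or order windows differently.

```python
def batch_windows(window_lengths, max_tokens_per_device):
    """Greedily group window indices so each group stays within the token budget.

    Hypothetical helper: every window is processed exactly once, so the
    embeddings are the same whatever the budget; only the number of
    forward passes changes. A window longer than the budget is still
    emitted alone in its own group.
    """
    batches, current, current_tokens = [], [], 0
    for index, length in enumerate(window_lengths):
        # Start a new group when adding this window would exceed the budget.
        if current and current_tokens + length > max_tokens_per_device:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


# Three windows of 512, 512 and 488 tokens under different budgets.
lengths = [512, 512, 488]
small_budget = batch_windows(lengths, 512)
large_budget = batch_windows(lengths, 4096)
```

With a budget of 512 tokens each window gets its own pass, whereas a budget of 4096 tokens fits all three windows in a single pass.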