RelativeAttention

A self/cross-attention layer that takes the relative positions of elements into account when computing the attention weights. In a relative attention layer, keys and queries are represented using both content and position embeddings, where the position embeddings are retrieved from the positions of keys relative to queries.

Parameters

PARAMETER DESCRIPTION
size

The size of the output embeddings. Also serves as the default for query_size, pos_size and key_size when these are None.

TYPE: int

n_heads

The number of attention heads

TYPE: int

query_size

The size of the query embeddings.

TYPE: Optional[int] DEFAULT: None

key_size

The size of the key embeddings.

TYPE: Optional[int] DEFAULT: None

value_size

The size of the value embeddings

TYPE: Optional[int] DEFAULT: None

head_size

The size of each query / key / value chunk used in the attention dot product. Defaults to key_size / n_heads.

TYPE: Optional[int] DEFAULT: None

position_embedding

The position embedding used as key and query embeddings

TYPE: Optional[Union[FloatTensor, Parameter]] DEFAULT: None

dropout_p

Dropout probability applied to the attention weights.

TYPE: float DEFAULT: 0.0

same_key_query_proj

Whether to use the same projection operator for content keys and queries when computing the pre-attention key and query embedding chunks. Default: False

TYPE: bool DEFAULT: False

same_positional_key_query_proj

Whether to use the same projection operator for positional keys and queries when computing the pre-attention key and query embedding chunks. Default: False

TYPE: bool DEFAULT: False

n_coordinates

The number of positional coordinates. For instance, text is 1D so 1 coordinate, images are 2D so 2 coordinates, etc. Default: 1

TYPE: int DEFAULT: 1

head_bias

Whether to learn a bias term to add to the attention logits. This is only useful if you plan to use the attention logits for subsequent operations, since attention weights are unaffected by bias terms.

TYPE: bool DEFAULT: True

do_pooling

Whether to compute the output embedding. If you only plan to use attention logits, you should disable this parameter. Default: True

TYPE: bool DEFAULT: True

mode

Whether to compute content-to-content (c2c), content-to-position (c2p) or position-to-content (p2c) attention terms. Setting mode=('c2c',) disables the relative position attention terms: this is the standard attention layer (a short instantiation sketch follows this parameter). To get a better intuition about these different types of attention, here is a formulation as fictitious search queries from a word in a (1D) text:

  • content-content: "my content is 'ultrasound', so I'm looking for other words whose content contains information about temporality"
  • content-position: "my content is 'ultrasound', so I'm looking for other words that are 3 positions after me"
  • position-content: "regardless of my content, I will attend to the word one position after me if it contains information about temporality, to the word two positions after me if it contains information about location, etc."

TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']] DEFAULT: ('c2c', 'p2c', 'c2p')
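For illustration, here is a minimal instantiation sketch contrasting the two extremes of the mode parameter. The import path and the position embedding layout are assumptions (the page does not specify them), so adjust them to your package layout:

```python
import torch

# Assumption: adjust this import to wherever RelativeAttention lives in your package.
# from your_package.layers import RelativeAttention

# Standard attention: content-to-content terms only, no relative position terms.
vanilla_attn = RelativeAttention(size=128, n_heads=8, mode=("c2c",))

# Full relative attention with content-to-position and position-to-content terms.
# The (n_positions, size) layout of the position embedding is an assumption.
position_embedding = torch.nn.Parameter(torch.randn(512, 128))
relative_attn = RelativeAttention(
    size=128,
    n_heads=8,
    mode=("c2c", "c2p", "p2c"),
    position_embedding=position_embedding,
)
```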

n_additional_heads

The number of additional head logits to compute. Those are not used to compute the output embeddings, but may be useful in subsequent operations (see the configuration sketch after this parameter list). Default: 0

TYPE: int DEFAULT: 0
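As a complement, here is a hedged sketch of a configuration that only produces attention logits, combining the do_pooling, head_bias and n_additional_heads parameters described above (same import caveat as in the previous sketch):

```python
# Logits-only configuration: the output embedding is not computed, and two extra
# heads are exposed whose logits can be reused by subsequent operations.
scorer = RelativeAttention(
    size=128,
    n_heads=8,
    mode=("c2c",),          # content-to-content only
    do_pooling=False,       # skip the output embedding computation
    head_bias=True,         # the bias only matters when the raw logits are reused
    n_additional_heads=2,   # extra logits, not used for output pooling
)
```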

forward

Forward pass of the RelativeAttention layer.

PARAMETER DESCRIPTION
content_queries

The content query embeddings to use in the attention computation. Shape: n_samples * n_queries * query_size

TYPE: FloatTensor

content_keys

The content key embeddings to use in the attention computation. If None, defaults to the content_queries. Shape: n_samples * n_keys * query_size

TYPE: Optional[FloatTensor] DEFAULT: None

content_values

The content values embedding to use in the final pooling computation. If None, pooling won't be performed. Shape: n_samples * n_keys * query_size

TYPE: Optional[FloatTensor] DEFAULT: None

mask

The attention mask restricting which key / query pairs may attend to each other. If None, no masking is applied. Shape: either n_samples * n_keys, n_samples * n_queries * n_keys, or n_samples * n_queries * n_keys * n_heads

TYPE: Optional[BoolTensor] DEFAULT: None

relative_positions

The positions of keys relative to queries. If None, the positional attention terms won't be computed. Shape: n_samples * n_queries * n_keys * n_coordinates

TYPE: Optional[LongTensor] DEFAULT: None

no_position_mask

Key / query pairs for which the position attention terms should be disabled. Shape: n_samples * n_queries * n_keys

TYPE: Optional[BoolTensor] DEFAULT: None

base_attn

Attention logits to add to the computed attention logits. Shape: n_samples * n_queries * n_keys * n_heads

TYPE: Optional[FloatTensor] DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple[FloatTensor, FloatTensor], FloatTensor]
  • the output contextualized embeddings (only if content_values is not None and the do_pooling attribute is set to True). Shape: n_samples * n_keys * size
  • the attention logits. Shape: n_samples * n_keys * n_queries * (n_heads + n_additional_heads)
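
To tie the pieces together, here is a hedged end-to-end sketch of a self-attention forward pass over a 1D text, building the relative_positions tensor described above. The import path, the position embedding layout and the handling of out-of-range relative positions are assumptions, not part of this page:

```python
import torch

# Assumption: adjust this import to your package layout.
# from your_package.layers import RelativeAttention

n_samples, n_words, size = 2, 50, 128

# Assumed (n_positions, size) layout for the position embedding.
position_embedding = torch.nn.Parameter(torch.randn(512, size))

attn = RelativeAttention(
    size=size,
    n_heads=8,
    position_embedding=position_embedding,
    n_coordinates=1,  # text is 1D, so a single positional coordinate
)

words = torch.randn(n_samples, n_words, size)
mask = torch.ones(n_samples, n_words, dtype=torch.bool)  # n_samples * n_keys mask

# Positions of keys relative to queries (key index minus query index),
# shaped n_samples * n_queries * n_keys * n_coordinates.
positions = torch.arange(n_words)
relative_positions = (positions[None, None, :] - positions[None, :, None]).unsqueeze(-1)
relative_positions = relative_positions.expand(n_samples, -1, -1, -1)
# Depending on the implementation, these values may need to be clamped or offset
# to valid indices of the position embedding.

# Self-attention: queries, keys and values all come from the same embeddings.
# With content_values provided and do_pooling=True, both the contextualized
# embeddings and the attention logits are returned.
embeds, attn_logits = attn(
    content_queries=words,
    content_keys=words,
    content_values=words,
    mask=mask,
    relative_positions=relative_positions,
)
```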