RelativeAttention

A self/cross-attention layer that takes the relative positions of elements into account when computing the attention weights. In a relative attention layer, keys and queries are represented using both content and position embeddings, where the position embeddings are retrieved from the positions of keys relative to queries.

Parameters

PARAMETER DESCRIPTION
size

The size of the output embeddings. Also serves as the default for query_size, pos_size and key_size when these are None.

TYPE: int

n_heads

The number of attention heads

TYPE: int

query_size

The size of the query embeddings.

TYPE: Optional[int] DEFAULT: None

key_size

The size of the key embeddings.

TYPE: Optional[int] DEFAULT: None

value_size

The size of the value embeddings

TYPE: Optional[int] DEFAULT: None

head_size

The size of each query / key / value chunk used in the attention dot product. Defaults to key_size / n_heads.

TYPE: Optional[int] DEFAULT: None

position_embedding

The position embedding used as key and query embeddings

TYPE: Optional[Union[FloatTensor, Parameter]] DEFAULT: None

dropout_p

Dropout probability applied to the attention weights.

TYPE: float DEFAULT: 0.0

same_key_query_proj

Whether to use the same projection operator for content keys and queries when computing the pre-attention key and query embedding chunks. Default: False

TYPE: bool DEFAULT: False

same_positional_key_query_proj

Whether to use the same projection operator for positional keys and queries when computing the pre-attention key and query embedding chunks. Default: False

TYPE: bool DEFAULT: False

n_coordinates

The number of positional coordinates. For instance, text is 1D so 1 coordinate, images are 2D so 2 coordinates, etc. Default: 1

TYPE: int DEFAULT: 1

head_bias

Whether to learn a bias term to add to the attention logits. This is only useful if you plan to use the attention logits for subsequent operations, since attention weights are unaffected by bias terms.

TYPE: bool DEFAULT: True

do_pooling

Whether to compute the output embedding. If you only plan to use attention logits, you should disable this parameter. Default: True

TYPE: bool DEFAULT: True

mode

Whether to compute content-to-content (c2c), content-to-position (c2p) or position-to-content (p2c) attention terms. Setting mode=('c2c',) disables the relative position attention terms: this is the standard attention layer (a short instantiation sketch follows this parameter). To get a better intuition about these different types of attention, here is a formulation as fictitious search queries from a word in a (1D) text:

  • content-content: "my content is 'ultrasound', so I'm looking for other words whose content contains information about temporality"
  • content-position: "my content is 'ultrasound', so I'm looking for other words that are 3 positions after me"
  • position-content: "regardless of my content, I will attend to the word one position after me if it contains information about temporality, to the word two positions after me if it contains information about location, etc."

TYPE: Sequence[Literal['c2c', 'c2p', 'p2c']] DEFAULT: ('c2c', 'p2c', 'c2p')
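For illustration, here is a minimal instantiation sketch contrasting the two extremes of the mode parameter. The import path and the position embedding layout are assumptions (the page does not specify them), so adjust them to your package layout:

```python
import torch

# Assumption: adjust this import to wherever RelativeAttention lives in your package.
# from your_package.layers import RelativeAttention

# Standard attention: content-to-content terms only, no relative position terms.
vanilla_attn = RelativeAttention(size=128, n_heads=8, mode=("c2c",))

# Full relative attention with content-to-position and position-to-content terms.
# The (n_positions, size) layout of the position embedding is an assumption.
position_embedding = torch.nn.Parameter(torch.randn(512, 128))
relative_attn = RelativeAttention(
    size=128,
    n_heads=8,
    mode=("c2c", "c2p", "p2c"),
    position_embedding=position_embedding,
)
```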

n_additional_heads

The number of additional head logits to compute. Those are not used to compute the output embeddings, but may be useful in subsequent operations (see the configuration sketch after this parameter list). Default: 0

TYPE: int DEFAULT: 0
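As a complement, here is a hedged sketch of a configuration that only produces attention logits, combining the do_pooling, head_bias and n_additional_heads parameters described above (same import caveat as in the previous sketch):

```python
# Logits-only configuration: the output embedding is not computed, and two extra
# heads are exposed whose logits can be reused by subsequent operations.
scorer = RelativeAttention(
    size=128,
    n_heads=8,
    mode=("c2c",),          # content-to-content only
    do_pooling=False,       # skip the output embedding computation
    head_bias=True,         # the bias only matters when the raw logits are reused
    n_additional_heads=2,   # extra logits, not used for output pooling
)
```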

forward

Forward pass of the RelativeAttention layer.

PARAMETER DESCRIPTION
content_queries

The content query embeddings to use in the attention computation. Shape: n_samples * n_queries * query_size

TYPE: FloatTensor

content_keys

The content key embeddings to use in the attention computation. If None, defaults to the content_queries. Shape: n_samples * n_keys * query_size

TYPE: Optional[FloatTensor] DEFAULT: None

content_values

The content values embedding to use in the final pooling computation. If None, pooling won't be performed. Shape: n_samples * n_keys * query_size

TYPE: Optional[FloatTensor] DEFAULT: None

mask

The attention mask restricting which key / query pairs may attend to each other. If None, no masking is applied. Shape: either n_samples * n_keys, n_samples * n_queries * n_keys, or n_samples * n_queries * n_keys * n_heads

TYPE: Optional[BoolTensor] DEFAULT: None

relative_positions

The positions of keys relative to queries. If None, the positional attention terms won't be computed. Shape: n_samples * n_queries * n_keys * n_coordinates

TYPE: Optional[LongTensor] DEFAULT: None

no_position_mask

Key / query pairs for which the position attention terms should be disabled. Shape: n_samples * n_queries * n_keys

TYPE: Optional[BoolTensor] DEFAULT: None

base_attn

Attention logits to add to the computed attention logits. Shape: n_samples * n_queries * n_keys * n_heads

TYPE: Optional[FloatTensor] DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple[FloatTensor, FloatTensor], FloatTensor]
  • the output contextualized embeddings (only if content_values is not None and the do_pooling attribute is set to True). Shape: n_samples * n_keys * size
  • the attention logits. Shape: n_samples * n_keys * n_queries * (n_heads + n_additional_heads)
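
To tie the pieces together, here is a hedged end-to-end sketch of a self-attention forward pass over a 1D text, building the relative_positions tensor described above. The import path, the position embedding layout and the handling of out-of-range relative positions are assumptions, not part of this page:

```python
import torch

# Assumption: adjust this import to your package layout.
# from your_package.layers import RelativeAttention

n_samples, n_words, size = 2, 50, 128

# Assumed (n_positions, size) layout for the position embedding.
position_embedding = torch.nn.Parameter(torch.randn(512, size))

attn = RelativeAttention(
    size=size,
    n_heads=8,
    position_embedding=position_embedding,
    n_coordinates=1,  # text is 1D, so a single positional coordinate
)

words = torch.randn(n_samples, n_words, size)
mask = torch.ones(n_samples, n_words, dtype=torch.bool)  # n_samples * n_keys mask

# Positions of keys relative to queries (key index minus query index),
# shaped n_samples * n_queries * n_keys * n_coordinates.
positions = torch.arange(n_words)
relative_positions = (positions[None, None, :] - positions[None, :, None]).unsqueeze(-1)
relative_positions = relative_positions.expand(n_samples, -1, -1, -1)
# Depending on the implementation, these values may need to be clamped or offset
# to valid indices of the position embedding.

# Self-attention: queries, keys and values all come from the same embeddings.
# With content_values provided and do_pooling=True, both the contextualized
# embeddings and the attention logits are returned.
embeds, attn_logits = attn(
    content_queries=words,
    content_keys=words,
    content_values=words,
    mask=mask,
    relative_positions=relative_positions,
)
```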