Self-attention is the heart of the Transformer. It allows every token in a sequence to attend to every other token (including itself), capturing contextual relationships in parallel.
For an input sequence of tokens represented as matrix X (shape: sequence_length × d_model), self-attention works as follows:
Project the input into three matrices using learned linear transformations:
Query matrix: Q = X W^Q
Key matrix: K = X W^K
Value matrix: V = X W^V
(W^Q, W^K ∈ ℝ^{d_model × d_k}, W^V ∈ ℝ^{d_model × d_v}; usually d_k = d_v = d_model / h)
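The projection step can be sketched in NumPy as follows; the sizes (seq_len = 4, d_model = 8) and variable names are illustrative, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8  # toy sizes; here d_v = d_k = d_model (single head)

X = rng.standard_normal((seq_len, d_model))   # input token representations
W_Q = rng.standard_normal((d_model, d_k))     # learned projection weights
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# One matrix multiply per projection: each row of Q/K/V corresponds to one token.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each of shape (seq_len, d_k)
```

In a trained model these weight matrices are learned parameters; random values stand in for them here.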
Compute raw attention scores via dot product:
Scores = Q K^T (shape: seq_len × seq_len)
Scale by √d_k to stabilize gradients: for large d_k the dot products grow large in magnitude, pushing the softmax into regions where its gradients are extremely small; dividing by √d_k counteracts this.
Scaled Scores = Q K^T / √d_k
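A minimal sketch of the scoring and scaling steps, assuming Q and K have already been computed as above:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

# Entry (i, j) is the scaled dot product between token i's query and token j's key.
scaled_scores = Q @ K.T / np.sqrt(d_k)   # shape: (seq_len, seq_len)
```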
Apply optional masking (e.g., the causal mask in the decoder, which prevents each position from attending to future positions).
Normalize with softmax to get attention weights (probabilities):
Attention Weights = softmax(Scaled Scores)
Weighted sum of values:
Attention(Q, K, V) = Attention Weights × V
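The softmax normalization and weighted sum can be sketched as follows; the max-subtraction inside the softmax is a standard numerical-stability trick, not part of the formula itself:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_v = 4, 8
scaled_scores = rng.standard_normal((seq_len, seq_len))
V = rng.standard_normal((seq_len, d_v))

weights = softmax(scaled_scores)  # each row is a probability distribution over tokens
output = weights @ V              # context-weighted combination of all value vectors
```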
This produces a new representation for each token that is a context-weighted combination of all tokens’ values.
Here is the mathematical formulation:
Attention(Q, K, V) = softmax( (Q K^T) / √d_k ) V
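Putting the steps together, the full formula can be sketched as one function; the optional mask argument is an assumption about how the masking step from above would be wired in:

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) scaled scores
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights               # output and attention weights
```

Calling it with a causal mask reproduces the decoder behavior described above: each output row depends only on the current and earlier tokens.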
Instead of performing attention once with d_model-dimensional vectors, the Transformer splits the attention into h parallel “heads” (typically h=8 or 16). Each head operates in a lower-dimensional subspace (d_k = d_v = d_model / h).
Multi-Head Attention(Q, K, V) = Concat(head_1, …, head_h) W^O
Where head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
This allows the model to jointly attend to information from different representation subspaces (e.g., syntactic, semantic, or coreference relationships), improving expressivity and generalization.
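The multi-head formulation above can be sketched as follows; representing the per-head weights as a list of (W_Q_i, W_K_i, W_V_i) triples is an illustrative choice, and real implementations typically fuse the heads into single batched matrix multiplies:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, head_params, W_O):
    """head_params: list of (W_Q_i, W_K_i, W_V_i), one triple per head."""
    heads = []
    for W_Q, W_K, W_V in head_params:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # project into this head's subspace
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)             # head_i, shape (seq_len, d_v)
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h                            # each head works in a smaller subspace
head_params = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
               for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))
out = multi_head_attention(rng.standard_normal((seq_len, d_model)), head_params, W_O)
```

Note that the concatenated head outputs have h · d_v = d_model columns, so W^O maps the result back to the model dimension.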