Transformer: Complete Guide — Transformer Architecture and Self-Attention
Summary: The Transformer, introduced by Vaswani et al. in the paper Attention Is All You Need (2017), is a neural network architecture that reshaped sequence processing. Unlike RNNs and LSTMs, which process data sequentially, the Transformer relies solely on attention mechanisms to connect each element of the sequence to every other element. This architecture is the foundation of BERT, GPT, T5, and nearly all modern large language models.
Mathematical Principle
1. Scaled Dot-Product Attention
The fundamental attention is based on the Query-Key-Value formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
- Q (Query): what the current token is looking for
- K (Key): what each element can offer
- V (Value): the actual content of each element
- $\sqrt{d_k}$: scaling factor to avoid saturated softmax when $d_k$ is large
The product $QK^\top$ computes a similarity score between each query and each key. The softmax turns each row of scores into a probability distribution over the keys. The output for each query is then the weighted sum of the values under that distribution.
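To see why the $\sqrt{d_k}$ factor matters, the following sketch draws random queries and keys with unit-variance components and measures the spread of their dot products. It is only an illustration under that assumption; the helper name `dot_product_std` and its parameters are made up for this example.

```python
import numpy as np

def dot_product_std(d_k, n_samples=10000, seed=0):
    """Empirical std of q.k (raw and scaled) for random unit-variance vectors."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=-1)            # one dot product per sample
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 512):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")
```

The raw standard deviation grows roughly like $\sqrt{d_k}$, while the scaled one stays close to 1, which keeps the softmax away from its saturated regime.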
2. Multi-Head Attention
Instead of a single attention, we use multiple heads in parallel:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$
$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
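Each head can learn a different aspect of the relationships between tokens (syntax, semantics, long-distance dependencies, etc.), and the concatenation merges these perspectives. Each head typically works in a reduced dimension $d_k = d_{\text{model}} / h$, so the total cost stays close to that of a single full-dimension attention.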
3. Positional Encoding
Attention has no intrinsic notion of position, so a positional encoding is added to the embeddings. Even dimensions use sine and odd dimensions use cosine:
$$
\mathrm{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)
$$
$$
\mathrm{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)
$$
Because sine and cosine obey the angle-addition identities, for any fixed offset $k$ the encoding $\mathrm{PE}(pos + k)$ is a linear function of $\mathrm{PE}(pos)$. This makes it easy for the model to attend by relative position.
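The relative-position property can be checked numerically. The sketch below builds, for a fixed offset `k`, a block-diagonal matrix of 2x2 rotations and verifies that it maps the encoding of any position to the encoding of that position shifted by `k`. The function names `sinusoidal_pe` and `shift_matrix` are illustrative, not part of any library.

```python
import numpy as np

def sinusoidal_pe(pos, d_model):
    """PE vector for one position: sine on even indices, cosine on odd ones."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos)."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for the (sin, cos) pair at frequency w
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

d_model, k = 16, 5
M = shift_matrix(k, d_model)
for pos in (0, 3, 42):
    assert np.allclose(M @ sinusoidal_pe(pos, d_model),
                       sinusoidal_pe(pos + k, d_model))
print("PE(pos + k) is a fixed linear map of PE(pos)")
```

Note that `M` depends only on the offset `k`, not on `pos`.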
4. Encoder-Decoder Architecture
Encoder: $N$ identical layers, each composed of:
1. Multi-Head Self-Attention
2. Add & Norm ($x + \text{Sublayer}(x)$ then LayerNorm)
3. Feed-Forward: $\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
4. Add & Norm
Decoder: $N$ identical layers, each with an additional cross-attention sublayer:
1. Multi-Head Self-Attention (with causal mask to prevent future information leakage)
2. Multi-Head Cross-Attention (queries from decoder, keys/values from encoder)
3. Feed-Forward
4. Add & Norm at each step
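The causal mask used in the decoder's self-attention can be sketched as a lower-triangular matrix. This is a minimal NumPy illustration; the helper names `causal_mask` and `masked_softmax` are assumptions made for this example.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: entry [i, j] is 1 if position i may attend to j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out entries receive (near-)zero weight."""
    scores = np.where(mask == 0, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)  # equal scores everywhere
print(mask)
print(weights.round(2))
```

With equal scores, row `i` spreads its weight uniformly over positions `0..i` and gives zero weight to every later position, so no information flows from the future.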
5. Architectural Variants
The original Transformer is encoder-decoder, but two derived families dominate today:
| Architecture | Layers | Usage | Examples |
|---|---|---|---|
| Encoder-only | Encoder only | Understanding (classification, NER) | BERT, RoBERTa, DeBERTa |
| Decoder-only | Decoder only | Text generation | GPT, LLaMA, Claude |
| Encoder-Decoder | Both | Translation, summarization | T5, BART, mBART |
Intuition
Before Transformers, RNNs and LSTMs read text one word at a time, like a slow reader who can never look back. To understand the end of a long sentence, they had to remember the beginning through their hidden state, which caused information loss.
The Transformer, on the other hand, reads the entire sentence at once.
Think of the difference between:
- Reading a sentence by discovering it word by word through a pinhole (RNN)
- Seeing the entire sentence at once and instantly understanding the connections between words (Transformer)
In the sentence “The cat that the dog had been chasing since this morning took refuge in the nearest tree,” to find the subject of “took refuge,” you need to go back to “cat.” An RNN must carry the information about “cat” through the nine intervening words to make this connection. The Transformer directly connects “took refuge” to “cat” in a single attention computation, regardless of the distance.
Moreover, since all tokens are processed in parallel, the Transformer is much faster to train on GPUs than RNNs.
Python Implementation
1. Scaled Dot-Product Attention from scratch
import numpy as np
def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # [seq_len, seq_len]
    if mask is not None:
        # Masked positions get a large negative score, so their weight is ~0
        scores = np.where(mask == 0, -1e9, scores)
    # Subtract the row maximum for numerical stability before exponentiating
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = weights @ V
    return output, weights
# Usage example
d_k = 64
Q = np.random.randn(10, d_k)  # 10 tokens, 64 dimensions
K = np.random.randn(10, d_k)
V = np.random.randn(10, d_k)
output, attn_weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention weights shape: {attn_weights.shape}")  # (10, 10)
2. Positional Encoding
def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding (d_model is assumed to be even)."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(0, max_len)[:, np.newaxis]  # [max_len, 1]
    # 1 / 10000^(2i / d_model), computed in log space
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # odd dimensions
    return pe
# Usage example
pe = positional_encoding(100, 512)
print(f"Positional encoding: {pe.shape}")  # (100, 512)
3. Complete Transformer Encoder Layer
import numpy as np
class TransformerEncoderLayer:
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_ff = d_ff
        self.head_dim = d_model // n_heads
        self.dropout = dropout  # stored only; this forward-pass sketch does not apply dropout
        # Projections for multi-head attention
        self.W_Q = np.random.randn(d_model, d_model) * 0.01
        self.W_K = np.random.randn(d_model, d_model) * 0.01
        self.W_V = np.random.randn(d_model, d_model) * 0.01
        self.W_O = np.random.randn(d_model, d_model) * 0.01
        # Feed-forward
        self.W_1 = np.random.randn(d_model, d_ff) * 0.01
        self.b_1 = np.zeros(d_ff)
        self.W_2 = np.random.randn(d_ff, d_model) * 0.01
        self.b_2 = np.zeros(d_model)
    def layer_norm(self, x):
        # Simplified LayerNorm without the learnable gain and bias
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + 1e-8)
    def multi_head_attention(self, Q, K, V, mask=None):
        batch_size = Q.shape[0]
        seq_len = Q.shape[1]
        # Project, then split the last dimension into (n_heads, head_dim)
        Q_h = (Q @ self.W_Q).reshape(batch_size, seq_len, self.n_heads, self.head_dim)
        K_h = (K @ self.W_K).reshape(batch_size, seq_len, self.n_heads, self.head_dim)
        V_h = (V @ self.W_V).reshape(batch_size, seq_len, self.n_heads, self.head_dim)
        # [batch, n_heads, seq_len, head_dim]
        Q_h = Q_h.transpose(0, 2, 1, 3)
        K_h = K_h.transpose(0, 2, 1, 3)
        V_h = V_h.transpose(0, 2, 1, 3)
        scores = Q_h @ K_h.transpose(0, 1, 3, 2) / np.sqrt(self.head_dim)
        if mask is not None:
            scores = np.where(mask == 0, -1e9, scores)
        weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        weights = weights / np.sum(weights, axis=-1, keepdims=True)
        heads = weights @ V_h
        # Concatenate the heads back into [batch, seq_len, d_model]
        heads = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, self.d_model)
        return heads @ self.W_O
    def feed_forward(self, x):
        """FFN(x) = ReLU(x @ W_1 + b_1) @ W_2 + b_2"""
        return np.maximum(0, x @ self.W_1 + self.b_1) @ self.W_2 + self.b_2
    def __call__(self, x, mask=None):
        attn_out = self.multi_head_attention(x, x, x, mask)
        x = self.layer_norm(x + attn_out)  # Add & Norm
        ff_out = self.feed_forward(x)
        x = self.layer_norm(x + ff_out)  # Add & Norm
        return x
# Demonstration
encoder = TransformerEncoderLayer(256, 8, 1024)
x = np.random.randn(2, 20, 256)  # batch of 2 sequences, 20 tokens each
out = encoder(x)
print(f"Encoder output: {out.shape}")  # (2, 20, 256)
4. Complete Transformer with Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
class TokenAndPositionEmbedding(layers.Layer):
    """Sum of a token embedding and a learned position embedding."""
    def __init__(self, max_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(max_len, embed_dim)
    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim // num_heads
        )
        self.ffn = keras.Sequential([layers.Dense(ff_dim, activation='relu'), layers.Dense(embed_dim)])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)
    def call(self, inputs, training=False):
        attn = self.att(inputs, inputs)  # self-attention: query = value = inputs
        attn = self.dropout1(attn, training=training)
        out1 = self.layernorm1(inputs + attn)
        ffn = self.ffn(out1)
        ffn = self.dropout2(ffn, training=training)
        return self.layernorm2(out1 + ffn)
# Complete model for text classification (encoder blocks only)
vocab_size, max_len, embed_dim = 10000, 200, 256
num_heads, ff_dim, n_layers = 8, 1024, 4
inputs = layers.Input(shape=(max_len,))
# Both embeddings live inside one layer so that the position embedding is trained with the model
x = TokenAndPositionEmbedding(max_len, vocab_size, embed_dim)(inputs)
for _ in range(n_layers):
    x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Hyperparameters
| Hyperparameter | Original Paper | Modern (LLM) | Description |
|---|---|---|---|
| $d_{model}$ | 512 | 4096-8192 | Embedding dimension |
| $n_{\text{heads}}$ | 8 | 32-64 | Number of attention heads |
| $n_{\text{layers}}$ | 6 (enc) + 6 (dec) | 32-128 | Number of stacked layers |
| $d_{ff}$ | 2048 | 16384 | Feed-forward dimension |
| dropout | 0.1 | 0.0-0.1 | Regularization |
| $L_{\text{max}}$ | 512 | 4096-128000 | Maximum sequence length |
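To make these hyperparameters concrete, the following sketch counts the weight-matrix parameters of one encoder layer from $d_{\text{model}}$ and $d_{ff}$. It deliberately ignores biases, LayerNorm parameters, and embeddings, and the function name `encoder_layer_weights` is made up for this example.

```python
def encoder_layer_weights(d_model, d_ff):
    """Weight-matrix parameters in one encoder layer (biases and LayerNorm excluded)."""
    attention = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    feed_forward = 2 * d_model * d_ff   # W_1 and W_2
    return attention + feed_forward

per_layer = encoder_layer_weights(d_model=512, d_ff=2048)
print(per_layer)      # 3145728
print(6 * per_layer)  # 18874368 for a stack of 6 encoder layers
```

With the original paper's values, the feed-forward block holds twice as many weights as the attention projections.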
Advantages of the Transformer
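- Massive parallelization: Unlike RNNs, which process tokens one by one, the Transformer processes the entire sequence in parallel during training. Training can be roughly 5 to 10 times faster on GPU, depending on the model and hardware.
- Long-range dependencies: Any two tokens are connected by a single attention step, so the number of sequential operations between them is constant, whereas an RNN needs a number of steps proportional to their distance.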
- Universality: The same architecture works for translation, classification, generation, vision (ViT), audio (Whisper), and even tabular data.
- Scalability: Performance continues to improve with more data, more parameters, and more compute. This is the property that enabled the scaling of modern LLMs.
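The parallelization and long-range points above can be illustrated with a toy comparison. In this sketch (random weights, no training, all names chosen for the example), the recurrent update needs one dependent step per token, while the attention scores for every pair of positions come from a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # 6 tokens, 4 dimensions
X = rng.standard_normal((n, d))

# Recurrent update: step t needs the result of step t-1,
# so the loop cannot be parallelized across positions.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
sequential_steps = 0
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)
    sequential_steps += 1

# Self-attention scores: all pairs of positions are compared at once.
scores = X @ X.T / np.sqrt(d)
print(sequential_steps, scores.shape)
```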
Limitations of the Transformer
- Quadratic complexity: Attention costs $O(n^2)$ in memory and computation relative to the sequence length $n$. For a sequence of 100,000 tokens, a single attention matrix already has $10^{10}$ entries, which becomes prohibitive.
- Energy consumption: Training a 175-billion-parameter Transformer is estimated to consume as much electricity as hundreds of households use in a year.
- Colossal data requirements: Transformers reach their full potential only with datasets of billions of tokens.
- Interpretability: Attention weights are not always a faithful explanation of what the model actually relies on to make a prediction.
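The quadratic cost listed above can be made concrete with a back-of-the-envelope calculation. The sketch assumes the full score matrix is materialized and stored in 16-bit floats (2 bytes per value); both are assumptions of this example, and `attention_matrix_bytes` is a hypothetical helper.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=2):
    """Memory for one seq_len x seq_len score matrix (2 bytes = 16-bit floats)."""
    return seq_len * seq_len * bytes_per_value

for n in (1_000, 10_000, 100_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}  {gb:10.3f} GB per head, per layer")
```

Multiplying the sequence length by 10 multiplies the memory by 100: at 100,000 tokens, one matrix alone takes 20 GB under these assumptions.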
4 Concrete Use Cases
1. Machine Translation (Google Translate)
The original Transformer was designed for English-German and English-French translation. The encoder reads the source sentence, and the decoder generates the target sentence token by token with a causal mask. On the WMT 2014 benchmarks, it surpassed the best previously published results, including RNN-based systems, and the architecture is still used in modern translation engines.
2. Legal Document Classification
A law firm uses an encoder-only Transformer (such as BERT or DeBERTa) to automatically classify thousands of contracts by type (NDA, lease, employment contract) and extract specific clauses. Because the model is already pre-trained, fine-tuning can often succeed with only a few hundred annotated examples.
3. Generative Language Model (GPT)
GPT-type models use a decoder-only Transformer trained on trillions of tokens to predict the next token. Autoregressive generation produces coherent text over thousands of words, capable of writing, translating, coding, and reasoning.
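The autoregressive loop can be sketched in a few lines. Here `toy_next_token_logits` is a hypothetical stand-in for a trained decoder-only model: it simply favors the token that follows the last one. The decoding strategy shown is greedy selection, which is only one of several options.

```python
import numpy as np

def toy_next_token_logits(tokens, vocab_size=5):
    """Hypothetical stand-in for a model: favors (last token + 1) mod vocab_size."""
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

def greedy_generate(prompt, n_new_tokens, next_token_logits=toy_next_token_logits):
    """Append the highest-scoring next token, one step at a time."""
    tokens = list(prompt)
    for _ in range(n_new_tokens):
        logits = next_token_logits(tokens)   # scores given everything generated so far
        tokens.append(int(np.argmax(logits)))
    return tokens

print(greedy_generate([0], 6))  # [0, 1, 2, 3, 4, 0, 1]
```

Each new token is fed back as input for the next step, which is why generation is sequential even though training is parallel.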
4. Vision Transformer (ViT)
The Vision Transformer (ViT) applies the Transformer architecture to images by cutting them into patches that play the role of visual tokens. When pre-trained on sufficiently large datasets (such as ImageNet-21k or JFT-300M), a ViT model achieves performance comparable to state-of-the-art CNNs and scales better as data and model size grow.
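The patch-cutting step can be sketched with plain NumPy reshapes. This is a minimal illustration assuming a channels-last image whose sides are multiples of the patch size; `image_to_patches` is a name chosen for this example, and the linear projection that ViT applies afterwards is omitted.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must tile evenly"
    rows, cols = H // patch_size, W // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (rows, cols, p, p, C)
    return patches.reshape(rows * cols, patch_size * patch_size * C)

image = np.random.rand(224, 224, 3)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (196, 768)
```

A 224x224 RGB image with 16x16 patches becomes a sequence of 196 tokens of dimension 768, which the Transformer then processes like a sentence.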
Conclusion
The Transformer is arguably the most influential machine learning architecture of the past decade. By replacing recurrence with attention, it enabled a leap forward in almost all areas of sequence processing. From BERT to GPT, from machine translation to generative language models, this single architecture has demonstrated an unprecedented capacity for generalization.
Although research is now exploring alternatives (Mamba, RWKV, and other architectures with linear-time sequence mixing) to reduce the quadratic cost, the Transformer remains the industry standard and is likely to stay dominant for years to come.

