Transformer: Transformer Architecture and Self-Attention

Transformer: Complete Guide — Transformer Architecture and Self-Attention

Summary: The Transformer, introduced by Vaswani et al. in the paper Attention Is All You Need (2017), is a neural network architecture that transformed sequence processing. Unlike RNNs and LSTMs, which process data one step at a time, the Transformer relies solely on attention mechanisms to connect each element of the sequence to every other element. This architecture is the foundation of BERT, GPT, T5, and most modern large language models.


Mathematical Principle

1. Scaled Dot-Product Attention

The core attention operation is defined by the Query-Key-Value formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

  • Q (Query): what I’m looking for
  • K (Key): what each element can offer
  • V (Value): the actual content of each element
  • $\sqrt{d_k}$: scaling factor that keeps the softmax from saturating when $d_k$ is large

The product $QK^T$ computes the similarity between each query and each key. The softmax turns each row of scores into weights that are non-negative and sum to 1. Finally, the output for each query is the weighted sum of the values under these weights.
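
To see why the scaling factor matters, the following sketch measures how the spread of raw dot products grows with $d_k$. It assumes queries and keys with independent, unit-variance components, which is an idealization; the function name `score_std` and the fixed seed are illustrative choices, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

def score_std(d_k, n_samples=5_000, scale=True):
    """Empirical std of q.k for random unit-variance q and k of size d_k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    scores = np.sum(q * k, axis=-1)      # one dot product per sample
    if scale:
        scores = scores / np.sqrt(d_k)   # the 1/sqrt(d_k) factor from the formula
    return scores.std()

for d_k in (4, 64, 512):
    print(d_k, round(score_std(d_k, scale=False), 1), round(score_std(d_k), 2))
```

Under these assumptions the unscaled spread grows roughly like $\sqrt{d_k}$, while the scaled spread stays close to 1, so the softmax inputs remain in a range where its gradients are not vanishingly small.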

2. Multi-Head Attention

Instead of a single attention function, we run several attention heads in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$

$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

Each head can learn a different aspect of the relationships between tokens (syntax, semantics, long-range dependencies, etc.), and the concatenation followed by $W^O$ merges these perspectives.
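
The shape bookkeeping behind the heads can be sketched as follows. This assumes the per-head projections are fused into one matrix whose output is then split into $h$ equal chunks, a common implementation choice; `split_heads` and `merge_heads` are hypothetical helper names.

```python
import numpy as np

def split_heads(x, n_heads):
    """[seq_len, d_model] -> [n_heads, seq_len, d_model // n_heads]"""
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    return x.reshape(seq_len, n_heads, d_model // n_heads).transpose(1, 0, 2)

def merge_heads(x):
    """[n_heads, seq_len, head_dim] -> [seq_len, n_heads * head_dim] (the Concat step)"""
    n_heads, seq_len, head_dim = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, n_heads * head_dim)

x = np.random.randn(10, 512)       # 10 tokens, d_model = 512
heads = split_heads(x, n_heads=8)  # 8 heads of dimension 64
print(heads.shape)                 # (8, 10, 64)
print(np.array_equal(merge_heads(heads), x))  # True: split then merge is lossless
```

Each head then runs scaled dot-product attention on its own 64-dimensional slice, so the total cost stays close to that of a single full-width attention.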

3. Positional Encoding

Attention has no intrinsic notion of position, so a positional encoding is added to the embeddings. Even dimensions use sine and odd dimensions use cosine, according to the standard equations:

$$
\mathrm{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)
$$

$$
\mathrm{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)
$$

These sinusoidal signals let the model infer relative positions between tokens: by the angle-addition identities, $\mathrm{PE}(pos + k)$ is a linear function of $\mathrm{PE}(pos)$ for any fixed offset $k$.
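
The relative-position property can be checked numerically. The sketch below builds, for a fixed offset $k$, a block-diagonal matrix of 2x2 rotations (one per frequency) and verifies that it maps the encoding of any position to the encoding of that position plus $k$. The helper names `pe_vector` and `shift_matrix` are illustrative.

```python
import numpy as np

def pe_vector(pos, d_model):
    """Sinusoidal encoding of one position: sine at even indices, cosine at odd indices."""
    omega = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(pos * omega)
    pe[1::2] = np.cos(pos * omega)
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal matrix M_k such that M_k @ PE(pos) == PE(pos + k) for every pos."""
    omega = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(omega):
        c, s = np.cos(w * k), np.sin(w * k)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]  # rotation by angle w*k
    return M

d_model, k = 16, 5
M = shift_matrix(k, d_model)  # depends on the offset k only, not on pos
for pos in (0, 7, 123):
    assert np.allclose(M @ pe_vector(pos, d_model), pe_vector(pos + k, d_model))
print("PE(pos + k) is a fixed linear map of PE(pos)")
```

Because the matrix depends only on the offset, a linear layer can in principle learn to attend "k tokens back" regardless of absolute position.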

4. Encoder-Decoder Architecture

Encoder: $N$ identical layers, each composed of:
1. Multi-Head Self-Attention
2. Add & Norm ($x + \text{Sublayer}(x)$ then LayerNorm)
3. Feed-Forward: $\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
4. Add & Norm

Decoder: $N$ identical layers, each with one additional sub-layer (cross-attention):
1. Multi-Head Self-Attention (with causal mask to prevent future information leakage)
2. Multi-Head Cross-Attention (queries from decoder, keys/values from encoder)
3. Feed-Forward
4. Add & Norm after each of the three sub-layers
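
The causal mask used in the decoder's self-attention can be sketched as a lower-triangular matrix. This minimal example uses the convention that 1 means "may attend" and 0 means "blocked"; the random scores are a stand-in for $QK^T / \sqrt{d_k}$, and `causal_mask` and `masked_softmax` are illustrative names.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: 1 where attention is allowed, 0 for future positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

def masked_softmax(scores, mask):
    """Row-wise softmax after pushing masked scores to a large negative value."""
    scores = np.where(mask == 0, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)                   # stand-in for Q @ K.T / sqrt(d_k)
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))                      # entries above the diagonal are 0
```

Row $t$ of the result only places weight on positions up to $t$, so the prediction for a token cannot depend on tokens that come after it.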

5. Architectural Variants

The original Transformer is encoder-decoder, but two derived families dominate today:

| Architecture | Layers | Usage | Examples |
|---|---|---|---|
| Encoder-only | Encoder only | Understanding (classification, NER) | BERT, RoBERTa, DeBERTa |
| Decoder-only | Decoder only | Text generation | GPT, LLaMA, Claude |
| Encoder-Decoder | Both | Translation, summarization | T5, BART, mBART |

Intuition

Before Transformers, RNNs and LSTMs read text word by word, like a slow reader advancing letter by letter. To understand the end of a long sentence, they had to remember the beginning through their hidden state, which caused information loss.

The Transformer, on the other hand, reads the entire sentence at once.

Think of the difference between:
  • Reading a sentence by discovering it letter by letter through a pinhole (RNN)
  • Seeing the entire sentence at once and instantly understanding the connections between words (Transformer)

In the sentence “The cat that the dog had been chasing since this morning took refuge in the nearest tree,” to understand who “took refuge,” you need to go back to “cat.” The RNN must carry “cat” through the nine intervening words, one step at a time, to make this connection. The Transformer directly connects “took refuge” to “cat” in a single attention computation, regardless of the distance.

Moreover, since all tokens are processed in parallel during training, the Transformer trains much faster on GPUs than RNNs do.


Python Implementation

1. Scaled Dot-Product Attention from scratch

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # [seq_len, seq_len]

    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)

    scores = scores - np.max(scores, axis=-1, keepdims=True)
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = weights @ V
    return output, weights

# Usage example
d_k = 64
Q = np.random.randn(10, d_k)  # 10 tokens, 64 dimensions
K = np.random.randn(10, d_k)
V = np.random.randn(10, d_k)

output, attn_weights = scaled_dot_product_attention(Q, K, V)
print(f"Attention weights shape: {attn_weights.shape}")  # (10, 10)

2. Positional Encoding

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(0, max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

# Usage example
pe = positional_encoding(100, 512)
print(f"Positional encoding: {pe.shape}")  # (100, 512)

3. Complete Transformer Encoder Layer

import numpy as np

class TransformerEncoderLayer:
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_ff = d_ff
        self.head_dim = d_model // n_heads
        self.dropout = dropout  # stored only; this NumPy sketch never applies dropout

        # Projections for multi-head attention
        self.W_Q = np.random.randn(d_model, d_model) * 0.01
        self.W_K = np.random.randn(d_model, d_model) * 0.01
        self.W_V = np.random.randn(d_model, d_model) * 0.01
        self.W_O = np.random.randn(d_model, d_model) * 0.01

        # Feed-forward
        self.W_1 = np.random.randn(d_model, d_ff) * 0.01
        self.b_1 = np.zeros(d_ff)
        self.W_2 = np.random.randn(d_ff, d_model) * 0.01
        self.b_2 = np.zeros(d_model)

    def layer_norm(self, x):
        """Normalize over the feature axis (learnable gain and bias omitted for brevity)."""
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + 1e-8)

    def multi_head_attention(self, Q, K, V, mask=None):
        batch_size, q_len = Q.shape[0], Q.shape[1]
        kv_len = K.shape[1]  # may differ from q_len (e.g. cross-attention)
        # Project, then split d_model into n_heads chunks of size head_dim
        Q_h = (Q @ self.W_Q).reshape(batch_size, q_len, self.n_heads, self.head_dim)
        K_h = (K @ self.W_K).reshape(batch_size, kv_len, self.n_heads, self.head_dim)
        V_h = (V @ self.W_V).reshape(batch_size, kv_len, self.n_heads, self.head_dim)
        # [batch, seq, heads, head_dim] -> [batch, heads, seq, head_dim]
        Q_h = Q_h.transpose(0, 2, 1, 3)
        K_h = K_h.transpose(0, 2, 1, 3)
        V_h = V_h.transpose(0, 2, 1, 3)
        # Scaled dot-product scores: [batch, heads, q_len, kv_len]
        scores = Q_h @ K_h.transpose(0, 1, 3, 2) / np.sqrt(self.head_dim)
        if mask is not None:
            scores = np.where(mask == 0, -1e9, scores)  # mask must broadcast to scores
        # Numerically stable softmax over the key axis
        weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        weights = weights / np.sum(weights, axis=-1, keepdims=True)
        heads = weights @ V_h
        # Concatenate heads back to [batch, q_len, d_model], then apply W_O
        heads = heads.transpose(0, 2, 1, 3).reshape(batch_size, q_len, self.d_model)
        return heads @ self.W_O

    def feed_forward(self, x):
        """FFN(x) = ReLU(x @ W_1 + b_1) @ W_2 + b_2"""
        return np.maximum(0, x @ self.W_1 + self.b_1) @ self.W_2 + self.b_2

    def __call__(self, x, mask=None):
        attn_out = self.multi_head_attention(x, x, x, mask)
        x = self.layer_norm(x + attn_out)
        ff_out = self.feed_forward(x)
        x = self.layer_norm(x + ff_out)
        return x

# Demonstration
encoder = TransformerEncoderLayer(256, 8, 1024)
x = np.random.randn(2, 20, 256)
out = encoder(x)
print(f"Encoder output: {out.shape}")

4. Complete Transformer with Keras

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim // num_heads
        )
        self.ffn = keras.Sequential([layers.Dense(ff_dim, activation='relu'), layers.Dense(embed_dim)])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        attn = self.att(inputs, inputs)
        attn = self.dropout1(attn, training=training)
        out1 = self.layernorm1(inputs + attn)
        ffn = self.ffn(out1)
        ffn = self.dropout2(ffn, training=training)
        return self.layernorm2(out1 + ffn)

# Complete model for text classification
vocab_size, max_len, embed_dim = 10000, 200, 256
num_heads, ff_dim, n_layers = 8, 1024, 4

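class TokenAndPositionEmbedding(layers.Layer):
    """Sums token embeddings with learned position embeddings.

    Wrapping both embeddings in a layer keeps the position weights
    inside the model, so they are tracked and trained with it.
    """
    def __init__(self, max_len, vocab_size, embed_dim):
        super().__init__()
        self.max_len = max_len
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(max_len, embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=self.max_len, delta=1)
        # [batch, max_len, embed_dim] + [max_len, embed_dim] broadcasts over the batch
        return self.token_emb(x) + self.pos_emb(positions)

inputs = keras.layers.Input(shape=(max_len,))
x = TokenAndPositionEmbedding(max_len, vocab_size, embed_dim)(inputs)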

for _ in range(n_layers):
    x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)

x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Hyperparameters

| Hyperparameter | Original Paper | Modern (LLM) | Description |
|---|---|---|---|
| $d_{\text{model}}$ | 512 | 4096-8192 | Embedding dimension |
| $n_{\text{heads}}$ | 8 | 32-64 | Number of attention heads |
| $n_{\text{layers}}$ | 6 (enc) + 6 (dec) | 32-128 | Number of stacked layers |
| $d_{\text{ff}}$ | 2048 | 16384 | Feed-forward dimension |
| dropout | 0.1 | 0.0-0.1 | Regularization |
| $L_{\text{max}}$ | 512 | 4096-128000 | Maximum sequence length |
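
To give a feel for how these sizes translate into model size, here is a rough count of the weight-matrix parameters in one encoder layer. It assumes four square attention projections ($W^Q$, $W^K$, $W^V$, $W^O$) and the two feed-forward matrices, and it ignores biases and LayerNorm parameters; `layer_weight_count` is an illustrative helper.

```python
def layer_weight_count(d_model, d_ff):
    """Weight-matrix parameters in one encoder layer (biases and LayerNorm omitted)."""
    attention = 4 * d_model * d_model   # W^Q, W^K, W^V, W^O
    feed_forward = 2 * d_model * d_ff   # W_1 and W_2
    return attention + feed_forward

print(layer_weight_count(512, 2048))    # 3145728 for the original paper's sizes
print(layer_weight_count(4096, 16384))  # 201326592 for d_model = 4096, d_ff = 16384
```

Under this assumption the number of heads does not change the count, because each head works on a slice of size $d_{\text{model}} / n_{\text{heads}}$.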

Advantages of the Transformer

  1. Massive parallelization: Unlike RNNs, which process tokens one by one, the Transformer processes the entire sequence in parallel during training, which makes training substantially faster on GPUs.
  2. Long-range dependencies: The distance between two tokens does not affect the number of operations to connect them.
  3. Universality: The same architecture works for translation, classification, generation, vision (ViT), audio (Whisper), and even tabular data.
  4. Scalability: Performance continues to improve with more data, more parameters, and more compute. This is the property that enabled the scaling of modern LLMs.

Limitations of the Transformer

  1. Quadratic complexity: Attention costs $O(n^2)$ in memory and computation with respect to the sequence length $n$. For a sequence of 100,000 tokens, a single attention matrix already holds $10^{10}$ scores, which is very expensive without specialized attention implementations.
  2. Energy consumption: Training a 175-billion-parameter Transformer is estimated to consume as much energy as hundreds of households use in a year.
  3. Colossal data requirements: Transformers reach their full potential only with datasets of billions of tokens.
  4. Interpretability: Attention weights do not always reflect the dependencies the model actually relies on, so they are an unreliable explanation of its behavior.
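
The quadratic cost can be made concrete with a back-of-the-envelope calculation. The sketch below assumes 2 bytes per score (16-bit values) and a score matrix that is stored in full for one head in one layer; both are simplifying assumptions, and `attention_matrix_bytes` is an illustrative name.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=2):
    """Memory for one seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_value

for n in (1_000, 10_000, 100_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"n = {n:>7,}: {gb:g} GB")
```

Multiplying the sequence length by 10 multiplies this figure by 100: about 0.002 GB at 1,000 tokens, 0.2 GB at 10,000, and 20 GB at 100,000.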

4 Concrete Use Cases

1. Machine Translation (Google Translate)

The original Transformer was designed for English-German and English-French translation. The encoder reads the source sentence, and the decoder generates the target sentence token by token under a causal mask. Its results surpassed the previous RNN-based systems on these tasks, and the architecture is still used in modern translation engines.

2. Legal Document Classification

A law firm uses an encoder-only Transformer (such as BERT or DeBERTa) to automatically classify thousands of contracts by type (NDA, lease, employment contract) and extract specific clauses. Because the model is already pre-trained, fine-tuning can work with as few as a few hundred annotated examples.

3. Generative Language Model (GPT)

GPT-type models use a decoder-only Transformer trained on trillions of tokens to predict the next token. Autoregressive generation produces coherent text over thousands of words, and the resulting models can write, translate, code, and reason.
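
The autoregressive loop itself is simple and can be sketched independently of any real model. In the example below, `toy_next_token_logits` is a hypothetical stand-in for a trained decoder (it just favors the next integer modulo the vocabulary size), and greedy selection is one decoding strategy among several.

```python
import numpy as np

VOCAB_SIZE = 5

def toy_next_token_logits(tokens):
    """Hypothetical stand-in for a trained decoder: favors (last token + 1) mod VOCAB_SIZE."""
    logits = np.zeros(VOCAB_SIZE)
    logits[(tokens[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits

def greedy_generate(prompt, n_new_tokens, logits_fn=toy_next_token_logits):
    """Autoregressive loop: predict one token, append it, feed the longer sequence back in."""
    tokens = list(prompt)
    for _ in range(n_new_tokens):
        logits = logits_fn(tokens)             # scores over the vocabulary
        tokens.append(int(np.argmax(logits)))  # greedy choice: highest-scoring token
    return tokens

print(greedy_generate([0], 6))  # [0, 1, 2, 3, 4, 0, 1]
```

With a real model, `logits_fn` would run the decoder stack on the tokens generated so far, and sampling methods would typically replace the plain `argmax`.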

4. Vision Transformer (ViT)

The Vision Transformer (ViT) applies the Transformer architecture to images by cutting them into patches that act as visual tokens. When pre-trained on sufficiently large image datasets, a ViT achieves performance comparable to state-of-the-art CNNs and scales well as data and model size grow.
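
Cutting an image into patches amounts to a reshape. The sketch below assumes a 224x224 RGB image and 16x16 patches, a common configuration; `patchify` is an illustrative helper, and in a full model each flattened patch would then be linearly projected to $d_{\text{model}}$.

```python
import numpy as np

def patchify(image, patch_size):
    """[H, W, C] image -> [n_patches, patch_size * patch_size * C] flattened patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    gh, gw = H // patch_size, W // patch_size          # patch grid height and width
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)         # group the two grid axes together
    return patches.reshape(gh * gw, patch_size * patch_size * C)

image = np.random.rand(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768)
```

The image thus becomes a sequence of 196 tokens, which the encoder processes exactly like a sentence of 196 words.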


Conclusion

The Transformer is arguably the most influential machine learning architecture of the past decade. By replacing recurrence with attention, it enabled a leap forward in almost all areas of sequence processing. From BERT to GPT, from machine translation to generative language models, this single architecture has demonstrated a remarkable capacity for generalization.

Even though research is now exploring alternatives (Mamba, RWKV, and other state-space or linear-recurrence architectures) to reduce the quadratic complexity, the Transformer remains the industry standard and is likely to remain dominant for years to come.

