GPT: Complete Guide — Generative Pre-trained Transformer
Summary
GPT (Generative Pre-trained Transformer) is a family of language models developed by OpenAI, based on the decoder-only Transformer architecture. Since the publication of GPT-1 in 2018, the family has evolved steadily, with GPT-4 achieving strong reasoning and text generation capabilities. Unlike BERT, which reads context in both directions and is trained to fill in masked tokens, GPT learns to predict the next token in a sequence through self-supervised learning. This guide covers the mathematical principles, practical implementation, and use cases of the architecture that has transformed natural language processing.
Mathematical Principle
The Autoregressive Language Model
At the heart of GPT lies an elegant mathematical idea: modeling the joint probability of a word sequence as a product of conditional probabilities. Mathematically, for a word sequence X = (x_1, x_2, …, x_n), we write:
P(X) = P(x_1) × P(x_2 | x_1) × P(x_3 | x_1, x_2) × ... × P(x_n | x_1, x_2, ..., x_{n-1})
Which can be written compactly as:
P(X) = ∏_{i=1}^{n} P(x_i | x_1, x_2, ..., x_{i-1})
This decomposition is fundamental. Each word is predicted conditionally on all the words that precede it in the sequence. This is called Causal Language Modeling (CLM). The term “causal” means the model only has access to previous tokens — never future tokens.
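The chain rule above can be sketched with a toy model. The probability table `TOY_MODEL` and the function `sequence_probability` are hypothetical names invented for this illustration; a real GPT computes the conditional probabilities with a Transformer rather than a lookup table.

```python
import math

# Hypothetical toy model: the next-token distribution depends on the full prefix.
# The numbers are made up; a real GPT would compute them with a neural network.
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sleeps": 0.8, "runs": 0.2},
}


def sequence_probability(tokens):
    """Apply the chain rule: P(X) = prod_i P(x_i | x_1, ..., x_{i-1})."""
    prob = 1.0
    for i, token in enumerate(tokens):
        prefix = tuple(tokens[:i])        # only past tokens are visible (causal)
        prob *= TOY_MODEL[prefix][token]  # conditional probability of this token
    return prob


p = sequence_probability(["the", "cat", "sleeps"])
print(p)  # product 0.6 * 0.5 * 0.8
```

Each factor only looks at the prefix `tokens[:i]`, which is the "causal" constraint in miniature.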
Loss Function: Cross-Entropy
The training of GPT is based on minimizing the cross-entropy loss function over the entire sequence. For a given sequence X and a model parameterized by θ, the loss is written as:
L(θ) = -∑_{i=1}^{n} log P(x_i | x_1, ..., x_{i-1}; θ)
Each term in the sum is the negative log-probability that the model assigns to the token actually observed at position i, so the loss grows whenever the true next token receives low probability. Minimizing this loss is equivalent to maximizing the likelihood of the pre-training data.
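As a small numerical illustration of the loss, the sketch below sums the negative log-probabilities for two made-up sequences of per-token probabilities. The function name `sequence_nll` and the values are assumptions chosen for the example.

```python
import math


def sequence_nll(step_probs):
    """Sum of -log p over the probabilities assigned to the true tokens."""
    return -sum(math.log(p) for p in step_probs)


# Probability the model gave to the correct next token at each position
# (made-up values for illustration).
confident = [0.9, 0.8, 0.95]
unsure = [0.3, 0.2, 0.25]

loss_confident = sequence_nll(confident)
loss_unsure = sequence_nll(unsure)
print(f"confident: {loss_confident:.3f}, unsure: {loss_unsure:.3f}")

# exp(-loss) recovers the product of the conditional probabilities,
# which is why minimizing the loss maximizes the likelihood.
likelihood = math.exp(-loss_confident)
print(f"likelihood of the confident sequence: {likelihood:.3f}")
```

In practice the sum is usually averaged over tokens and over many sequences, but the principle is the same.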
Decoding Strategies
Text generation with GPT can use several strategies:
- Greedy decoding: at each step, the model selects the token with the highest probability. This approach produces coherent but sometimes repetitive or predictable text.
- Beam search: the model explores k candidate sequences in parallel at each step (where k is the beam width). It keeps the k best partial hypotheses and extends them iteratively. At the end, the sequence with the highest overall score is returned. This method is particularly useful for tasks with a well-defined target output, such as machine translation.
- Top-k sampling: the model only considers the k most probable tokens and redistributes their probabilities proportionally. This avoids very unlikely tokens while introducing some variety in the generation.
- Top-p sampling (nucleus sampling): instead of choosing a fixed number k of tokens, the smallest set of tokens whose cumulative probability reaches a threshold p is selected. The model then samples from this set. This approach is more adaptive than top-k: for an ambiguous context, many tokens will be considered, while for a clear context, the choice naturally narrows.
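To make the difference between greedy decoding and beam search concrete, here is a minimal sketch on a toy model. The table `NEXT` is hypothetical and, to keep the example short, conditions only on the last token, whereas GPT conditions on the whole prefix.

```python
import math

# Hypothetical next-token table keyed by the last token only
# (a simplification; the probabilities are made up).
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}


def beam_search(beam_width, max_steps=3):
    # Each hypothesis is (cumulative log-probability, token list).
    beams = [(0.0, ["<s>"])]
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished hypotheses are carried over
                candidates.append((score, tokens))
                continue
            for token, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Keep only the beam_width best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]


print(beam_search(beam_width=1))  # width 1 behaves like greedy decoding
print(beam_search(beam_width=2))  # width 2 can recover a better overall sequence
```

With a width of 1 the search commits to "a" because it is the best first step, while a width of 2 keeps "b" alive long enough to find the sequence with the higher overall probability.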
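The temperature-scaled softmax that transforms logits z into probabilities is written as:
P(w) = exp(z_w / T) / ∑_{j} exp(z_j / T)
where T is the temperature, a crucial hyperparameter that controls the “sharpness” of the distribution: T < 1 concentrates probability on the top tokens, T > 1 flattens the distribution, and T = 1 leaves the standard softmax unchanged.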
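The softmax formula and the two sampling filters described earlier can be sketched in plain Python. The function names `softmax`, `top_k_filter`, and `top_p_filter` and the example logits are assumptions made for this illustration, not a library API.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: P(w) = exp(z_w / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
probs = softmax(logits, temperature=1.0)
print(top_k_filter(probs, k=2))    # only the two best tokens survive
print(top_p_filter(probs, p=0.9))  # tokens are added until 90% of the mass is covered
```

Sampling then draws the next token from the filtered distribution instead of the full one.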
Intuition: GPT as a Scrabble Player
Imagine an exceptionally well-read Scrabble player. Before placing a letter, they mentally anticipate every possible word. They know the spelling of hundreds of thousands of words in multiple languages by heart. They understand that after “the,” a noun usually follows; that after “because,” a subordinate clause is coming.
That is exactly what GPT does, but on a dizzying scale. It has read billions of pages of text — Wikipedia, books, scientific articles, forums, websites — and has learned the statistical patterns that connect words. When you give it the beginning of a sentence, it uses this colossal knowledge to predict the most likely continuation.
The temperature controls the creativity of its predictions:
- Low temperature (T ≈ 0.1): the model is very confident and deterministic. It will almost always choose the most probable word. The text will be coherent but perhaps repetitive. It is like an ultra-conservative Scrabble player who always plays the word with the maximum score.
- Medium temperature (T ≈ 0.7): the model explores reasonably likely alternatives. The text becomes more natural and varied, reflecting the diversity of human language.
- High temperature (T ≈ 1.5): the model takes risks. It can produce original and creative passages, but also inconsistencies or nonsense. It is a bold Scrabble player who invents words — sometimes brilliant, sometimes absurd.
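The three temperature regimes above can be compared numerically. The sketch below uses made-up logits and reports the probability of the top token and the entropy of the distribution; the helper names are assumptions for this example.

```python
import math


def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [3.0, 2.0, 1.0, 0.0]  # made-up scores for four candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax(logits, t)
    print(f"T={t}: top-token probability={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

As the temperature rises, the top token loses probability mass and the entropy increases, which is the quantitative counterpart of the "bold player" described above.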
This “next word” intuition is remarkably powerful. It suffices, when fueled by enough data and parameters, to produce complex reasoning, functional computer code, and nuanced text.
Python Implementation
Generation with Hugging Face pipeline
The simplest way to try a GPT model is Hugging Face’s generation pipeline, shown here with the openly available GPT-2 checkpoint:
```python
from transformers import pipeline

# Load the text generation pipeline
generate_text = pipeline(
    "text-generation",
    model="gpt2",
    tokenizer="gpt2",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Text generation
prompt = "Artificial intelligence is transforming our society by"
result = generate_text(prompt)
print(result[0]["generated_text"])
```
Configuring Generation Parameters
Generation parameters offer fine-grained control over the decoding process:
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="gpt2",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.2,
    return_full_text=False,
)

prompt = "The latest advances in deep learning"
output = generator(prompt, num_return_sequences=3)

for i, gen in enumerate(output):
    print(f"--- Sequence {i + 1} ---")
    print(gen["generated_text"])
    print()
```
Manual Generation Loop with Logits
For total control over GPT and a deep understanding of the generation process, you can write the token-by-token prediction loop manually:
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()  # inference mode (disables dropout)


def generate_manually(prompt, max_steps=50, temperature=0.8, top_p=0.95):
    """
    Manual generation loop for GPT.
    Fine control over logits and sampling.
    """
    # Encode the prompt
    ids = tokenizer.encode(prompt, return_tensors="pt")

    for step in range(max_steps):
        # Forward pass: obtain logits
        with torch.no_grad():
            output = model(ids)
        logits = output.logits[:, -1, :]  # Last position only

        # Apply temperature
        logits = logits / temperature

        # Top-p sampling (nucleus sampling)
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

        # Mask tokens outside the nucleus; shifting the mask right keeps the
        # first token that pushes the cumulative probability past top_p
        mask = cumulative_probs > top_p
        mask[..., 1:] = mask[..., :-1].clone()
        mask[..., 0] = False
        filtered_probs = sorted_probs.masked_fill(mask, 0.0)
        filtered_probs = filtered_probs / filtered_probs.sum(dim=-1, keepdim=True)

        # Sampling (indices refer to the sorted order, so map them back)
        next_token = torch.multinomial(filtered_probs, 1)
        next_token_index = sorted_indices.gather(1, next_token)

        # Append token to sequence
        ids = torch.cat([ids, next_token_index], dim=-1)
        print(f"Step {step + 1}: {tokenizer.decode(next_token_index[0])}")

        # Stop early if the end-of-text token is produced
        if next_token_index.item() == tokenizer.eos_token_id:
            break

    # Decode the full sequence once generation is finished
    return tokenizer.decode(ids[0], skip_special_tokens=True)


# Run
result = generate_manually(
    "In machine learning,",
    max_steps=30,
    temperature=0.85,
    top_p=0.9,
)
print(f"\nComplete text:\n{result}")
```
Feature Extraction with GPT
GPT models can also be used to extract vector representations (embeddings) from text, usable for many downstream tasks:
```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Load the base model (without language head)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")


def extract_features(text):
    """
    Feature extraction with GPT.
    Returns averaged hidden states as text embedding.
    """
    # Encode
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    # Forward pass
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.last_hidden_state  # [1, seq_len, hidden_dim]

    # Mean pooling over the sequence dimension
    embedding = hidden_states.mean(dim=1)  # [1, hidden_dim]
    return embedding.squeeze()


# Usage example
text = "Transfer learning is a powerful technique."
vector = extract_features(text)
print(f"Embedding dimension: {vector.shape}")
print(f"First values: {vector[:5]}")
```
This feature extraction approach is particularly useful for tasks like text classification, semantic similarity detection, or document clustering.
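For semantic similarity, two embeddings are typically compared with cosine similarity. The sketch below uses short stand-in vectors so that it runs without downloading a model; in practice the inputs would be vectors such as `extract_features(text).tolist()`. The function name `cosine_similarity` is an assumption for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Stand-in embeddings (made-up values, far shorter than real hidden states).
doc_a = [0.2, 0.7, 0.1]
doc_b = [0.25, 0.65, 0.05]
doc_c = [0.9, -0.1, 0.4]

print(f"a vs b: {cosine_similarity(doc_a, doc_b):.3f}")
print(f"a vs c: {cosine_similarity(doc_a, doc_c):.3f}")
```

The score depends only on the direction of the vectors, not on their length, which makes it a common default for comparing pooled hidden states.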
Hyperparameters
GPT hyperparameters directly influence the quality and style of generation:
| Hyperparameter | Description | Typical Value |
|---|---|---|
| temperature | Controls creativity. The higher the temperature, the more varied and unpredictable the generation. | 0.7 – 1.0 |
| top_p | Cumulative probability threshold for nucleus sampling. Reduces the vocabulary space to the most probable tokens. | 0.9 – 0.95 |
| top_k | Number of highest-probability tokens kept at each step. Can be combined with top_p. | 40 – 100 |
| max_new_tokens | Maximum number of tokens generated after the prompt. Limits response length. | 50 – 500 |
| repetition_penalty | Penalty applied to tokens already present in the sequence (1.0 means no penalty). Reduces unwanted repetitions. | 1.0 – 1.5 |
| no_repeat_ngram_size | Size of n-grams forbidden from repeating. For example, 3 prevents any 3-token sequence from repeating. | 2 – 4 |
| do_sample | Enables or disables stochastic sampling. If False, decoding is deterministic: greedy when num_beams=1, beam search otherwise. | True |
| num_beams | Number of beams for beam search. A value of 1 disables beam search. | 1 – 10 |
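Tuning these hyperparameters depends heavily on the task. For faithful text summarization, a low temperature and a restrictive top_k or top_p (or deterministic decoding with do_sample=False) are preferred, so that the output stays close to the most probable continuation. For literary creation, the temperature is increased and top_p is adjusted to balance creativity and coherence.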
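To show what the two anti-repetition settings do to the logits, here is a simplified sketch. It is written in the spirit of how such penalties are commonly implemented and should be read as an assumption, not as the exact code of any library; the function names are invented for this example.

```python
def apply_repetition_penalty(logits, previous_ids, penalty):
    """Make tokens already present in the sequence less likely."""
    adjusted = list(logits)
    for token_id in set(previous_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative ones multiplied,
        # so the score moves down in both cases when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def banned_next_tokens(previous_ids, n):
    """Token ids that would complete an n-gram already seen in previous_ids."""
    prefix = tuple(previous_ids[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(previous_ids) - n + 1):
        if tuple(previous_ids[i:i + n - 1]) == prefix:
            banned.add(previous_ids[i + n - 1])
    return banned


logits = [2.0, -1.0, 0.5]  # made-up scores for token ids 0, 1, 2
print(apply_repetition_penalty(logits, previous_ids=[0, 1], penalty=1.2))

# The sequence ends with (5, 6), which was previously followed by 7,
# so with n=3 the token 7 is banned as the next token.
print(banned_next_tokens([5, 6, 7, 5, 6], n=3))
```

The penalty softly discourages repeats, while the n-gram rule forbids them outright, which is why the two settings are often tuned together.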
Advantages and Limitations
Advantages of GPT
- Fluent text generation: GPT excels at producing natural, coherent, and grammatically correct text, thanks to its training on massive corpora.
- In-context learning: Recent versions of GPT can learn to accomplish new tasks simply from examples provided in the prompt, without adjusting the model’s weights.
- Versatility: A single model generates text, answers questions, writes code, translates, and summarizes, without a task-specific architecture for each.
- Scaling laws: GPT performance follows empirical scaling laws: the training loss decreases smoothly and predictably as model size, data volume, and compute increase, which makes it possible to estimate the benefit of a larger model before training it.
- Zero-shot and few-shot: The model works reasonably well without any examples (zero-shot) and improves significantly with a few demonstrations (few-shot).
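The difference between zero-shot and few-shot prompting comes down to how the prompt is assembled. The helper below is a hypothetical sketch; the "Input:" and "Output:" labels are an assumed convention, not a format required by any model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)


examples = [
    ("I loved this film.", "positive"),
    ("The plot was dull.", "negative"),
]

# Zero-shot: instruction and query only.
zero_shot = build_few_shot_prompt("Classify the sentiment.", [], "Great acting!")

# Few-shot: the same query preceded by demonstrations.
few_shot = build_few_shot_prompt("Classify the sentiment.", examples, "Great acting!")
print(few_shot)
```

No weights are updated in either case; the demonstrations only change the text the model conditions on.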
Limitations of GPT
- Hallucinations: GPT can generate false but plausible information. It doesn’t “know” what is true; it simply produces the most likely continuation according to its training statistics.
- Absence of true reasoning: The model does not perform logical reasoning in the formal sense. It imitates reasoning patterns learned from its data, which can fail on problems requiring rigorous deduction.
- Prompt sensitivity: Minor variations in prompt formulation can produce very different results. This fragility makes reproducibility difficult.
- Computational cost: Training and inference of GPT require significant hardware resources (high-performance GPUs, memory), limiting access to organizations with substantial means.
- Data bias: The model reproduces and amplifies biases present in its training data, which poses serious ethical problems in real-world applications.
- Limited context window: Despite recent improvements, the context size remains finite. Information beyond this window is simply “forgotten” by the model.
4 Concrete Use Cases
1. Generation and Writing Assistance
GPT is widely used for content creation: blog posts, commercial writing, poetic creation. The user provides a theme and a few guidelines, and the model produces a coherent and structured first draft. This application considerably reduces writing time while maintaining acceptable quality for a draft.
```python
from transformers import pipeline

writer = pipeline(
    "text-generation",
    model="gpt2",
    max_new_tokens=200,
    temperature=0.8,
    do_sample=True,  # temperature only takes effect when sampling is enabled
)

theme = "The advantages of federated learning in healthcare:"
draft = writer(theme)
print(draft[0]["generated_text"])
```
2. Software Development Assistance
Advanced versions of GPT understand and generate code in multiple languages. They can explain complex functions, detect bugs, propose corrections, and even convert code from one language to another. This capability transforms developers’ day-to-day productivity.
3. Automatic Document Summarization
GPT can summarize long documents into a few condensed paragraphs. By feeding the model a complete article followed by a prompt like “Summarize this article in three key points:”, a quick synthesis is obtained, which should still be checked against the source given the hallucination risk described above. This is particularly useful for technology watch, academic literature reviews, or news monitoring.
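When an article is longer than the context window, one common workaround is to split it and summarize each part. The sketch below only builds the prompts; the chunking strategy, the word-based budget (a rough stand-in for a token budget), and the function names are assumptions for this illustration.

```python
INSTRUCTION = "Summarize this article in three key points:"


def split_into_chunks(words, max_words):
    """Cut a list of words into consecutive pieces of at most max_words."""
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


def build_summary_prompts(article, max_words=300):
    """Return one summarization prompt per chunk of the article."""
    prompts = []
    for chunk in split_into_chunks(article.split(), max_words):
        prompts.append(" ".join(chunk) + "\n\n" + INSTRUCTION)
    return prompts


article = "one two three four five six seven"  # placeholder text
prompts = build_summary_prompts(article, max_words=3)
for p in prompts:
    print(repr(p))
```

Each prompt can then be sent to the model separately, and the partial summaries combined in a final pass.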
4. Chatbots and Virtual Assistants
The GPT architecture is ideal for conversational systems. Thanks to its ability to maintain dialogue context and produce natural responses, it enables the creation of assistants capable of handling complex conversations, answering precise questions, and adapting their tone to the context of the exchange.
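```python
from transformers import pipeline

# Example: simple chatbot with conditioned generation
history = "User: Hello, can you help me with Python?\nAssistant: "
# temperature only takes effect when sampling is enabled (do_sample=True)
chatbot = pipeline("text-generation", model="gpt2", max_new_tokens=100,
                   temperature=0.7, do_sample=True)
response = chatbot(history)
print(response[0]["generated_text"])
```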
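Maintaining dialogue context amounts to re-sending the conversation so far with each request, trimmed to fit the context window. The following sketch shows one possible way to do this; the function name, the word-based budget (a rough proxy for tokens), and the "drop the oldest turn first" policy are assumptions for this example.

```python
def build_chat_prompt(turns, max_words=200):
    """Format (speaker, text) turns and drop the oldest ones to fit a word budget."""
    kept = list(turns)
    while len(kept) > 1 and sum(len(text.split()) for _, text in kept) > max_words:
        kept.pop(0)  # forget the oldest turn first
    lines = [f"{speaker}: {text}" for speaker, text in kept]
    lines.append("Assistant:")  # cue the model to answer as the assistant
    return "\n".join(lines)


turns = [
    ("User", "one two three"),
    ("Assistant", "four five"),
    ("User", "six seven"),
]

print(build_chat_prompt(turns, max_words=200))  # everything fits
print("---")
print(build_chat_prompt(turns, max_words=4))    # the first turn is dropped
```

The resulting string can be passed to the generation pipeline in place of the fixed `history` string.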
Conclusion
GPT represents a major advancement in artificial intelligence. Its decoder-only architecture, autoregressive learning based on cross-entropy, and sophisticated decoding strategies make it an indispensable tool for natural language processing. With Hugging Face, access to these models is now within reach of all Python developers. However, it remains important to be aware of its limitations — hallucinations, biases, and computational cost — and to use it responsibly.
The future of GPT looks promising: current research is exploring context window extension, reducing hallucinations through reinforcement learning from human feedback (RLHF), and integrating multimodal capabilities. The NLP landscape continues to evolve rapidly, and GPT will remain at the heart of this transformation.