BERT: Complete Guide — Bidirectional Language Models
Overview
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google in 2018, which fundamentally transformed the field of natural language processing (NLP). Unlike earlier architectures that processed text unidirectionally, BERT reads context from both sides simultaneously through its bidirectional self-attention mechanism. This innovation enabled record-breaking performance on numerous tasks: text classification, question answering, named entity recognition, semantic similarity analysis, and many more. Published in the landmark paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT relies on two complementary training strategies — Masked Language Modeling and Next Sentence Prediction — that together produce linguistic representations of unprecedented richness. This guide explores the theoretical foundations, architecture, practical Python implementation, and concrete applications of BERT.
Mathematical Principle
Masked Language Modeling (MLM)
The core of BERT’s innovation lies in Masked Language Modeling. The idea is both simple and powerful: instead of predicting the next word in a sequence (the traditional language modeling approach), BERT randomly masks 15% of input tokens and learns to reconstruct the masked tokens using the complete context of the sentence.
Formally, given a sequence of tokens W = [w_1, w_2, …, w_N], we draw a set M of masked positions, where each position i is selected with probability p = 0.15. The model is trained to minimize the negative log-likelihood of the original tokens at those positions:
L_MLM = – sum_over_i_in_M [ log P(w_i | W_without_M) ]
where W_without_M represents the sequence with masked tokens replaced by the special [MASK] token.
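As a rough illustration, the loss above can be computed from a matrix of model scores. This is a minimal NumPy sketch: the function name `mlm_loss`, the toy shapes, and the uniform logits are assumptions made for the example, not part of any BERT implementation.

```python
import numpy as np

def mlm_loss(logits, target_ids, masked_positions):
    """Sum of -log P(w_i | context) over the masked positions only.

    logits: (seq_len, vocab_size) unnormalized scores from the model
    target_ids: (seq_len,) ids of the original tokens
    masked_positions: indices i that belong to the mask M
    """
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Unmasked positions do not contribute to the loss
    return -sum(log_probs[i, target_ids[i]] for i in masked_positions)

# Toy example: 4 positions, vocabulary of 5 tokens, uniform predictions
logits = np.zeros((4, 5))
target_ids = np.array([1, 3, 0, 2])
loss = mlm_loss(logits, target_ids, masked_positions=[1, 3])
print(loss)
```

With uniform predictions each masked position costs log(5), so the loss only drops when the model puts more probability on the original tokens.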
The 15% masking strategy is broken down more precisely as follows:
- 80% of the time: the token is replaced by [MASK]
- 10% of the time: the token is replaced by a random token from the vocabulary
- 10% of the time: the token is left as-is
This distribution narrows the gap between pre-training and fine-tuning: the [MASK] token never appears during fine-tuning, so a model trained only on [MASK] inputs would learn to rely on an artificial signal. Because any input token may have been replaced at random or left unchanged, the model cannot tell which positions it will be asked to predict, and it has to keep a good contextual representation of every token.
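The selection and replacement rule described above can be sketched in a few lines. This is an illustrative version only: the function name `mask_tokens` and the toy vocabulary are assumptions, and a real pre-training pipeline would also exclude special tokens such as [CLS] and [SEP] from masking.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: keep the token unchanged
    return corrupted, labels

vocab = ["the", "mouse", "escaped", "cables", "laboratory"]
tokens = ["the", "mouse", "escaped", "from", "the", "laboratory"] * 500
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
selected = [i for i, lab in enumerate(labels) if lab is not None]
print(f"selected for prediction: {len(selected) / len(tokens):.1%}")
```

Note that the labels record the original token at every selected position, including those left unchanged, since the model is asked to predict all of them.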
Next Sentence Prediction (NSP)
The second training task is Next Sentence Prediction. Given two sentences A and B, the model must predict whether B actually follows A in the original text (label IsNext) or whether B comes from a different document (label NotNext).
L_NSP = – log P(y | A, B)
where y is the true label of the pair (IsNext or NotNext).
The total loss function combines both objectives:
L_total = L_MLM + L_NSP
This task teaches BERT the logical relationship between successive sentences, which proves particularly useful for applications like question answering and natural language inference (NLI), where understanding coherence between statements is essential.
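To make the task concrete, here is a minimal sketch of how NSP training pairs might be built from a small corpus. The 50/50 split between the two labels follows the original paper; the function name `make_nsp_pairs` and the representation of documents as lists of sentences are assumptions for illustration.

```python
import random

def make_nsp_pairs(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples.

    documents: list of documents, each a list of sentences (at least two documents).
    """
    rng = rng or random.Random()
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: B is drawn from a different document
                others = [j for j in range(len(documents)) if j != doc_idx]
                other_doc = documents[rng.choice(others)]
                pairs.append((doc[i], rng.choice(other_doc), "NotNext"))
    return pairs

docs = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
pairs = make_nsp_pairs(docs, rng=random.Random(1))
for a, b, label in pairs:
    print(a, b, label)
```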
Architecture: Stacked Encoder Layers
BERT’s architecture relies exclusively on stacked Transformer Encoder layers, with no decoder component. Unlike the original Transformer by Vaswani et al. (2017) which used an encoder and decoder for machine translation, BERT retains only the encoder portion.
Two main variants exist:
| Parameter | BERT-Base | BERT-Large |
|---|---|---|
| Layers (L) | 12 | 24 |
| Hidden size (H) | 768 | 1024 |
| Attention heads | 12 | 16 |
| Total parameters | ~110 M | ~340 M |
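The parameter totals in the table can be checked with simple arithmetic. The sketch below assumes a vocabulary of 30,522 tokens, 512 positions, and a feed-forward inner size of 4 × H, as in the published English checkpoints; these values are not stated in the table itself. The number of attention heads does not change the count, because the heads split the same H × H projections.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, ffn_mult=4):
    """Approximate parameter count of a BERT encoder without a task-specific head."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment (2 rows) embedding tables, plus one LayerNorm
    embeddings = (vocab_size + max_positions + 2) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden)
    ffn_hidden = ffn_mult * hidden
    feed_forward = (hidden * ffn_hidden + ffn_hidden) + (ffn_hidden * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    # Dense layer applied to the [CLS] position
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler

base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f} M parameters")
print(f"BERT-Large: {large / 1e6:.1f} M parameters")
```

Both results land close to the rounded ~110 M and ~340 M figures, and most of the weights sit in the encoder layers rather than in the embeddings.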
Each encoder layer applies successively:
- Multi-Head Self-Attention: computes relationships between all tokens in the input sequence simultaneously
- Feed-Forward Network: position-wise non-linear transformation
- Layer Normalization and residual connections (skip connections)
Bidirectional self-attention means that each token can “see” all other tokens in the sequence — both to the left and to the right — which is the very essence of BERT’s bidirectionality.
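The difference between bidirectional and left-to-right attention comes down to a mask on the score matrix. The following single-head NumPy sketch uses random weights and toy sizes purely for illustration; it is not BERT's actual multi-head implementation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product self-attention over x of shape (seq_len, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal:
        # Unidirectional variant: position i may only attend to positions <= i
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

_, bidirectional = self_attention(x, w_q, w_k, w_v)
_, left_to_right = self_attention(x, w_q, w_k, w_v, causal=True)

# Bidirectional: every token gives some weight to every other token
print((bidirectional > 0).all())
# Left-to-right: all weights on later positions are zero
print(np.allclose(np.triu(left_to_right, k=1), 0))
```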
Tokenization: WordPiece
BERT uses the WordPiece tokenization algorithm, a subword segmentation method that decomposes rare or unknown words into more frequent subwords. The vocabulary contains approximately 30,000 tokens.
The process works as follows:
- Common words are kept intact: “the,” “house,” “intelligent”
- Rare words are decomposed: “anticonstitutionally” → “anti”, “##constitution”, “##ally”
- The prefix ## indicates that the subword is part of the previous word
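The segmentation step can be sketched as a greedy longest-match-first search over the vocabulary. The toy vocabulary below is an assumption chosen to reproduce the example above; a real WordPiece vocabulary is learned from a corpus and would likely split the word differently.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known subword covers this position
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"the", "house", "anti", "##constitution", "##ally", "##s"}
print(wordpiece_tokenize("house", toy_vocab))
print(wordpiece_tokenize("houses", toy_vocab))
print(wordpiece_tokenize("anticonstitutionally", toy_vocab))
```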
This approach elegantly solves the out-of-vocabulary (OOV) problem while maintaining a manageable vocabulary size. Representations are constructed by combining three types of embeddings:
E_total = E_token + E_segment + E_position
where E_token is the WordPiece token embedding, E_segment distinguishes sentences A and B, and E_position encodes the absolute position of the token in the sequence.
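The sum of the three embeddings amounts to three table lookups followed by an addition. The sketch below uses small random tables and made-up token ids as stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, hidden = 100, 16, 8  # toy sizes, not BERT's

token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(2, hidden))        # sentence A = 0, sentence B = 1
position_table = rng.normal(size=(max_positions, hidden))

def embed(token_ids, segment_ids):
    """E_total = E_token + E_segment + E_position for one sequence."""
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + segment_table[segment_ids] + position_table[positions]

token_ids = np.array([1, 7, 7, 2, 9, 2])    # hypothetical ids; token 7 appears twice
segment_ids = np.array([0, 0, 0, 0, 1, 1])
e_total = embed(token_ids, segment_ids)
print(e_total.shape)
# The same token at two positions gets two different input vectors
print(np.allclose(e_total[1], e_total[2]))
```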
Intuition
To understand why BERT represents such a major advance, we need to look at what existed before.
Before BERT, traditional language models processed text in one direction. Classical language modeling architectures (such as unidirectional RNNs and LSTMs) read text left to right: each prediction could only use preceding words. Other approaches, like Peters et al.’s ELMo, used two separate models — one reading left to right and one right to left — then concatenated their representations. But this approach remained fundamentally superficial: the two directions never interacted with each other during training.
BERT changes everything by reading both directions simultaneously. Imagine reading a sentence starting from the middle, while perfectly understanding the context on both sides, as if your brain could absorb the entire sentence in one glance. This is exactly what the self-attention mechanism enables: each token interacts directly with all other tokens in a single pass.
Let’s take a concrete example. Consider the sentence: “The mouse escaped from the laboratory after chewing the cables.”
In a left-to-right unidirectional model, the prediction for “mouse” can only use the single preceding word, “The”, and almost any noun would fit. The informative clues (“escaped from the laboratory”, “chewing the cables”) all lie to the right of the word. BERT, by seeing the entire sentence simultaneously, can use them: if “mouse” is masked, the model infers that the missing word is very likely a small animal that escapes and chews, and “mouse” fits this context perfectly.
This ability to build deeply bidirectional contextual representations is the reason BERT set new state-of-the-art results on eleven NLP tasks upon its publication. An analogy: reading text unidirectionally is like assembling a puzzle piece by piece in order, whereas BERT looks at all the pieces at once and understands how they fit together globally.
Python Implementation
Installation
pip install transformers torch datasets scikit-learn numpy
1. Sentiment Classification Pipeline
The easiest way to use BERT is through the Hugging Face pipeline, which allows sentiment classification in just a few lines of code:
from transformers import pipeline

# Load a BERT model fine-tuned for review sentiment, with its matching tokenizer
sentiment_classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    tokenizer="nlptown/bert-base-multilingual-uncased-sentiment"
)
text = "This product is absolutely fantastic, I highly recommend it to everyone!"
result = sentiment_classifier(text)
print(f"Analyzed text: '{text}'")
print(f"Sentiment: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.2%}")
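The model used here was fine-tuned on product reviews in six languages, including French, which is particularly useful for French-language applications. Its label is a rating from 1 to 5 stars rather than a positive/negative tag.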
2. Fine-Tuning BERT on a Custom Dataset
For specific tasks, it is often necessary to adapt BERT to a particular domain. Here is a complete example of fine-tuning on a text classification dataset:
import torch
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# French BERT checkpoint; the classification head on top is newly initialized
model_name = "dbmdz/bert-base-french-europeana-cased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
# allocine reviews are labeled negative (0) or positive (1), hence two labels
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Use a 5,000-review subset and hold out 20% for evaluation
dataset = load_dataset("allocine", split="train[:5000]")
dataset = dataset.train_test_split(test_size=0.2)
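def tokenize(examples):
    # Truncate to BERT's 512-token limit; padding is left to the data collator,
    # which pads each batch only up to its longest sequence
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=512
    )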
tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
def compute_metrics(eval_pred):
    # eval_pred holds the raw logits and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted")
    }
training_args = TrainingArguments(
    output_dir="./bert-french-finetune",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),  # mixed precision only when a CUDA GPU is present
    logging_dir="./logs",
    logging_steps=50,
)
collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)
trainer.train()
results = trainer.evaluate()
print(f"Final results: {results}")
model.save_pretrained("./bert-french-finetune-final")
tokenizer.save_pretrained("./bert-french-finetune-final")
print("Model saved successfully!")
3. Feature Extraction with BERT
BERT can also serve as a contextual representation extractor, useful for clustering, similarity search, or visualization:
from transformers import BertModel, BertTokenizerFast
import torch
model = BertModel.from_pretrained("dbmdz/bert-base-french-europeana-cased")
tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-french-europeana-cased")
model.eval()
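def extract_bert_features(text):
    # Tokenize a single text and return PyTorch tensors
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    )
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Hidden state of the [CLS] token (position 0) serves as the text representation
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return cls_embedding.squeeze().numpy()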
sentence1 = "The film was exciting from beginning to end."
sentence2 = "I did not like this film at all."
emb1 = extract_bert_features(sentence1)
emb2 = extract_bert_features(sentence2)
print(f"Embedding dimension: {emb1.shape}")
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([emb1], [emb2])[0][0]
print(f"Cosine similarity between the two sentences: {similarity:.4f}")
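This approach transforms any text into a 768-dimensional vector (BERT-Base) taken from the [CLS] position. Without fine-tuning, this vector is only a rough proxy for sentence meaning: cosine similarities between unrelated sentences are often high, so scores are best compared relative to one another rather than read as absolute values. Averaging the token vectors, or using a model fine-tuned for sentence similarity, usually gives more reliable results.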
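The [CLS] vector is one pooling choice among several. A common alternative, which the code above does not use, is to average the token vectors while ignoring padding positions. The NumPy sketch below shows the idea on a hand-made array standing in for `last_hidden_state`; the function name `mean_pool` is an assumption.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping padded positions.

    hidden_states: (batch, seq_len, hidden)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sequence of three positions; the last one is padding and must be ignored
hidden_states = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])
pooled = mean_pool(hidden_states, attention_mask)
print(pooled)
```

With the Hugging Face model, the same function could be applied to `outputs.last_hidden_state.numpy()` and `inputs["attention_mask"].numpy()`.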
Hyperparameters
Tuning hyperparameters is crucial for achieving optimal performance with BERT. Here are the most important parameters to consider:
| Hyperparameter | Recommended Value | Description |
|---|---|---|
| max_length | 128–512 | Maximum input sequence length. BERT cannot process more than 512 tokens because its position embeddings were learned for 512 positions only |
| learning_rate | 2e-5 to 5e-5 | Learning rate. BERT is sensitive to this parameter: values that are too high destabilize fine-tuning |
| batch_size | 16–64 | Batch size per device. Larger = more stable, but requires more GPU memory |
| epochs | 2–4 | Number of passes over the training data. Beyond 4 epochs, overfitting is frequent |
| warmup_steps | 10% of total | Proportion of learning rate warmup steps. Essential for stabilizing the start of training |
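To see what a 10% warmup means in practice, the sketch below computes the learning rate at a given step under linear warmup followed by linear decay. The linear decay and the function name `linear_warmup_decay` are assumptions made for the example.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr during the warmup phase
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at the last step
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps))
```

The small early learning rates keep the first updates from disturbing the pre-trained weights while the new classification head is still random.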
Practical tuning tips:
- Start small: first train on a subset of your data (1,000 examples) to validate the pipeline before launching full training
- Monitor overfitting: if training loss continues to decrease but validation loss increases, reduce the number of epochs or increase dropout
- Gradient accumulation: if your GPU memory is insufficient for a large batch_size, use gradient_accumulation_steps to simulate larger batches
- Weight decay: a value of 0.01 to 0.1 helps regularize the model and prevent overfitting
- Reproducible seed: set seed=42 in TrainingArguments to obtain reproducible results between runs
- Mixed precision: enable fp16=True on compatible NVIDIA GPUs to reduce memory usage and speed up training
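The gradient accumulation tip is easier to reason about with the arithmetic written out. The helper below is a hypothetical planning function, not part of the Trainer API, and the exact step count reported by a training run may differ slightly depending on how the last partial batch is handled.

```python
import math

def training_plan(num_examples, per_device_batch_size, accumulation_steps,
                  epochs, num_devices=1, warmup_ratio=0.1):
    """Estimate effective batch size and number of optimizer updates."""
    # Gradients from several small batches are summed before each update
    effective_batch = per_device_batch_size * accumulation_steps * num_devices
    updates_per_epoch = math.ceil(num_examples / effective_batch)
    total_updates = updates_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_updates": total_updates,
        "warmup_updates": int(total_updates * warmup_ratio),
    }

# 4,000 training examples, batches of 4 accumulated over 4 steps, 3 epochs
plan = training_plan(4000, per_device_batch_size=4, accumulation_steps=4, epochs=3)
print(plan)
```

Here a GPU that only fits 4 examples at a time still trains with an effective batch of 16, at the cost of four forward and backward passes per update.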
Advantages and Limitations
Advantages
BERT has major advantages that explain its massive adoption in industry and research:
- Deep bidirectionality: unlike unidirectional models, BERT captures the complete context of each token, producing much richer and more nuanced representations
- Powerful transfer learning: pre-training on gigantic corpora (Wikipedia + BookCorpus, totaling 3.3 billion words) enables extremely efficient transfer learning to specific tasks with little annotated data
- Remarkable versatility: the same base model can be adapted to dozens of different tasks — classification, extraction, question answering, similarity — without major architectural modifications
- Multilingual support: the bert-base-multilingual-cased variant (often called mBERT) covers more than 100 languages, enabling international applications without full retraining
- Mature ecosystem: the Hugging Face Transformers library offers seamless integration with PyTorch and TensorFlow, facilitating experimentation and deployment
- Reproducibility: pre-trained weights are public and freely accessible, ensuring that anyone can reproduce the results
Limitations
Despite its impressive performance, BERT has several important limitations:
- Limited sequence length: the 512-token limit prevents processing of long documents without truncation or complex windowing strategies
- High computational cost: the original pre-training took four days on 16 TPU chips for BERT-Base and on 64 TPU chips for BERT-Large. Even fine-tuning requires a GPU with at least 8 GB of VRAM for reasonable batch sizes
- Inference latency: BERT-Base has about 110 million parameters (roughly 440 MB in 32-bit precision), resulting in significant latency for real-time applications requiring thousands of requests per second
- Hyperparameter sensitivity: the choice of learning rate, number of epochs, and batch size considerably influences final performance, requiring careful search
- Data biases: like all pre-trained models, BERT inherits biases present in its training data (Wikipedia and BookCorpus contain societal and cultural biases)
- Suboptimal tokenization for some languages: WordPiece can produce inefficient segmentations for morphologically rich languages (Finnish, Turkish, Arabic), degrading performance
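The windowing strategy mentioned for long documents can be sketched as overlapping chunks of token ids. In this illustrative version, `stride` denotes the number of tokens shared by consecutive windows, and two slots per window are reserved for [CLS] and [SEP]; the function name and defaults are assumptions.

```python
def sliding_windows(token_ids, max_length=512, stride=128, num_special=2):
    """Split a long sequence into overlapping chunks that fit the model limit."""
    window = max_length - num_special  # room left once [CLS] and [SEP] are added
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    step = window - stride
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last chunk reaches the end of the document
        start += step
    return chunks

token_ids = list(range(1200))  # stand-in for a tokenized long document
chunks = sliding_windows(token_ids)
print([len(c) for c in chunks])
```

Each chunk is then encoded separately, and the per-chunk predictions have to be merged, for example by averaging class scores or keeping the best-scoring answer span.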
Practical Use Cases
1. Sentiment Classification for Customer Review Analysis
Companies use BERT to automatically analyze customer reviews on their e-commerce platforms, social networks, and mobile applications. A model fine-tuned on domain-specific data (restaurant, hospitality, technology) can classify sentiments with accuracy often exceeding 90%. This application allows companies to detect customer dissatisfaction in real time, identify emerging trends, and prioritize responses to negative reviews. Platforms like Allociné or TripAdvisor could use this technology to automatically sort millions of daily reviews.
2. Question Answering System
BERT can be trained to extract the answer to a question directly from a text document. This capability relies on bidirectional understanding: the model reads both the question and the context to identify the most relevant passage. Concretely, this powers customer support chatbots, internal enterprise search systems, and intelligent assistants for technical documentation. For example, a legal aid service could use BERT to instantly search for relevant law articles from a question posed in natural language by a citizen.
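In extractive question answering, a fine-tuned model typically outputs one start score and one end score per token, and the answer is the span with the best combined score. The sketch below uses hand-written scores in place of model outputs; the function name `best_span` and the length cap are assumptions, and a real system would also restrict candidates to tokens from the context rather than the question.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score, with start <= end."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["[CLS]", "who", "wrote", "bert", "?", "[SEP]",
          "jacob", "devlin", "and", "colleagues", "[SEP]"]
# Hand-written scores standing in for the model's outputs
start_logits = np.array([0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 5.0, 1.0, 0.0, 0.5, 0.0])
end_logits = np.array([0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 1.0, 4.5, 0.0, 2.0, 0.0])

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```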
3. Named Entity Recognition (NER)
Named entity extraction consists of identifying and classifying key elements in a text: people, organizations, locations, dates, financial amounts, etc. BERT excels at this task thanks to its deep contextual understanding. In the medical field, BERT can automatically extract critical information from clinical reports: drug names, dosages, diagnoses, consultation dates. In the legal field, it can identify involved parties, judgment references, and limitation periods. This automation considerably reduces the time for manual document processing.
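A token classification model outputs one tag per token, and a small post-processing step groups those tags into entities. The sketch below assumes the widely used BIO scheme (B- opens an entity, I- continues it, O is outside); the tag names and the example sentence are invented for illustration.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entities."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [token])  # open a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # extend the open entity
        else:
            # "O" tag, or an I- tag with no matching open entity (dropped here)
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]

tokens = ["Patient", "received", "500", "mg", "of", "amoxicillin", "on", "March", "3"]
tags = ["O", "O", "B-DOSAGE", "I-DOSAGE", "O", "B-DRUG", "O", "B-DATE", "I-DATE"]
entities = decode_bio(tokens, tags)
print(entities)
```

With WordPiece input, the subword pieces of a word would also need to be merged back before or after this step.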
4. Semantic Similarity and Document Search
Using BERT embeddings as vector representations of texts, it is possible to compute semantic similarity between documents. This approach far surpasses traditional keyword search: two texts can share a deep meaning without using the same terms. Concretely, this technology powers enterprise internal search engines, academic plagiarism detection systems, searches for similar legal case law, and personalized content recommendation. A law student could search for a legal concept and obtain relevant judgment references, even if they use different terminology.
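Once documents are stored as vectors, search reduces to ranking them by cosine similarity with the query vector. The sketch below uses tiny hand-made 2-dimensional vectors in place of real 768-dimensional embeddings; the function name `top_k_similar` is an assumption.

```python
import numpy as np

def top_k_similar(query_vec, doc_matrix, k=3):
    """Return the k (index, cosine similarity) pairs closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity with every document
    order = np.argsort(-scores)[:k]    # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for document embeddings
doc_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.1])
results = top_k_similar(query_vec, doc_matrix, k=2)
print(results)
```

For large collections, the exhaustive matrix product is usually replaced by an approximate nearest-neighbor index.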

