BERT: Complete Guide — Bidirectional Language Models

Overview

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model released by Google in 2018 that reshaped the field of natural language processing (NLP). Unlike earlier architectures that processed text in a single direction, BERT conditions each token on both its left and its right context through bidirectional self-attention. This enabled strong results across many tasks: text classification, question answering, named entity recognition, semantic similarity analysis, and more. Introduced in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT relies on two complementary pre-training objectives, Masked Language Modeling and Next Sentence Prediction. This guide explores the theoretical foundations, architecture, practical Python implementation, and concrete applications of BERT.

Mathematical Principle

Masked Language Modeling (MLM)

The core of BERT’s innovation lies in Masked Language Modeling. The idea is both simple and powerful: instead of predicting the next word in a sequence (the traditional language modeling approach), BERT randomly masks 15% of input tokens and learns to reconstruct the masked tokens using the complete context of the sentence.

Formally, given a sequence of tokens W = [w_1, w_2, …, w_N], we select a set of positions M, where each position i is chosen with probability p = 0.15. The model minimizes the negative log-likelihood of the original tokens at the selected positions:

L_MLM = - sum_over_i_in_M [ log P(w_i | W_without_M) ]

where W_without_M represents the input sequence in which the original tokens at the positions in M have been hidden, in most cases by replacing them with the special [MASK] token.

The 15% masking strategy is broken down more precisely as follows:

80% of the time: the token is replaced by [MASK]
10% of the time: the token is replaced by a random token from the vocabulary
10% of the time: the token is left as-is

This distribution reduces the mismatch between pre-training and fine-tuning: since the [MASK] token never appears during fine-tuning, a model that only ever predicted at [MASK] positions would depend on an artificial signal. Because any input token may have been replaced at random or left unchanged, the model cannot know which positions it will be asked to predict, so it has to maintain a contextual representation of every token.
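The masking procedure above can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: the helper name `mask_tokens` and the toy vocabulary are assumptions, each position is sampled independently, and special tokens are not handled.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where no prediction is required."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected: it does not contribute to L_MLM
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the token as-is
    return corrupted, labels

tokens = "the mouse escaped from the laboratory".split()
vocab = ["the", "mouse", "cat", "escaped", "from", "laboratory", "cables"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Only positions where `labels[i]` is not `None` enter the sum in L_MLM; all other positions serve purely as context.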

Next Sentence Prediction (NSP)

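The second training task is Next Sentence Prediction. Given two sentences A and B, the model must predict whether B actually follows A in the original text (label IsNext) or whether B comes from a different document (label NotNext). During pre-training, each case is sampled 50% of the time. The loss is the cross-entropy on the true label y of the pair:

L_NSP = - log P(y | A, B), where y is either IsNext or NotNext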

The total loss function combines both objectives:

L_total = L_MLM + L_NSP

This task teaches BERT the logical relationship between successive sentences, which proves particularly useful for applications like question answering and natural language inference (NLI), where understanding coherence between statements is essential.
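As a rough illustration of how NSP training pairs could be assembled, the sketch below draws the true next sentence half of the time and a sentence from another document otherwise. The function name `make_nsp_pairs` and the toy documents are hypothetical, and at least two documents are assumed.

```python
import random

def make_nsp_pairs(documents, rng=None):
    """Build (sentence_a, sentence_b, label) triples from documents given as lists of sentences."""
    rng = rng or random.Random()
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B is the sentence that really follows A
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: B is drawn from a different document
                other_idx = rng.choice([j for j in range(len(documents)) if j != doc_idx])
                pairs.append((doc[i], rng.choice(documents[other_idx]), "NotNext"))
    return pairs

documents = [
    ["The mouse escaped.", "It chewed the cables.", "The lab lost power."],
    ["BERT was released in 2018.", "It uses a Transformer encoder."],
]
for sentence_a, sentence_b, label in make_nsp_pairs(documents, rng=random.Random(0)):
    print(label, "|", sentence_a, "->", sentence_b)
```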

Architecture: Stacked Encoder Layers

BERT’s architecture relies exclusively on stacked Transformer Encoder layers, with no decoder component. Unlike the original Transformer by Vaswani et al. (2017) which used an encoder and decoder for machine translation, BERT retains only the encoder portion.

Two main variants exist:

Parameter	BERT-Base	BERT-Large
Layers (L)	12	24
Hidden size (H)	768	1024
Attention heads	12	16
Total parameters	~110 M	~340 M
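The parameter totals in the table can be approximately recovered from the layer count and hidden size. The sketch below is a back-of-the-envelope count under stated assumptions that are not in the table: a vocabulary of 30,522 tokens, 512 positions, 2 segment types, a feed-forward size of 4 × H, and a pooling layer on top of the encoder. The function name is illustrative.

```python
def bert_encoder_params(layers, hidden, vocab_size=30522, max_positions=512, segments=2):
    """Count the weights of a BERT encoder (embeddings, encoder layers, pooler)."""
    ffn = 4 * hidden              # assumed feed-forward inner size
    layer_norm = 2 * hidden       # scale and bias vectors
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = bert_encoder_params(layers=12, hidden=768)
large = bert_encoder_params(layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f} M parameters")
print(f"BERT-Large: {large / 1e6:.1f} M parameters")
```

Under these assumptions the count comes to roughly 109.5 M and 335.1 M, consistent with the rounded ~110 M and ~340 M figures above.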

Each encoder layer applies successively:

Multi-Head Self-Attention: computes relationships between all tokens in the input sequence simultaneously
Feed-Forward Network: position-wise non-linear transformation
Residual (skip) connection followed by layer normalization, applied around each of the two sub-layers above

Bidirectional self-attention means that each token can “see” all other tokens in the sequence — both to the left and to the right — which is the very essence of BERT’s bidirectionality.
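To make the difference with a left-to-right model concrete, here is a small sketch that turns raw attention scores into attention weights, with and without a causal mask. It is a toy illustration in plain Python (single head, no learned projections); the function name and the `causal` flag are assumptions for this example.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over raw attention scores; causal=True hides tokens to the right."""
    weights = []
    for i, row in enumerate(scores):
        # In causal mode, position i may only attend to positions j <= i
        visible = [s if (not causal or j <= i) else -math.inf for j, s in enumerate(row)]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # three tokens, uniform raw scores
bidirectional = attention_weights(scores)
left_to_right = attention_weights(scores, causal=True)
print(bidirectional[0])  # the first token attends to all three tokens
print(left_to_right[0])  # the first token attends only to itself
```

BERT corresponds to the unmasked case: every row spreads its weight over the whole sequence.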

Tokenization: WordPiece

BERT uses the WordPiece tokenization algorithm, a subword segmentation method that decomposes rare or unknown words into more frequent subwords. The vocabulary contains approximately 30,000 tokens.

The process works as follows:

Common words are kept intact: “the,” “house,” “intelligent”
Rare words are decomposed, for example: “anticonstitutionally” → “anti”, “##constitution”, “##ally” (illustrative; the exact split depends on the learned vocabulary)
The prefix ## indicates that the subword is part of the previous word
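At tokenization time, WordPiece splits a word by greedy longest-match-first lookup in the vocabulary. The sketch below illustrates that lookup with a tiny hand-made vocabulary; `toy_vocab` and the function name are assumptions for the example, and a real BERT vocabulary would be loaded from the model files.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"the", "house", "anti", "##constitution", "##ally", "##al"}
print(wordpiece_tokenize("house", toy_vocab))
print(wordpiece_tokenize("anticonstitutionally", toy_vocab))
```

Because the longest match wins, “##ally” is chosen over “##al” for the last piece.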

This approach elegantly solves the out-of-vocabulary (OOV) problem while maintaining a manageable vocabulary size. Representations are constructed by combining three types of embeddings:

E_total = E_token + E_segment + E_position

where E_token is the WordPiece token embedding, E_segment distinguishes sentences A and B, and E_position encodes the absolute position of the token in the sequence.
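The sum of the three embeddings can be illustrated with toy lookup tables. The values and the two-dimensional vectors below are made up for readability; in BERT-Base each table holds learned 768-dimensional vectors.

```python
def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum token, segment, and position vectors element-wise for each input position."""
    embedded = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        vectors = (token_table[token_id], segment_table[segment_id], position_table[position])
        embedded.append([t + s + p for t, s, p in zip(*vectors)])
    return embedded

# Toy tables (illustrative values only)
token_table = {0: [1, 0], 1: [0, 1], 2: [2, 2]}
segment_table = {0: [10, 10], 1: [20, 20]}       # sentence A, sentence B
position_table = [[100, 0], [200, 0], [300, 0]]  # one vector per absolute position

e_total = embed([0, 2, 1], [0, 0, 1], token_table, segment_table, position_table)
print(e_total)
```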

Intuition

To understand why BERT represents such a major advance, we need to look at what existed before.

Before BERT, traditional language models processed text in one direction. Classical language modeling architectures (such as unidirectional RNNs and LSTMs) read text left to right: each prediction could only use preceding words. Other approaches, like ELMo (Peters et al.), trained one LSTM reading left to right and another reading right to left, then concatenated their representations. But this combination remained shallow: inside the network, neither direction could condition on the context seen by the other.

BERT changes everything by reading both directions simultaneously. Imagine reading a sentence starting from the middle, while perfectly understanding the context on both sides, as if your brain could absorb the entire sentence in one glance. This is exactly what the self-attention mechanism enables: each token interacts directly with all other tokens in a single pass.

Let’s take a concrete example. Consider the sentence: “The mouse escaped from the laboratory after chewing the cables.”

Suppose the word “mouse” is masked. A left-to-right unidirectional model must predict it from the single preceding word “The”. It cannot use the words that follow, even though “escaped from the laboratory” and “chewing the cables” strongly suggest a small animal. BERT, by seeing the entire sentence simultaneously, can use this right-hand context, and “mouse” fits it well.

This ability to build deeply bidirectional contextual representations is the reason BERT set new state-of-the-art results on eleven NLP tasks upon its publication, including the GLUE benchmark and SQuAD question answering. An analogy: reading text unidirectionally is like assembling a puzzle piece by piece in order, whereas BERT looks at all the pieces at once and understands how they fit together globally.

Python Implementation

Installation

pip install transformers torch datasets scikit-learn

1. Sentiment Classification Pipeline

The easiest way to use BERT is through the Hugging Face pipeline, which allows sentiment classification in just a few lines of code:

from transformers import pipeline

# Load a multilingual BERT model fine-tuned to predict review star ratings.
# The matching tokenizer is loaded from the same checkpoint by default.
sentiment_classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

text = "This product is absolutely fantastic, I highly recommend it to everyone!"
result = sentiment_classifier(text)  # list with one dict (label, score) per input text

print(f"Analyzed text: '{text}'")
print(f"Sentiment: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.2%}")

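The multilingual model used here supports French among other languages, which is particularly useful for French-language applications. Its labels are star ratings, from “1 star” to “5 stars”, rather than positive/negative classes.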
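If a coarse polarity is needed instead of a star rating, the label can be mapped in a small post-processing step. The helper below is hypothetical and assumes labels of the form “1 star” to “5 stars”; the thresholds are a design choice, not part of the model.

```python
def stars_to_polarity(label):
    """Map a star-rating label such as '4 stars' to a coarse sentiment class."""
    stars = int(label.split()[0])
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

for label in ["1 star", "3 stars", "5 stars"]:
    print(label, "->", stars_to_polarity(label))
```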

2. Fine-Tuning BERT on a Custom Dataset

For specific tasks, it is often necessary to adapt BERT to a particular domain. Here is a complete example of fine-tuning on a text classification dataset:

import torch
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

model_name = "dbmdz/bert-base-french-europeana-cased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
# Allociné reviews are labeled negative (0) or positive (1), hence two classes
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("allocine", split="train[:5000]")
dataset = dataset.train_test_split(test_size=0.2)

def tokenize(examples):
    # Truncate only: DataCollatorWithPadding pads each batch to its longest sequence
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=512
    )

tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted")
    }

training_args = TrainingArguments(
    output_dir="./bert-french-finetune",
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),
    logging_dir="./logs",
    logging_steps=50,
)

collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

trainer.train()

results = trainer.evaluate()
print(f"Final results: {results}")

model.save_pretrained("./bert-french-finetune-final")
tokenizer.save_pretrained("./bert-french-finetune-final")
print("Model saved successfully!")

3. Feature Extraction with BERT

BERT can also serve as a contextual representation extractor, useful for clustering, similarity search, or visualization:

from transformers import BertModel, BertTokenizerFast
import torch

model = BertModel.from_pretrained("dbmdz/bert-base-french-europeana-cased")
tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-french-europeana-cased")

model.eval()

def extract_bert_features(text):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = model(**inputs)

    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return cls_embedding.squeeze().numpy()

sentence1 = "The film was exciting from beginning to end."
sentence2 = "I did not like this film at all."

emb1 = extract_bert_features(sentence1)
emb2 = extract_bert_features(sentence2)

print(f"Embedding dimension: {emb1.shape}")

from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity([emb1], [emb2])[0][0]
print(f"Cosine similarity between the two sentences: {similarity:.4f}")

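This approach transforms any text into a 768-dimensional vector (BERT-Base) taken from the [CLS] position. Semantically similar sentences tend to have embeddings that are close in the vector space. Without task-specific fine-tuning, however, the raw [CLS] vector is only a rough similarity signal; averaging the token vectors (mean pooling) or using a model fine-tuned for sentence similarity usually works better.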
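For reference, the two operations involved in comparing sentence vectors, cosine similarity and mean pooling over non-padding tokens, can be written in plain Python. This is a sketch with toy vectors; the function names are illustrative, and in practice the same computation would be done on tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Toy 2-dimensional "token vectors"; the last position is padding
sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(sentence_vector)
print(cosine(sentence_vector, [2.0, 3.0]))
```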

Hyperparameters

Tuning hyperparameters is crucial for achieving optimal performance with BERT. Here are the most important parameters to consider:

Hyperparameter	Recommended Value	Description
max_length	128–512	Maximum input sequence length. BERT has learned position embeddings for 512 positions only, so longer inputs must be truncated or split
learning_rate	2e-5 to 5e-5	Learning rate. BERT is sensitive to this parameter: values that are too high destabilize fine-tuning
batch_size	16–64	Batch size per device. Larger = more stable, but requires more GPU memory
epochs	2–4	Number of passes over the training data. Beyond 4 epochs, overfitting is frequent
warmup_ratio	0.1 (10% of total steps)	Share of training steps used for learning rate warmup (alternatively, set warmup_steps directly). Essential for stabilizing the start of training
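To see what a 10% warmup means in practice, the sketch below computes the step counts for the fine-tuning example above (4,000 training examples, batch size 16, 3 epochs, a single device assumed) and evaluates a linear warmup followed by linear decay, which is the default schedule shape in the Hugging Face Trainer. The function name is illustrative.

```python
import math

def linear_schedule(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given step: linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

train_examples, batch_size, epochs = 4000, 16, 3
total_steps = math.ceil(train_examples / batch_size) * epochs  # 250 steps per epoch
warmup_steps = total_steps // 10                               # 10% of all steps

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_schedule(step, total_steps, warmup_steps, peak_lr=2e-5)
    print(f"step {step:4d}: lr = {lr:.2e}")
```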

Practical tuning tips:

Start small: first train on a subset of your data (1,000 examples) to validate the pipeline before launching full training
Monitor overfitting: if training loss continues to decrease but validation loss increases, reduce the number of epochs or increase dropout
Gradient accumulation: if your GPU memory is insufficient for a large batch_size, use gradient_accumulation_steps to simulate larger batches
Weight decay: a value of 0.01 to 0.1 helps regularize the model and prevent overfitting
Reproducible seed: set seed=42 in TrainingArguments to obtain reproducible results between runs
Mixed precision: enable fp16=True on compatible NVIDIA GPUs to reduce memory usage and speed up training

Advantages and Limitations

Advantages

BERT has major advantages that explain its massive adoption in industry and research:

Deep bidirectionality: unlike unidirectional models, BERT captures the complete context of each token, producing much richer and more nuanced representations
Powerful transfer learning: pre-training on gigantic corpora (Wikipedia + BookCorpus, totaling 3.3 billion words) enables extremely efficient transfer learning to specific tasks with little annotated data
Remarkable versatility: the same base model can be adapted to dozens of different tasks — classification, extraction, question answering, similarity — without major architectural modifications
Multilingual support: the multilingual variant mBERT (published as bert-base-multilingual-cased) covers more than 100 languages, enabling international applications without full retraining
Mature ecosystem: the Hugging Face Transformers library offers seamless integration with PyTorch and TensorFlow, facilitating experimentation and deployment
Reproducibility: pre-trained weights are public and freely accessible, ensuring that anyone can reproduce the results

Limitations

Despite its impressive performance, BERT has several important limitations:

Limited sequence length: the 512-token limit prevents processing of long documents without truncation or complex windowing strategies
High computational cost: pre-training is expensive; the original paper reports four days on 16 Cloud TPUs (64 TPU chips) for BERT-Large. Even fine-tuning typically calls for a GPU with around 8 GB of VRAM or more for reasonable batch sizes
Inference latency: BERT-Base has about 110 M parameters (roughly 440 MB in 32-bit precision), resulting in significant latency for real-time applications requiring thousands of requests per second
Hyperparameter sensitivity: the choice of learning rate, number of epochs, and batch size considerably influences final performance, requiring careful search
Data biases: like all pre-trained models, BERT inherits biases present in its training data (Wikipedia and BookCorpus contain societal and cultural biases)
Suboptimal tokenization for some languages: WordPiece can produce inefficient segmentations for morphologically rich languages (Finnish, Turkish, Arabic), degrading performance
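One common workaround for the 512-token limit is to split a long document into overlapping windows and process each window separately. The sketch below shows the windowing step on a list of token IDs; the function name and default values are assumptions, and room for special tokens such as [CLS] and [SEP] must be reserved by choosing a smaller `max_length` (for example 510).

```python
def sliding_windows(token_ids, max_length=512, stride=128):
    """Split token_ids into windows of at most max_length; consecutive windows share `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window reaches the end of the document
        start += step
    return windows

document = list(range(1000))  # stand-in for 1,000 token IDs
windows = sliding_windows(document, max_length=512, stride=128)
print([(w[0], w[-1], len(w)) for w in windows])
```

The overlap gives tokens near a window boundary some context on both sides in at least one window; per-window predictions then have to be merged, for instance by averaging or by keeping the highest-scoring one.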

Practical Use Cases

1. Sentiment Classification for Customer Review Analysis

Companies use BERT to automatically analyze customer reviews on their e-commerce platforms, social networks, and mobile applications. A model fine-tuned on domain-specific data (restaurant, hospitality, technology) can classify sentiments with accuracy often exceeding 90%. This application allows companies to detect customer dissatisfaction in real time, identify emerging trends, and prioritize responses to negative reviews. Platforms like Allociné or TripAdvisor could use this technology to automatically sort millions of daily reviews.

2. Question Answering System

BERT can be trained to extract the answer to a question directly from a text document. This capability relies on bidirectional understanding: the model reads both the question and the context to identify the most relevant passage. Concretely, this powers customer support chatbots, internal enterprise search systems, and intelligent assistants for technical documentation. For example, a legal aid service could use BERT to instantly search for relevant law articles from a question posed in natural language by a citizen.
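In the usual extractive setup, a fine-tuned BERT produces a start score and an end score for every token, and the answer is the span with the highest combined score. The sketch below shows that selection step on made-up scores; the function name, the `max_answer_len` parameter, and the numbers are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end] with start <= end."""
    best_start, best_end, best_score = 0, 0, float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_start, best_end, best_score = start, end, score
    return best_start, best_end

tokens = ["BERT", "was", "released", "in", "2018", "."]
start_logits = [0.1, 0.0, 0.2, 0.5, 4.0, 0.0]  # made-up scores
end_logits = [0.0, 0.1, 0.3, 0.2, 3.5, 0.4]
start, end = best_span(start_logits, end_logits)
print(tokens[start:end + 1])
```

Searching over valid pairs, rather than taking the two argmax positions independently, guarantees that the end never precedes the start.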

3. Named Entity Recognition (NER)

Named entity extraction consists of identifying and classifying key elements in a text: people, organizations, locations, dates, financial amounts, etc. BERT excels at this task thanks to its deep contextual understanding. In the medical field, BERT can automatically extract critical information from clinical reports: drug names, dosages, diagnoses, consultation dates. In the legal field, it can identify involved parties, judgment references, and limitation periods. This automation considerably reduces the time for manual document processing.
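NER with BERT is typically framed as token classification: the model assigns one tag per token, and the tags are then grouped into entities. The sketch below shows the grouping step, assuming the common BIO tagging scheme (B- opens an entity, I- continues it, O is outside any entity); the scheme, the function name, and the example tags are assumptions, not something prescribed by BERT itself.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            flush()  # a new entity starts here
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the current entity
        else:
            flush()  # an "O" tag closes any open entity
            current_tokens, current_type = [], None
    flush()
    return entities

tokens = ["Marie", "Curie", "worked", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(decode_bio(tokens, tags))
```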

4. Semantic Similarity and Document Search

Using BERT embeddings as vector representations of texts, it is possible to compute semantic similarity between documents. This approach far surpasses traditional keyword search: two texts can share a deep meaning without using the same terms. Concretely, this technology powers enterprise internal search engines, academic plagiarism detection systems, searches for similar legal case law, and personalized content recommendation. A law student could search for a legal concept and obtain relevant judgment references, even if they use different terminology.


About Salah YAHIAOUI
