Transfer Learning: Complete Guide

Summary: Transfer learning consists of reusing a model pre-trained on a source task to solve a different but related target task. Rather than training a neural network from scratch (which can require millions of data points and days of computation), we start from a model that already performs well and adapt it. It is now standard practice in deep learning, used in computer vision, natural language processing, and audio.


Mathematical Principle

1. Core Concept

The central idea is that the early layers of a deep network learn generic and reusable features:

  • Lower layers (near the input): detect simple patterns (edges, textures, gradients in vision; n-grams, syntax in NLP).
  • Intermediate layers: combine these patterns into more complex structures (shapes, repeating patterns, phrases).
  • Higher layers (near the output): learn task-specific features (faces, wheels, specific keywords).

Transfer learning exploits this principle: we keep the lower layers (generic) and replace the higher layers (specific).

2. Transfer Approaches

Feature Extraction:
We freeze all layers of the pre-trained model and add a new classifier on top:

$$\hat{y} = g_\theta(f_{pre}(x))$$

where $f_{pre}$ is the pre-trained model (frozen parameters) and $g_\theta$ is the new classifier, whose parameters $\theta$ are the only ones trained.
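As a minimal sketch of the formula above, the NumPy snippet below uses a fixed random projection as a stand-in for $f_{pre}$ (an assumption for illustration, not a real pre-trained network) and trains only a logistic-regression classifier $g_\theta$ on its outputs. The names `W_pre`, `f_pre`, `theta` and the toy dataset are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for f_pre: a fixed random projection + ReLU (hypothetical, not a real pre-trained model)
W_pre = rng.normal(size=(20, 8)) / np.sqrt(20)
W_pre_snapshot = W_pre.copy()

def f_pre(x):
    """Frozen feature extractor: its weights are never updated."""
    return np.maximum(x @ W_pre, 0.0)

# Toy binary dataset (assumption: labels depend on the first two inputs)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# g_theta: logistic regression trained on the frozen features
features = f_pre(X)  # computed once, since f_pre never changes
theta = np.zeros(8)
bias = 0.0

def predict(feats):
    return 1.0 / (1.0 + np.exp(-(feats @ theta + bias)))

def bce(p, labels):
    eps = 1e-9
    return float(-np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)))

loss_before = bce(predict(features), y)
for _ in range(300):
    p = predict(features)
    theta -= 0.5 * (features.T @ (p - y) / len(y))  # update theta only
    bias -= 0.5 * float(np.mean(p - y))
loss_after = bce(predict(features), y)

print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Because the features are computed once and reused, this setup is cheap: only the small classifier is optimized.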

Fine-Tuning:
We partially unfreeze the model and retrain certain layers with a low learning rate:

$$\theta_{new} = \theta_{pre} - \eta_{small} \cdot \nabla_\theta L(\theta)$$

The low learning rate ensures that pre-trained weights are not destroyed by updates.
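To make the effect of the learning rate concrete, here is a small sketch that applies the update rule for ten steps on a toy quadratic loss and measures how far the weights move away from $\theta_{pre}$. The loss, the `target` vector, and the two learning rates are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=5)   # stand-in for pre-trained weights
target = theta_pre + 0.5         # hypothetical optimum of the new task

def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * ||theta - target||^2
    return theta - target

def fine_tune(theta, lr, steps=10):
    """Apply theta <- theta - lr * grad(theta) for a fixed number of steps."""
    theta = theta.copy()
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Distance between the updated weights and the pre-trained ones
drift_small = np.linalg.norm(fine_tune(theta_pre, lr=1e-3) - theta_pre)
drift_large = np.linalg.norm(fine_tune(theta_pre, lr=0.5) - theta_pre)

print(f"drift with lr=1e-3: {drift_small:.4f}")
print(f"drift with lr=0.5:  {drift_large:.4f}")
```

On this toy problem the large step simply reaches the optimum faster; on a real network with noisy mini-batch gradients, moving that far in a few steps is what tends to erase the pre-trained features.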

3. Progressive Unfreezing Strategy

The recommended practical strategy is:
1. Phase 1: freeze the entire backbone, train only the classification head.
2. Phase 2: unfreeze the last 20-30% of backbone layers and fine-tune with a low learning rate.
3. Phase 3 (optional): unfreeze more if the target dataset is large.
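The three phases above can be sketched as a plan that marks which backbone layers are trainable in each phase. The helper `unfreeze_plan` and the 10-layer backbone are hypothetical; in Keras, the resulting flags would be applied by setting `layer.trainable` on each layer.

```python
def unfreeze_plan(layer_names, fraction):
    """Return {layer_name: trainable} with the last `fraction` of layers trainable."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_unfrozen = round(len(layer_names) * fraction)
    cutoff = len(layer_names) - n_unfrozen
    return {name: i >= cutoff for i, name in enumerate(layer_names)}

# Hypothetical 10-layer backbone, ordered from input to output
backbone = [f"block{i}" for i in range(1, 11)]

phase1 = unfreeze_plan(backbone, 0.0)  # everything frozen: train the head only
phase2 = unfreeze_plan(backbone, 0.3)  # last 30% of layers unfrozen
phase3 = unfreeze_plan(backbone, 0.6)  # optional: unfreeze more on a large dataset

for name, plan in [("phase1", phase1), ("phase2", phase2), ("phase3", phase3)]:
    print(name, [layer for layer, trainable in plan.items() if trainable])
```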

4. Popular Pre-trained Models

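| Model          | Domain | Parameters | ImageNet Top-1 |
|----------------|--------|------------|----------------|
| VGG16          | Vision | 138M       | 71.3%          |
| ResNet50       | Vision | 25M        | 76.1%          |
| EfficientNetB0 | Vision | 5.3M       | 77.1%          |
| BERT-base      | NLP    | 110M       | n/a            |
| DistilBERT     | NLP    | 66M        | n/a            |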

Intuition

Imagine a classical musician who decides to learn jazz.

Without transfer learning (from scratch): they would learn to hold an instrument, read a score, understand rhythm — like a child discovering music.

With transfer learning: they use their already acquired technique (note reading, finger dexterity, sense of rhythm) and focus only on what changes in jazz (improvisation, swing, extended harmonies). They learn far faster than a beginner would.

This is exactly what transfer learning does: the lower layers of the model are like basic musical technique, largely generic and reusable across related tasks.


Python Implementation

Example 1: Feature Extraction with VGG16

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models

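# Load VGG16 pre-trained on ImageNet, without the classification head.
# Inputs are expected to be preprocessed with
# keras.applications.vgg16.preprocess_input.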
base_model = keras.applications.VGG16(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3),
    pooling='avg'
)

# Freeze all layers
base_model.trainable = False

# Add our classifier
model = models.Sequential([
    base_model,
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Example 2: Fine-Tuning with ResNet50

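from tensorflow import keras
from tensorflow.keras import layers

# Load ResNet50 pre-trained on ImageNet, without the classification head
base = keras.applications.ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3),
    pooling='avg'
)

# Phase 1: freeze everything and train the head.
# Inputs are expected to be preprocessed with
# keras.applications.resnet50.preprocess_input.
base.trainable = False
inputs = keras.Input(shape=(224, 224, 3))
# training=False keeps BatchNormalization in inference mode,
# even once layers are unfrozen in phase 2
x = base(inputs, training=False)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(10, activation='softmax')(x)  # 10 classes
model_ft = keras.Model(inputs, outputs)

# categorical_crossentropy expects one-hot encoded labels
model_ft.compile(optimizer='adam', loss='categorical_crossentropy',
                 metrics=['accuracy'])
# model_ft.fit(train_data, epochs=5)  # phase 1 training

# Phase 2: unfreeze the last 50 layers of the backbone
base.trainable = True
for layer in base.layers[:-50]:
    layer.trainable = False

# Recompile so the change takes effect, with a very low learning rate
model_ft.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])
# model_ft.fit(train_data, epochs=10)  # phase 2 training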

Example 3: Data Augmentation for Transfer

# Data augmentation matters most when the target dataset is small
data_aug = keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
], name='data_augmentation')

# Frozen MobileNetV2 backbone (feature extraction)
backbone = keras.applications.MobileNetV2(weights='imagenet',
    include_top=False, input_shape=(224, 224, 3), pooling='avg')
backbone.trainable = False

model_aug = models.Sequential([
    keras.Input(shape=(224, 224, 3)),
    data_aug,
    layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects inputs in [-1, 1]
    backbone,
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid')
])

# Note: augmentation layers are inactive in inference mode
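To illustrate what "inactive in inference mode" means, the NumPy sketch below mimics a random horizontal flip that only acts when a `training` flag is set. The function `random_horizontal_flip` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
import numpy as np

def random_horizontal_flip(images, training, rng, p=0.5):
    """Flip each image left-right with probability p, only when training."""
    if not training:
        return images  # inference: inputs pass through unchanged
    flip = rng.random(len(images)) < p
    out = images.copy()
    out[flip] = out[flip][:, :, ::-1, :]  # reverse the width axis (NHWC layout)
    return out

rng = np.random.default_rng(0)
batch = rng.random((8, 4, 4, 3))  # 8 tiny "images" in NHWC layout

train_out = random_horizontal_flip(batch, training=True, rng=rng)
infer_out = random_horizontal_flip(batch, training=False, rng=rng)

print("unchanged at inference:", np.array_equal(infer_out, batch))
```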

Hyperparameters

| Hyperparameter         | Typical Value  | Description |
|------------------------|----------------|-------------|
| freeze_layers          | All or partial | How many layers to freeze? The smaller the target dataset, the more you freeze |
| learning_rate_finetune | 1e-5 to 1e-4   | Very low LR for fine-tuning, otherwise pre-trained weights are destroyed |
| learning_rate_head     | 1e-3           | Normal LR for the classification head |
| batch_size             | 16-64          | Smaller for fine-tuning (memory + stability) |
| epochs_phase1          | 5-10           | Feature extraction phase only |
| epochs_phase2          | 10-30          | Fine-tuning phase |
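One possible way to keep these settings together is a small two-phase schedule. The sketch below is an assumption about how such a configuration could be organized: the `PhaseConfig` class and the specific values (picked from within the typical ranges listed here) are illustrative, not prescribed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhaseConfig:
    name: str
    learning_rate: float
    epochs: int
    batch_size: int
    unfreeze_fraction: float  # share of backbone layers left trainable

# Illustrative values chosen inside the typical ranges (assumptions)
SCHEDULE = [
    PhaseConfig("head_only", learning_rate=1e-3, epochs=5,
                batch_size=32, unfreeze_fraction=0.0),
    PhaseConfig("fine_tune", learning_rate=1e-5, epochs=10,
                batch_size=16, unfreeze_fraction=0.3),
]

def validate(schedule):
    """Check that each phase uses a learning rate no higher than the previous one."""
    rates = [phase.learning_rate for phase in schedule]
    return all(later <= earlier for earlier, later in zip(rates, rates[1:]))

print("schedule is valid:", validate(SCHEDULE))
```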

Advantages of Transfer Learning

  1. Performance with little data: A pre-trained model can often reach acceptable performance with only 100-1000 examples per class, whereas training from scratch may require tens of thousands.
  2. Large time savings: Instead of days or weeks of GPU training, fine-tuning takes minutes to hours.
  3. Better convergence: Pre-trained weights are already in a good region of the parameter space. Convergence is faster and more stable.
  4. Accessibility: No need for GPU clusters to benefit from deep learning. A single GPU is often enough to fine-tune a strong pre-trained model.
  5. Industry standard: In 2026, it is rare to train a model from scratch. Transfer learning is the norm in most domains.

Limitations of Transfer Learning

  1. Domain gap: If the source dataset (ImageNet) and target dataset (e.g., medical images) are too different, learned features may not transfer effectively.
  2. Source model bias: Biases present in the source training data propagate to the fine-tuned model (gender, ethnic, cultural bias).
  3. Model size: Pre-trained models like VGG16 weigh more than 500 MB. For edge device deployment, lightweight models like MobileNet are needed.
  4. Overfitting on small datasets: If too many layers are unfrozen on a tiny dataset, the model can overfit rapidly.
  5. Source architecture dependency: If the target dataset has a different resolution or non-standard input format, adaptation requires additional technical adjustments.
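The model-size limitation can be estimated with simple arithmetic: parameter count times bytes per parameter. The sketch below assumes float32 weights (4 bytes each) and uses the rounded parameter counts quoted in this guide; the helper `weights_size_mb` is hypothetical, and real checkpoint files differ slightly.

```python
def weights_size_mb(n_params, bytes_per_param=4):
    """Approximate size of the weights in megabytes (10^6 bytes), float32 by default."""
    return n_params * bytes_per_param / 1e6

# Rounded parameter counts (assumption: taken from the model table in this guide)
param_counts = {"VGG16": 138e6, "ResNet50": 25e6, "EfficientNetB0": 5.3e6}
sizes = {name: weights_size_mb(n) for name, n in param_counts.items()}

for name, size in sizes.items():
    print(f"{name}: ~{size:.0f} MB")
```

Under these assumptions VGG16 comes out at roughly 550 MB, while EfficientNetB0 stays near 20 MB, which is why lightweight architectures are preferred on edge devices.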

4 Concrete Use Cases

1. Pathology Detection in Medical Imaging

A model pre-trained on ImageNet (natural images) is fine-tuned on chest X-rays to detect pneumonia. The first layers of VGG16, which detect edges and textures, transfer well to bone contours and lung textures despite the domain gap. With only a few hundred X-rays, such a model can reach accuracy above 95%, depending on the dataset.

2. E-commerce Product Classification

E-commerce platforms fine-tune vision models to automatically classify product photos by category (clothing, electronics, furniture). A pre-trained ResNet50 can reach 95% accuracy or more with only a few hundred images per category.

3. French Sentiment Analysis with mBERT

Multilingual BERT (mBERT), pre-trained on 104 languages, is fine-tuned on an annotated French corpus for sentiment classification (positive/negative/neutral). Thanks to multilingual transfer, performance can be strong even with limited French data.

4. Industrial Defect Detection

In manufacturing, defect examples are scarce because defects are rare by nature, so a detector has to learn from few of them. A pre-trained model like EfficientNet, fine-tuned with a few hundred defect photos, can reach detection rates above 99%.


Conclusion

Transfer Learning is arguably the most practical technique of modern deep learning. It democratizes access to state-of-the-art models by reducing data and compute requirements by several orders of magnitude.

The recipe is simple: choose a domain-appropriate pre-trained model, freeze the backbone, add a classification head, train that head, then fine-tune the last few layers with a low learning rate.

In 2026, transfer learning is no longer optional: it is the default starting point for most deep learning projects. A team that ignores it risks mediocre results at a much higher cost.

See Also