ResNet: Complete Guide
Summary
ResNet (Residual Network) is a revolutionary architecture introduced by Kaiming He and his collaborators in 2015 in their foundational paper “Deep Residual Learning for Image Recognition.” An ensemble of these networks won first place in the ILSVRC 2015 classification challenge with a top-5 error of only 3.57% on the ImageNet test set, well below the roughly 5% top-5 error commonly cited as an estimate of human performance on this benchmark.
The central innovation of ResNet lies in residual connections (or skip connections), which enable training extremely deep networks — up to 152 layers and well beyond — without suffering from the famous degradation problem that previously limited the depth of convolutional neural networks.
Before ResNet, researchers noticed with puzzlement that increasing a network’s depth beyond a certain limit did not improve performance but worsened it. This phenomenon, called the degradation problem, was not related to overfitting since the error also increased on the training set. ResNet solved this paradox with an elegantly simple idea: instead of forcing each block to learn a complete transformation, it only needs to learn the difference (the residual) from its input. This change of perspective paved the way for networks of over a thousand layers and profoundly transformed the field of computer vision.
Mathematical Principle of ResNet
The Residual Formulation: H(x) = F(x) + x
The fundamental mathematical contribution of ResNet is based on a reformulation of what a block of layers should learn.
In a conventional neural network, a block of n layers directly learns a complex transformation H(x) from input x. The network must model the entire desired function:
output = H(x)
ResNet instead proposes decomposing this transformation into two parts:
H(x) = F(x) + x
where:
- x is the block’s input (transmitted directly via a residual connection, called a skip connection or shortcut).
- F(x) is the residual learned by the block’s layers. It is the difference between the desired transformation H(x) and the identity x.
- H(x) is the block’s final output, obtained by adding the residual F(x) to the input x.
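As a minimal illustration of this decomposition, the sketch below wraps a residual connection around an arbitrary function on plain Python lists. The names `residual_block`, `zero_residual`, and `small_residual` are hypothetical helpers chosen for this example, not part of any library.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape to be added")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """F(x) = 0: the block has nothing to add."""
    return [0.0 for _ in x]


def small_residual(x):
    """F(x) = 0.5 * x: a small correction on top of the input."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # the input passes through unchanged
print(residual_block(x, small_residual))  # input plus a small correction
```

When the residual function returns zeros, the block reduces to the identity, which is the case the formulation is designed to make easy.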
Why is this reformulation so effective? Several deep mathematical reasons explain this success:
1. Initialization toward identity. If the optimal transformation is simply the identity (i.e., the network needs to modify nothing at this level), then driving F(x) toward 0 is sufficient. It is much easier for a network to push its weights toward zero than to fit an identity mapping through a stack of nonlinear layers. Convolution weights are also typically initialized with small random values, so F(x) tends to be small at the start of training and the block starts out close to the identity solution.
2. Direct gradient flow. During backpropagation, the gradient arriving at the block can flow directly through the residual connection, without passing through the nonlinear transformations of the internal layers. Mathematically, if the final loss is ℒ, then:
∂ℒ/∂x = ∂ℒ/∂H × ∂H/∂x = ∂ℒ/∂H × (∂F/∂x + I)
The +I term (identity matrix) guarantees that part of the gradient reaches the initial layers directly, considerably mitigating the vanishing gradient problem. It is like building a highway for the gradient, allowing it to travel from the end of the network to the beginning without being diluted by dozens of successive multiplications.
3. Progressive incremental learning. Instead of having to learn an arbitrarily complex function from the start, the network can begin by learning subtle modifications (small residuals) and then gradually refine them. This approach corresponds to an intuitive form of curriculum learning: the network starts from a reasonable base (the identity, or passing the input through as-is) and adds increasingly precise and meaningful transformations step by step.
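The effect of the +I term can be seen on a toy chain of scalar linear layers, where each layer's local derivative ∂F/∂x is a constant w. This is a deliberately simplified sketch (no nonlinearity, one shared w, hypothetical function names), meant only to show how the two gradient products differ.

```python
def plain_chain_gradient(w, depth):
    """d(output)/d(input) for h <- w * h repeated `depth` times."""
    grad = 1.0
    for _ in range(depth):
        grad *= w          # each layer multiplies the gradient by dF/dx = w
    return grad


def residual_chain_gradient(w, depth):
    """d(output)/d(input) for h <- h + w * h repeated `depth` times."""
    grad = 1.0
    for _ in range(depth):
        grad *= (w + 1.0)  # dH/dx = dF/dx + 1, the scalar version of dF/dx + I
    return grad


w, depth = 0.01, 50
print(plain_chain_gradient(w, depth))     # vanishingly small
print(residual_chain_gradient(w, depth))  # stays on the order of 1
```

With a larger w the residual product can grow instead of shrinking, which is one reason normalization and careful initialization are still used alongside skip connections.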
The Two Types of Residual Blocks
ResNet defines two types of blocks depending on the compatibility of dimensions between the input and the output:
Identity Block
This block is used when the dimensions of the input x and the residual F(x) are identical. The addition is then direct and immediate:
output = F(x) + x
In this case, the residual connection is a simple element-wise addition. No transformation is needed because the tensors already have the same shape (same spatial dimensions and same number of channels).
Projection Block
When the dimensions do not match — typically because the block modifies the image’s spatial size (via a stride of 2 in a convolution) or changes the number of filters — the dimensions of x must be adapted before the addition. A 1×1 convolution is then used on the residual connection:
output = F(x) + W_s · x
where W_s is a 1×1 convolution without a bias term (projection convolutions typically use use_bias=False because the bias would be redundant with the batch normalization that follows). This convolution adapts both the number of channels and the spatial resolution if necessary.
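The choice between the two shortcut types comes down to a shape check. The sketch below assumes channels-last tensors and "same" padding; `shortcut_plan` is a hypothetical helper used only for illustration.

```python
import math


def shortcut_plan(input_shape, filters, stride=1):
    """Decide which shortcut a block needs and the shape it must produce.

    input_shape is (height, width, channels); "same" padding is assumed,
    so a stride of s maps a spatial size n to ceil(n / s).
    """
    height, width, channels = input_shape
    out_shape = (math.ceil(height / stride), math.ceil(width / stride), filters)
    needs_projection = stride != 1 or channels != filters
    kind = "projection (1x1 conv)" if needs_projection else "identity"
    return kind, out_shape


print(shortcut_plan((32, 32, 64), filters=64, stride=1))
print(shortcut_plan((32, 32, 64), filters=128, stride=2))
```

The first call keeps the identity shortcut because nothing changes shape; the second needs a projection because both the resolution and the channel count differ.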
Deep Architectures: ResNet-34 and ResNet-50+
ResNet offers several variants that differ in their number of layers and the structure of their fundamental blocks:
ResNet-34 (and lighter versions) — Basic Block:
The ResNet-18 and ResNet-34 versions use a simple fundamental block composed of two consecutive 3×3 convolutions, each followed by batch normalization. A ReLU follows the first convolution, and the second ReLU is applied after the addition with the shortcut. This block is effective but becomes relatively expensive at large channel widths because it contains two full 3×3 convolutions.
ResNet-50, ResNet-101, and ResNet-152 — Bottleneck Block:
For deeper architectures, ResNet uses a more sophisticated bottleneck block, composed of three convolutions:
1×1 (reduction) → 3×3 (processing) → 1×1 (expansion)
This design is ingenious and computationally efficient:
- The first 1×1 convolution reduces the number of channels (e.g., from 256 to 64), considerably decreasing the computational cost of the central convolution.
- The second 3×3 convolution works on a reduced feature space, which is much less computationally expensive.
- The third 1×1 convolution restores the original number of channels (from 64 to 256).
For example, a ResNet-50 bottleneck block with 64-64-256 channels performs about 94% fewer multiply-accumulate operations than a basic block applying two 3×3 convolutions directly on 256 channels (roughly 70 thousand versus 1.18 million per spatial position, ignoring the shortcut). Without this bottleneck, training ResNet-152 would be far more expensive.
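The saving can be checked with a rough count of multiply-accumulates per spatial position. The comparison point is an assumption made for this sketch: a basic block whose two 3×3 convolutions both operate on 256 channels. Batch normalization, activations, and the shortcut are ignored, and `conv_macs` is a hypothetical helper.

```python
def conv_macs(kernel_size, in_channels, out_channels):
    """Multiply-accumulates per output position for one convolution."""
    return kernel_size * kernel_size * in_channels * out_channels


# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = conv_macs(1, 256, 64) + conv_macs(3, 64, 64) + conv_macs(1, 64, 256)

# Assumed comparison point: two 3x3 convolutions kept at 256 channels
basic_256 = 2 * conv_macs(3, 256, 256)

saving = 1 - bottleneck / basic_256
print(bottleneck, basic_256, f"{saving:.1%}")
```

Under these assumptions the bottleneck needs 69,632 multiply-accumulates per position against 1,179,648 for the wide basic block, a reduction of about 94%.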
Here is the typical structure of ResNet-50:
- Initial layer: 7×7 convolution with 64 filters (stride=2) + 3×3 max pooling
- 4 bottleneck block stages:
  - Stage 1: 3 blocks (64 → 64 → 256 channels)
  - Stage 2: 4 blocks (128 → 128 → 512 channels)
  - Stage 3: 6 blocks (256 → 256 → 1024 channels)
  - Stage 4: 3 blocks (512 → 512 → 2048 channels)
- Global Average Pooling + final dense layer (softmax)
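The number in each variant's name follows from this layout: one stem convolution, the convolutions inside the blocks, and the final dense layer. The short sketch below reproduces that arithmetic; `count_weighted_layers` is a hypothetical helper, and projection shortcuts are left out of the count, as is conventional.

```python
def count_weighted_layers(blocks_per_stage, convs_per_block):
    """Stem convolution + all block convolutions + final dense layer."""
    stem = 1
    classifier = 1
    return stem + convs_per_block * sum(blocks_per_stage) + classifier


stages = [3, 4, 6, 3]  # blocks per stage, shared by ResNet-34 and ResNet-50
print(count_weighted_layers(stages, convs_per_block=3))  # bottleneck blocks
print(count_weighted_layers(stages, convs_per_block=2))  # basic blocks
```

The same stage layout therefore gives 50 layers with bottleneck blocks and 34 with basic blocks; ResNet-101 and ResNet-152 use `[3, 4, 23, 3]` and `[3, 8, 36, 3]` bottleneck blocks respectively.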
Intuition: The Artist and the Sketch
To deeply understand why ResNet works so well, imagine the following analogy:
Think of an artist creating a portrait. Two approaches are possible:
Classic approach (conventional network): The artist starts with a completely white canvas and must paint the entire portrait, stroke by stroke, color by color. Each layer of the network is like an artist who must add a significant contribution. For the early layers, it is easy: lay out the general outlines, define the composition. But for the deep layers, the task becomes extraordinarily difficult — each new layer must precisely modify the existing image without destroying it, like an artist painting over previous layers while having to preserve what has already been done.
This is exactly the problem with conventional deep networks: each layer must learn a complete and meaningful transformation. When the network is very deep, the layers farthest from the input must manage representations that have become extremely abstract and complex, which is numerically unstable and difficult to optimize.
ResNet approach (residual network): Now imagine the artist starts with a rough but reasonable sketch of the face. Their task no longer involves painting the entire portrait, but simply making touch-ups to the existing sketch. The first brushstrokes correct the shape of the eyes, the next ones adjust the shadow of the nose, then the color of the lips is refined, and so on.
This is exactly what ResNet does! The input x is the sketch, and F(x) are the touch-ups. If a deep layer realizes it has nothing relevant to add — that it cannot improve the representation — it can simply learn F(x) ≈ 0 and let the sketch pass through as-is. No information is lost, no damage is done.
This analogy explains why ResNet solves the degradation problem: even if some deep layers do not contribute significantly, the skip connections give information from earlier layers a direct path forward, so it is far less likely to be destroyed. The network can be made much deeper with little risk of “breaking” the useful representations of the early layers.
Furthermore, deep layers can pass information directly through skip connections, as if the original sketch passed through the entire canvas to the final result, undergoing only minor adjustments at each step. This direct transmission capability is what allows training networks of over a hundred layers without information degradation.
Python Implementation with Keras
Here is a complete implementation of ResNet from scratch, using Keras’s functional API. We will present the basic block, the bottleneck block, and then training on CIFAR-10.
Basic Residual Block
import tensorflow as tf
from tensorflow.keras.layers import (
    Conv2D, BatchNormalization, Activation, Add,
    Input, GlobalAveragePooling2D, Dense, MaxPooling2D
)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2


def basic_block(x, filters, stride=1, name_prefix=None):
    """
    Basic residual block: two 3x3 convolutions.
    Used in ResNet-18 and ResNet-34.

    Arguments:
        x           : input tensor
        filters     : number of output filters
        stride      : stride of the first convolution (2 to reduce resolution)
        name_prefix : prefix for naming layers

    Returns:
        The block's output tensor
    """
    shortcut = x

    # --- Main branch (F(x)) ---
    x = Conv2D(
        filters, kernel_size=3, strides=stride,
        padding="same", use_bias=False,
        kernel_regularizer=l2(1e-4),
        name=f"{name_prefix}_conv1"
    )(x)
    x = BatchNormalization(name=f"{name_prefix}_bn1")(x)
    x = Activation("relu", name=f"{name_prefix}_relu1")(x)

    x = Conv2D(
        filters, kernel_size=3, strides=1,
        padding="same", use_bias=False,
        kernel_regularizer=l2(1e-4),
        name=f"{name_prefix}_conv2"
    )(x)
    x = BatchNormalization(name=f"{name_prefix}_bn2")(x)

    # --- Residual connection (projection if necessary) ---
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = Conv2D(
            filters, kernel_size=1, strides=stride,
            padding="same", use_bias=False,
            kernel_regularizer=l2(1e-4),
            name=f"{name_prefix}_proj"
        )(shortcut)
        shortcut = BatchNormalization(name=f"{name_prefix}_proj_bn")(shortcut)

    # --- Addition and activation ---
    x = Add(name=f"{name_prefix}_add")([x, shortcut])
    x = Activation("relu", name=f"{name_prefix}_relu_final")(x)
    return x
Bottleneck Block
def bottleneck_block(x, filters, stride=1, expand_ratio=4, name_prefix=None):
    """
    Bottleneck residual block: 1x1 (reduction) → 3x3 → 1x1 (expansion).
    Used in ResNet-50, ResNet-101, and ResNet-152.

    Arguments:
        x            : input tensor
        filters      : number of intermediate filters (bottleneck)
        stride       : stride of the 3x3 convolution
        expand_ratio : expansion factor (default 4)
        name_prefix  : prefix for naming layers

    Returns:
        The block's output tensor
    """
    shortcut = x
    expanded_filters = filters * expand_ratio  # e.g. 64 * 4 = 256

    # --- Main branch (F(x)) ---
    # Step 1: reduction (1x1)
    x = Conv2D(
        filters, kernel_size=1, strides=1,
        padding="valid", use_bias=False,
        kernel_regularizer=l2(1e-4),
        name=f"{name_prefix}_conv1"
    )(x)
    x = BatchNormalization(name=f"{name_prefix}_bn1")(x)
    x = Activation("relu", name=f"{name_prefix}_relu1")(x)

    # Step 2: main 3x3 convolution
    x = Conv2D(
        filters, kernel_size=3, strides=stride,
        padding="same", use_bias=False,
        kernel_regularizer=l2(1e-4),
        name=f"{name_prefix}_conv2"
    )(x)
    x = BatchNormalization(name=f"{name_prefix}_bn2")(x)
    x = Activation("relu", name=f"{name_prefix}_relu2")(x)

    # Step 3: expansion (1x1)
    x = Conv2D(
        expanded_filters, kernel_size=1, strides=1,
        padding="valid", use_bias=False,
        kernel_regularizer=l2(1e-4),
        name=f"{name_prefix}_conv3"
    )(x)
    x = BatchNormalization(name=f"{name_prefix}_bn3")(x)

    # --- Residual connection (projection if necessary) ---
    if stride != 1 or shortcut.shape[-1] != expanded_filters:
        shortcut = Conv2D(
            expanded_filters, kernel_size=1, strides=stride,
            padding="same", use_bias=False,
            kernel_regularizer=l2(1e-4),
            name=f"{name_prefix}_proj"
        )(shortcut)
        shortcut = BatchNormalization(name=f"{name_prefix}_proj_bn")(shortcut)

    # --- Addition and final activation ---
    x = Add(name=f"{name_prefix}_add")([x, shortcut])
    x = Activation("relu", name=f"{name_prefix}_relu_final")(x)
    return x
Building the Complete ResNet-34 Model
def build_resnet34(input_shape=(32, 32, 3), num_classes=10):
    """
    Builds a ResNet-34 model using Keras's functional API.

    Architecture:
        - Initial layer + MaxPooling
        - Stage 1: 3 basic blocks (filters=64)
        - Stage 2: 4 basic blocks (filters=128)
        - Stage 3: 6 basic blocks (filters=256)
        - Stage 4: 3 basic blocks (filters=512)
        - GAP + Dense

    Returns: compiled Keras model
    """
    inputs = Input(shape=input_shape)

    # Initial layer (first conv); stride 1 keeps the 32x32 CIFAR resolution
    x = Conv2D(
        64, kernel_size=3, strides=1, padding="same",
        use_bias=False, kernel_regularizer=l2(1e-4),
        name="stem_conv"
    )(inputs)
    x = BatchNormalization(name="stem_bn")(x)
    x = Activation("relu", name="stem_relu")(x)
    x = MaxPooling2D(pool_size=3, strides=1, padding="same", name="stem_pool")(x)

    # Stage 1: (64 filters, 3 blocks)
    x = basic_block(x, filters=64, stride=1, name_prefix="stage1_block1")
    x = basic_block(x, filters=64, stride=1, name_prefix="stage1_block2")
    x = basic_block(x, filters=64, stride=1, name_prefix="stage1_block3")

    # Stage 2: (128 filters, 4 blocks)
    x = basic_block(x, filters=128, stride=2, name_prefix="stage2_block1")
    for i in range(2, 5):
        x = basic_block(x, filters=128, stride=1, name_prefix=f"stage2_block{i}")

    # Stage 3: (256 filters, 6 blocks)
    x = basic_block(x, filters=256, stride=2, name_prefix="stage3_block1")
    for i in range(2, 7):
        x = basic_block(x, filters=256, stride=1, name_prefix=f"stage3_block{i}")

    # Stage 4: (512 filters, 3 blocks)
    x = basic_block(x, filters=512, stride=2, name_prefix="stage4_block1")
    x = basic_block(x, filters=512, stride=1, name_prefix="stage4_block2")
    x = basic_block(x, filters=512, stride=1, name_prefix="stage4_block3")

    # Global Average Pooling + classification
    x = GlobalAveragePooling2D(name="global_avg_pool")(x)
    outputs = Dense(num_classes, activation="softmax", name="predictions")(x)

    model = Model(inputs, outputs, name="ResNet34")
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model


# Build and summarize
model = build_resnet34()
model.summary()
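To see what the stride settings above do to a 32×32 CIFAR image, the feature-map size can be traced by hand. The sketch assumes "same" padding everywhere, as in the blocks above; `trace_resolution` is a hypothetical helper.

```python
import math


def trace_resolution(input_size, stage_strides):
    """Spatial size after each stage, assuming "same" padding throughout."""
    sizes = []
    size = input_size
    for stride in stage_strides:
        size = math.ceil(size / stride)
        sizes.append(size)
    return sizes


# Stem conv and pooling use stride 1 here, so the stages start at 32x32.
# Stage strides 1, 2, 2, 2 match the first block of each stage above.
print(trace_resolution(32, [1, 2, 2, 2]))
```

The last stage therefore produces a 4×4 map with 512 channels, which global average pooling reduces to a 512-dimensional vector before the dense layer.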
Training on CIFAR-10 and Comparison with VGG
import tensorflow as tf  # build_resnet34 is defined in the previous section

# Load CIFAR-10 data and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Build ResNet-34 model
model = build_resnet34(input_shape=(32, 32, 3), num_classes=10)

# Callbacks: learning rate reduction and early stopping
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.1, patience=5, min_lr=1e-6
    ),
    tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=15, restore_best_weights=True
    ),
]

# Training (on a modern GPU, up to ~2-4h for 50 epochs, depending on hardware)
history = model.fit(
    x_train, y_train,
    validation_split=0.1,
    epochs=50,
    batch_size=128,
    callbacks=callbacks,
    verbose=1,
)

# Evaluation
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"ResNet-34 accuracy on CIFAR-10: {test_acc:.4f}")

# --- Comparison with VGG-16 ---
# VGG-16 is a "plain" network (without skip connections) with 16 layers.
# On CIFAR-10, with the same number of epochs, VGG-16 reaches ~90-92%
# while ResNet-34 typically reaches ~92-94% (both figures depend on data
# augmentation and the learning rate schedule), suggesting that the extra
# depth is genuinely useful rather than harmful.
#
# The real advantage of ResNet shows on ImageNet (top-5 error):
#   VGG (ILSVRC 2014 ensemble):     ~7.3%
#   ResNet-50 (single model):       ~5.3%
#   ResNet-152 (single model):      ~4.5%
#   ResNet ensemble (ILSVRC 2015):   3.57%
# ResNet-34 has roughly twice as many layers as VGG-16, yet the ResNet paper
# reports about 3.6 billion FLOPs for it, around 18% of VGG-19's 19.6 billion.
Key Hyperparameters
| Hyperparameter | Description | Typical Values | Impact |
|---|---|---|---|
| Depth | Number of weighted layers (convolutions plus the final dense layer) | 18, 34, 50, 101, 152 | Deeper = more expressive but more expensive |
| Block type | basic (2 convs) vs bottleneck (3 convs) | Basic (ResNet-18/34), Bottleneck (50+) | Bottleneck cuts per-block computation by ~94% versus two 3×3 convs at the same output width |
| Initial filters | Number of filters in the first stage | 64 | Determines the network’s overall capacity |
| Strides | Spatial reduction between stages | 1 or 2 | Stride=2 halves the resolution |
| Expansion ratio | Bottleneck ratio | 4 (standard) | Controls the intermediate dimension |
| Learning rate | Initial learning rate | 0.001 (Adam) or 0.1 (SGD) | Multiplicative decay recommended |
| Batch size | Training batch size | 128-256 | Affects batch normalization stability |
| Weight decay | L2 regularization | 1e-4 | Prevents overfitting on small datasets |
Important recommendation: for ResNet, SGD with momentum (0.9) and a step-decay learning rate schedule (×0.1 at epochs 30, 60, 90) remains the reference optimizer to achieve the best performance on ImageNet. In practice, Adam converges faster and also gives excellent results, particularly useful for prototyping and training on smaller datasets.
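The step-decay schedule mentioned above can be written as a small function. This is a sketch under the stated milestones (epochs 30, 60, 90, counted from 0) and a base rate of 0.1; `step_decay` and its parameters are hypothetical names.

```python
def step_decay(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=0.1):
    """Learning rate for a given 0-indexed epoch under step decay."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * (factor ** drops)


for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_decay(epoch))
```

In Keras, a schedule like this can be attached with `tf.keras.callbacks.LearningRateScheduler`. Because that callback calls the schedule with the epoch and the current learning rate, it is safer to wrap the function, for example `LearningRateScheduler(lambda epoch, lr: step_decay(epoch))`, so that the current rate is not mistaken for `base_lr`.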
Advantages and Limitations
Advantages
- Depth without degradation — Skip connections eliminate the degradation problem: networks of 150+ layers can be trained and continue to improve with depth, unlike conventional networks.
- Preserved gradient flow — The gradient flows directly through residual connections, massively mitigating the vanishing gradient and making optimization much more stable.
- Exceptional generalization — Representations learned by ResNet are remarkably transferable to other tasks, domains, and modalities.
- Elegant and modular architecture — The block-independent design is extremely clean and easy to adapt, extend, or combine with other architectures.
- Lasting and fundamental impact — The residual connection concept has become a universal standard, reused in practically all modern architectures (EfficientNet, RegNet, ConvNeXt, and even transformers, which wrap their attention and feed-forward sublayers in residual connections).
- Compatibility with normalization — The order Conv → BN → ReLU → Conv → BN → Add → ReLU (the original post-activation design) has proven particularly stable and effective; the later pre-activation variant instead places BN and ReLU before each convolution.
Limitations
- High computational cost — ResNet-152 requires about 11 billion operations (FLOPs) for a single image, which is considerable for real-time deployment or embedded devices.
- Significant memory footprint — Intermediate activations must be stored in memory during training, limiting the maximum batch size on resource-constrained GPUs.
- Redundancy of deep layers — Studies have shown that in very deep ResNets, many layers learn near-zero residuals (F(x) ≈ 0), suggesting that some layers are redundant and the network could be compressed without significant performance loss.
- Not optimal for tasks requiring high spatial precision — ResNet progressively downsamples its feature maps (by a factor of 32 overall in the standard ImageNet design), so fine spatial information is lost as depth increases. For tasks like semantic segmentation or precise object detection, designs such as U-Net or FPN, which recombine features from several resolutions, are more appropriate.
- Sensitivity to initialization — Although less sensitive than plain networks, ResNet remains sensitive to correct weight initialization and the use of Batch Normalization to stabilize training.
4 Concrete Use Cases
1. Medical Image Classification
ResNets pre-trained on ImageNet are widely used for chest X-ray classification, pulmonary nodule detection, or cancer cell analysis in biopsies. Transfer learning from ResNet-50 or ResNet-101 can reach strong performance with only a few thousand annotated images, since the low-level features (edges, textures, patterns) learned on ImageNet transfer well to many medical imaging tasks.
Example: Early detection of diabetic retinopathy from retinal photographs, where fine-tuned ResNet-50 models have been reported to reach sensitivity above 90%.
2. Industrial Anomaly Detection
In manufacturing, ResNet is integrated into automated visual inspection systems to detect defects on production lines: cracks, deformations, surface imperfections, defective assemblies. A fine-tuned ResNet-34 can classify hundreds of types of defects in real time with accuracy that is difficult to reach with traditional computer vision methods.
3. Backbone for Object Detection
ResNet serves as the reference backbone (feature extraction network) for many object detectors: Faster R-CNN, RetinaNet, and Mask R-CNN are all commonly built on a ResNet feature extractor. The multi-scale features extracted by ResNet’s different stages are well suited to detecting objects of varying sizes in an image.
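The multi-scale nature of these features follows from the cumulative downsampling of the standard ImageNet layout, where the four stages sit at strides 4, 8, 16, and 32 relative to the input. The sketch below assumes a square input whose side is divisible by 32; `backbone_feature_sizes` is a hypothetical helper.

```python
def backbone_feature_sizes(input_size, strides=(4, 8, 16, 32)):
    """Side length of each stage's feature map for a square input.

    The default strides are the cumulative downsampling factors of the four
    stages in the standard ImageNet ResNet layout.
    """
    return {f"stage{i + 1}": input_size // s for i, s in enumerate(strides)}


print(backbone_feature_sizes(224))
```

For a 224×224 image this gives maps of side 56, 28, 14, and 7: the fine maps help with small objects and the coarse ones with large objects.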
4. Text and Time Series Classification
Although designed for vision, ResNet also applies to text processing (by replacing spatial convolutions with 1D convolutions on embeddings) and time series analysis (seismology, finance, IoT). The notion of residual connection is universal: it applies to any deep architecture where information flow must be preserved through many successive transformations.

