CPC: Contrastive Predictive Coding

Contrastive Predictive Coding (CPC) — The Complete Guide

Summary

Contrastive Predictive Coding (CPC) is a self-supervised learning method introduced by Aaron van den Oord and his collaborators at DeepMind in 2018. The core idea is simple: instead of faithfully reconstructing the input data, as classic autoencoders do, CPC learns to predict future samples in a latent space. This lets the model capture the temporal structure of the data without any explicit labels.

CPC applies to many kinds of sequential data: speech, music, text, time series, and even visual data treated as sequences. Its InfoNCE contrastive loss trains the model to pick the true future sample out of a set of distractors, which yields representations that transfer well. These representations can then be used for downstream tasks such as speech recognition, image classification, or anomaly detection in industrial sensor streams.

This guide explores in detail the mathematical principle of CPC, its deep intuition, its complete implementation in PyTorch, as well as its practical applications. Whether you are a deep learning researcher or an engineer wanting to understand the mechanisms behind self-supervised representations, you will find everything you need to master Contrastive Predictive Coding here.


Mathematical Principle of CPC

The operation of Contrastive Predictive Coding relies on three essential components that follow one another in a rigorous sequential pipeline.

Step 1: Encoding in the latent space

CPC encodes the raw data x_t observed at time step t into a compact latent representation z_t:

z_t = g_enc(x_t)

This encoding function g_enc is a nonlinear neural network — typically a CNN (Convolutional Neural Network) for audio signals, or a residual convolutional network for images. The goal of this encoder is to project high-dimensional data into a lower-dimensional latent space while preserving information relevant for future prediction. Unlike reconstruction methods that seek to minimize mean squared error, the CPC encoder is trained indirectly through the global contrastive loss.

Step 2: The autoregressive context model

Once the latent representations z_t are obtained, an autoregressive model — typically a GRU (Gated Recurrent Unit) or a Transformer — progressively computes a context vector c_t:

c_t = g_ar(z_{≤t})

This vector c_t is the compressed summary of all available past up to time t. The GRU maintains a hidden state h_t that evolves at each step:

h_t = GRU(h_{t-1}, z_t)
c_t = W_proj · h_t

The matrix W_proj performs a linear projection to adapt the dimension of the GRU hidden state to that of the latent space. The context c_t must contain enough information to anticipate future representations.

Step 3: Contrastive prediction via InfoNCE

The prediction of the future sample at offset k is evaluated by a weighted dot product:

f_k(z_{t+k}, c_t) = exp(z_{t+k}^T · W_k · c_t)

Here, W_k is a weight matrix specific to the future step k. The exponential ensures that the score function is strictly positive, which is essential for the loss formulation.
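
As a quick illustration of the score function, the sketch below evaluates z^T · W_k · c and its exponential for a toy two-dimensional case in plain Python. The helper name bilinear_score and all numeric values are made up for the example.

```python
import math

def bilinear_score(z_future, w_k, c_t):
    """Return the log-score z^T W_k c and the score f_k = exp(z^T W_k c)."""
    # W_k c: one entry per row of W_k
    wc = [sum(w * c for w, c in zip(row, c_t)) for row in w_k]
    logit = sum(z * v for z, v in zip(z_future, wc))
    return logit, math.exp(logit)

# Toy values, for illustration only
z = [1.0, 2.0]
W = [[0.5, 0.0],
     [0.0, 0.25]]
c = [2.0, 4.0]

logit, score = bilinear_score(z, W, c)
print(logit)   # 3.0
print(score)   # exp(3.0), strictly positive
```

In practice it is the log-score (the exponent) that gets passed to a softmax, since exponentiating first can overflow for large values.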

The InfoNCE loss for each future step k is defined as:

L_k = -E[ log( f_k(z_{t+k}, c_t) / Σ_{j=1}^{N} f_k(z_j, c_t) ) ]

In this expression:
– z_{t+k} is the true future representation at offset k, i.e., the positive sample.
– The z_j (j = 1, ..., N) are the N candidates: the positive sample itself plus N – 1 negative samples drawn from other sequences in the same batch.
– The denominator sums the scores of all N candidates, so the ratio is a normalized probability (a softmax over the candidates).
– Minimizing the loss raises the score of the positive sample relative to the negatives, which amounts to maximizing a lower bound on the mutual information between c_t and z_{t+k}.

The total model loss is the sum of losses over all considered future steps:

L = Σ_{k=1}^{K} L_k

where K is the total number of prediction steps.
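
To make the loss concrete, here is a minimal sketch for a single context vector, written in plain Python. It assumes the candidates are supplied as log-scores (the exponents z_j^T · W_k · c_t) with the positive at index 0; the function name info_nce and the numbers are illustrative.

```python
import math

def info_nce(logits, positive_index=0):
    """InfoNCE for one context: -log softmax(logits)[positive_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_denominator - logits[positive_index]

# All four candidates tie: the loss equals log(N) = log(4)
uninformed = info_nce([0.0, 0.0, 0.0, 0.0])

# The positive is clearly ahead: the loss is close to 0
confident = info_nce([5.0, 0.0, 0.0, 0.0])

# Total loss: sum of the per-step losses over K = 3 prediction steps
per_step_logits = ([2.0, 0.0], [1.0, 0.0], [0.0, 0.0])
total = sum(info_nce(step) for step in per_step_logits)

print(uninformed, confident, total)
```

A model that cannot tell the candidates apart sits at log(N); training pushes the loss from there toward zero.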

Mutual information maximization

The deep goal of CPC is to maximize the mutual information I(c_t ; z_{t+k}) between the current context and future samples. By maximizing this quantity, the model is forced to capture the inherent temporal structure of the data. The learned latent representations then become rich in semantics: for speech, they encode phonemes and prosody; for music, notes and rhythm; for time series, trends and seasonal cycles.

The formal relationship is expressed as:

I(c_t ; z_{t+k}) ≥ log(N) - L_k

where N is the total number of candidates (positive plus negatives). This bound shows that minimizing the InfoNCE loss directly amounts to maximizing a lower bound on mutual information — hence the name InfoNCE.
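
The bound is easy to explore numerically. The sketch below, in plain Python with a hypothetical helper mi_lower_bound, evaluates log(N) - L_k at two reference points: a model at chance level (L_k = log N) and a perfect model (L_k = 0). It illustrates that the bound can never exceed log(N), which is one reason the number of candidates matters.

```python
import math

def mi_lower_bound(loss_k, num_candidates):
    """Lower bound on I(c_t; z_{t+k}) in nats: log(N) - L_k."""
    return math.log(num_candidates) - loss_k

for n in (16, 128, 1024):
    # L_k = log(N): the model scores all candidates equally
    chance = mi_lower_bound(math.log(n), n)
    # L_k = 0: the positive always wins, so the bound reaches its ceiling log(N)
    ceiling = mi_lower_bound(0.0, n)
    print(f"N={n:5d}  bound at chance={chance:.3f}  ceiling={ceiling:.3f} nats")
```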


Deep Intuition

Imagine CPC as a sophisticated guessing game. The model is shown the beginning of a sentence or the first few bars of a piece of music, then presented with several potential continuations — only one is correct, the others are decoys randomly drawn from other sequences. The model must recognize the correct continuation among these candidates.

To succeed at this game consistently, the model cannot settle for memorizing superficial patterns. It must truly understand the underlying structure of the signal: the grammar and syntax of a language, the rhythm and harmony of a melody, the long-term correlations of a time series. And all this without ever having received a single explicit label — no transcription, no score, no annotation of any kind. This is the full power of self-supervised learning.

The fundamental difference with classical reconstruction methods (such as VAEs or denoising autoencoders) is that CPC does not try to reconstruct the input pixel by pixel or sample by sample. Reconstruction spends much of a model's capacity on low-level detail, such as local noise and texture, that says little about the underlying structure of the signal. In contrast, picking the correct future among several distractors pushes the model toward features that are informative and discriminative.

Consider a concrete example: in speech, CPC learns that certain sequences of phonemes are plausible and others are unlikely. It can pick up, for instance, which sounds tend to follow one another within English words and which combinations almost never occur. This phonotactic knowledge emerges from the future prediction task alone, with no linguist annotating these regularities.


Complete Python Implementation

Here is a complete implementation of Contrastive Predictive Coding in PyTorch, structured in a modular way to be easily adaptable to different types of data.

1. The Encoder

The encoder transforms raw data into latent representations. For audio, a 1D CNN is typically used; for images, a 2D CNN with residual layers is more common.

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class CPCEncoder(nn.Module):
    """CPC encoder using 1D convolutions for sequential signals."""

    def __init__(self, input_dim: int = 1, latent_dim: int = 256,
                 num_layers: int = 5, kernel_size: int = 10, stride: int = 5):
        super().__init__()
        self.latent_dim = latent_dim

        # Stacking convolutional layers
        layers = []
        in_channels = input_dim
        for i in range(num_layers):
            out_channels = latent_dim if i == num_layers - 1 else latent_dim // 2
            layers.append(nn.Conv1d(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=kernel_size,
                stride=stride,
                padding=kernel_size // 2
            ))
            layers.append(nn.BatchNorm1d(out_channels))
            layers.append(nn.ReLU(inplace=True))
            in_channels = out_channels

        self.encoder = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) -> (batch, input_dim, seq_len) for Conv1d
        x = x.transpose(1, 2)
        z = self.encoder(x)              # (batch, latent_dim, reduced_seq_len)
        # Back to (batch, reduced_seq_len, latent_dim) for the GRU
        return z.transpose(1, 2).contiguous()
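
The number of latent steps that come out of the encoder determines how many prediction offsets k are usable. The sketch below applies the standard Conv1d output-length formula (dilation 1) to a stack configured like CPCEncoder; the helper names are hypothetical and the code is plain Python.

```python
def conv1d_output_length(length, kernel_size, stride, padding):
    """Output length of one Conv1d layer with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1

def encoder_output_length(length, num_layers=5, kernel_size=10, stride=5):
    """Latent steps produced by a stack that uses padding = kernel_size // 2."""
    for _ in range(num_layers):
        length = conv1d_output_length(length, kernel_size, stride, kernel_size // 2)
    return length

print(encoder_output_length(512))      # 1
print(encoder_output_length(50000))    # 17
print(encoder_output_length(512, num_layers=3, kernel_size=4, stride=2))  # 65
```

With the default five layers of stride 5, a 512-sample input collapses to a single latent step, so inputs need to be tens of thousands of samples long (or the strides smaller) before multi-step prediction makes sense.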

2. The Autoregressive Model (Context Network)

The context network encodes temporal dynamics through a GRU.

class CPCContextNetwork(nn.Module):
    """Autoregressive network (GRU) that computes the context vector c_t."""

    def __init__(self, latent_dim: int = 256, hidden_dim: int = 256,
                 num_layers: int = 2):
        super().__init__()
        self.gru = nn.GRU(
            input_size=latent_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.1 if num_layers > 1 else 0.0
        )
        self.projection = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq_len, latent_dim)
        out, _ = self.gru(z)
        # out: (batch, seq_len, hidden_dim)
        c = self.projection(out)
        # c: (batch, seq_len, latent_dim)
        return c

3. The InfoNCE Loss

This is the heart of CPC: the contrastive loss function that distinguishes positive samples from negatives.

def compute_infonce_loss(context: torch.Tensor,
                         latent_future: torch.Tensor,
                         weight_matrices: nn.ParameterList,
                         num_negatives: int = 127):
    """
    Computes the CPC InfoNCE loss.

    Arguments:
        context: (batch, seq_len, latent_dim), context vectors c_t
        latent_future: (batch, seq_len, latent_dim), encoder outputs z_t from
            which the targets z_{t+k} are sliced
        weight_matrices: list of W_k matrices, one per prediction step k
        num_negatives: total number of candidates N (1 positive + N-1 negatives)

    Returns:
        total_loss: sum of losses over all usable future steps
        individual_losses: list of losses per step k
    """
    batch_size, seq_len, latent_dim = context.shape
    total_loss = context.new_zeros(())
    individual_losses = []
    n_neg = num_negatives - 1

    for k, w_k in enumerate(weight_matrices, start=1):
        num_steps = seq_len - k
        if num_steps <= 0:
            break  # sequence too short to predict k steps ahead

        # Pair each context c_t with its true future z_{t+k}
        c = context[:, :num_steps, :]            # (batch, num_steps, latent_dim)
        z_pos = latent_future[:, k:seq_len, :]   # (batch, num_steps, latent_dim)

        # Positive logit z_pos^T . W_k . c (the exp is applied by the softmax)
        pos_logit = torch.sum(torch.matmul(z_pos, w_k) * c, dim=-1, keepdim=True)

        # Negatives: random latents from the whole (batch, time) pool.
        # A draw may occasionally coincide with the positive itself.
        pool = z_pos.reshape(-1, latent_dim)     # reshape: the slice is not contiguous
        neg_indices = torch.randint(0, pool.shape[0],
                                    (batch_size, num_steps, n_neg),
                                    device=context.device)
        z_neg = pool[neg_indices]                # (batch, num_steps, n_neg, latent_dim)
        neg_logit = torch.sum(torch.matmul(z_neg, w_k) * c.unsqueeze(2), dim=-1)

        # Softmax over the N candidates; the positive sits at index 0
        logits = torch.cat([pos_logit, neg_logit], dim=-1)
        loss_k = -F.log_softmax(logits, dim=-1)[..., 0].mean()

        total_loss = total_loss + loss_k
        individual_losses.append(loss_k.item())

    return total_loss, individual_losses
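
The loss draws its negatives from a flattened (batch, time) pool of latent vectors. As a side note, here is one way to index such a pool so that a draw can never return the positive itself. This is a plain-Python sketch under assumed sizes; the helper sample_negative_indices is hypothetical and is not part of the function above.

```python
import random

def sample_negative_indices(positive_index, pool_size, num_negatives, rng=random):
    """Draw `num_negatives` indices from range(pool_size), never the positive."""
    negatives = []
    for _ in range(num_negatives):
        # Draw from pool_size - 1 slots, then skip over the positive's slot
        j = rng.randrange(pool_size - 1)
        negatives.append(j + 1 if j >= positive_index else j)
    return negatives

batch_size, steps = 8, 20                  # assumed sizes for the example
pool_size = batch_size * steps             # flattened (batch, time) pool
positive_index = 3 * steps + 5             # sequence 3, time step 5
neg = sample_negative_indices(positive_index, pool_size, num_negatives=15)
print(neg)
```

Sampling with replacement keeps the code short; whether to restrict negatives to other sequences or allow the same sequence is a design choice that changes how hard the task is.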

4. The Complete CPC Model

Here is the complete model that assembles all components:

class CPCModel(nn.Module):
    """Complete CPC model: encoder + autoregressive model + prediction heads."""

    def __init__(self, input_dim: int = 1, latent_dim: int = 256,
                 context_hidden: int = 256, num_layers_gru: int = 2,
                 num_prediction_steps: int = 12, num_negatives: int = 127):
        super().__init__()
        self.encoder = CPCEncoder(input_dim, latent_dim)
        self.context_net = CPCContextNetwork(latent_dim, context_hidden,
                                              num_layers_gru)
        self.num_prediction_steps = num_prediction_steps
        self.num_negatives = num_negatives

        # One W_k matrix per future prediction step
        self.prediction_weights = nn.ParameterList([
            nn.Parameter(torch.randn(latent_dim, latent_dim) * 0.01)
            for _ in range(num_prediction_steps)
        ])

    def forward(self, x: torch.Tensor) -> tuple:
        """
        x: (batch, seq_len, input_dim)
        Returns (total CPC loss, list of per-step losses).
        """
        z = self.encoder(x)
        c = self.context_net(z)

        loss, per_step = compute_infonce_loss(
            c, z, self.prediction_weights, self.num_negatives
        )
        return loss, per_step

5. Training Loop

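def train_cpc(model: CPCModel, dataloader, num_epochs: int = 50,
              lr: float = 1e-3, device: str = "cuda"):
    """Trains the CPC model with the InfoNCE loss."""
    # Fall back to CPU when CUDA is requested but unavailable
    if device == "cuda" and not torch.cuda.is_available():
        device = "cpu"
    model = model.to(device)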
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0.0
        num_batches = 0

        for batch in dataloader:
            x = batch.to(device)  # (batch, seq_len, input_dim)
            loss, per_step = model(x)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            epoch_loss += loss.item()
            num_batches += 1

        scheduler.step()
        avg_loss = epoch_loss / max(num_batches, 1)
        print(f"[Epoch {epoch+1}/{num_epochs}] Mean CPC loss: {avg_loss:.4f}")

    return model

6. Usage Example on a Simulated Audio Signal

if __name__ == "__main__":
    # Parameters
    BATCH_SIZE = 64
    SEQ_LEN = 512
    INPUT_DIM = 1  # Mono signal
    LATENT_DIM = 128
    NUM_EPOCHS = 30

    # Simulated data: colored noise with temporal structure
    num_samples = 5000
    raw_data = np.zeros((num_samples, SEQ_LEN, INPUT_DIM), dtype=np.float32)
    for i in range(num_samples):
        signal = np.cumsum(np.random.randn(SEQ_LEN))  # Random walk
        signal = (signal - signal.mean()) / (signal.std() + 1e-8)
        raw_data[i, :, 0] = signal

    dataset = torch.tensor(raw_data)
    dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=BATCH_SIZE, shuffle=True
    )

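    # Create the model
    model = CPCModel(
        input_dim=INPUT_DIM,
        latent_dim=LATENT_DIM,
        context_hidden=LATENT_DIM,
        num_layers_gru=2,
        num_prediction_steps=12,
        num_negatives=64
    )
    # The default encoder (5 layers, stride 5) reduces 512 samples to a single
    # latent step, which leaves nothing to predict. Use a lighter stack here:
    # 3 layers with stride 2 give 65 latent steps for SEQ_LEN = 512.
    model.encoder = CPCEncoder(INPUT_DIM, LATENT_DIM,
                               num_layers=3, kernel_size=4, stride=2)

    # Training (uses the GPU when available, otherwise the CPU)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    trained_model = train_cpc(model, dataloader, num_epochs=NUM_EPOCHS, device=device)
    print("\nCPC training completed.")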

Key Hyperparameters

Hyperparameter tuning is crucial for CPC performance. Here are the four most important parameters and their recommended values:

Hyperparameter | Description | Typical value | Impact
num_negatives | Total number of candidates in the InfoNCE loss (1 positive + N-1 negatives) | 128–256 | More negatives improve the mutual information bound but increase memory cost
context_size | Dimension of the GRU hidden state (dimension of context c_t) | 256–512 | A larger context captures longer-range dependencies
prediction_steps | Number of future steps K to predict simultaneously | 8–24 | More steps capture long-term dynamics
encoder_type | Architecture of the encoder g_enc (CNN, WaveNet, ResNet) | CNN for audio, ResNet for vision | Determines the granularity of latent representations

Additional recommendations

  • Learning rate: start at 1×10⁻³ with a cosine annealing scheduler and tune from there. CPC tends to be fairly tolerant of the learning rate in practice, but this is an empirical observation rather than a guarantee.
  • Latent dimension: 128 is a good starting point for audio, 256 for images. Higher dimensions do not always provide a net benefit.
  • Batch size: the larger the batch, the better for negative generation. Aim for at least 64 samples per batch.
  • Weight decay: 1×10⁻⁴ to 1×10⁻⁵ to prevent overfitting.
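
For reference, the cosine annealing schedule mentioned in the first recommendation follows a simple closed form. The sketch below reproduces that formula in plain Python, assuming a minimum learning rate of 0 and one scheduler step per epoch; the function name is illustrative.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=50, eta_min=0.0):
    """Closed-form cosine schedule: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 25, 40, 50):
    print(f"epoch {epoch:2d}: lr = {cosine_annealing_lr(epoch):.6f}")
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly to the minimum at the final epoch.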

Choice of autoregressive model

The GRU is the default choice of the original CPC and remains a solid option for most applications. A Transformer with causal attention can capture longer dependencies, at a higher computational cost. For very long sequences (several thousand steps), the Transformer is often the better choice, especially when combined with linear attention or a sliding attention window.
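
What makes a Transformer usable as the context model is the causal mask: position t may only attend to positions at or before t, so c_t never sees the future it is asked to predict. The sketch below builds such a mask in plain Python, with an optional sliding window; the function name and the boolean-list representation are choices made for the example.

```python
def causal_mask(seq_len, window=None):
    """mask[t][s] is True when position t may attend to position s.

    window=None gives full causal attention; window=w keeps only the
    w most recent positions, including t itself.
    """
    mask = []
    for t in range(seq_len):
        row = []
        for s in range(seq_len):
            visible = s <= t and (window is None or t - s < window)
            row.append(visible)
        mask.append(row)
    return mask

full = causal_mask(4)
local = causal_mask(4, window=2)
for row in local:
    print("".join("x" if v else "." for v in row))
```

The sliding-window variant bounds the attention cost per position by the window size, at the price of a shorter visible history.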


Advantages and Limitations of CPC

Advantages

  1. No labels needed: CPC is entirely self-supervised. It derives its supervision from the intrinsic temporal structure of the data itself. This is a considerable advantage when annotations are rare, expensive, or simply nonexistent.
  2. Rich and transferable representations: Representations learned by CPC transfer well to downstream tasks. In the original paper, a linear classifier on frozen CPC features clearly outperformed MFCC features on phone and speaker classification and came close to, without matching, a fully supervised model trained end to end.
  3. Architectural modularity: The encoder can be adapted to almost any type of data — CNN for audio, ResNet for images, embeddings for text. The autoregressive model and the InfoNCE loss remain unchanged.
  4. Computational efficiency: Compared to generative methods such as VAEs or GANs, CPC is relatively lightweight to train because it does not require a complex decoder or a sample generation phase.
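  5. Solid theoretical foundation: Minimizing the InfoNCE loss maximizes a lower bound on the mutual information between context and future, which gives CPC a clearer theoretical motivation than many other self-supervised methods. The bound is capped at log(N), so it is only as tight as the number of candidates allows.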

Limitations

  1. Dependence on sequential structure: CPC implicitly assumes that there is an exploitable temporal structure in the data. For i.i.d. (independent and identically distributed) data without a meaningful temporal order, CPC offers no advantage over other contrastive methods.
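  2. Sensitivity to batch bias: Negative generation depends on the composition of the training batch. If the negatives are too easy to tell apart from the positive (for example, drawn from different speakers or recording conditions), the model can solve the task with superficial cues; if they are near-duplicates of the positive, they act as false negatives. Either case weakens the learning signal.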
  3. Memory cost for large batches: The need for many negative samples implies large batch sizes, which can become prohibitive in GPU memory, especially for long sequences.
  4. Difficulty with very long dependencies: Although the autoregressive model can theoretically model arbitrarily long dependencies, practical GRUs have limited effective memory. Distant prediction steps (high k) are therefore often less well learned.
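
The memory limitation can be estimated with simple arithmetic. The sketch below assumes the negatives are gathered into an explicit float32 tensor of shape (batch, steps, n_neg, latent_dim), as the loss function in this guide does, and counts only that one tensor for a single prediction step k; the helper name and the sizes are illustrative.

```python
def negatives_tensor_bytes(batch_size, steps, n_neg, latent_dim, bytes_per_value=4):
    """Size in bytes of one (batch, steps, n_neg, latent_dim) tensor."""
    return batch_size * steps * n_neg * latent_dim * bytes_per_value

size = negatives_tensor_bytes(batch_size=64, steps=64, n_neg=127, latent_dim=256)
print(size / 2**20, "MiB")   # 508.0 MiB
```

Activations kept for the backward pass and the other prediction steps come on top of this, which is why the number of negatives and the sequence length have to be traded off against each other.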

Four Concrete Use Cases

Use Case 1: Automatic Speech Recognition (ASR)

This is the flagship application of CPC. By pre-training a CPC encoder on thousands of hours of untranscribed audio, representations are obtained that capture phonemes, syllables, and even certain syntactic aspects. A lightweight classifier (e.g., a small linear network) grafted onto these representations achieves remarkable speech recognition performance, even with very little labeled data for fine-tuning.

Concrete example: pre-train CPC on Mozilla’s Common Voice corpus (thousands of hours of openly licensed speech in many languages), then fine-tune with only 100 hours of transcription for a new dialect. In low-resource settings like this, pre-training often gives better results than supervised training from scratch on those 100 hours alone.

Use Case 2: Music Genre Classification

Music has a rich temporal structure — melody, harmony, rhythm, timbre — that CPC is particularly well suited to capture. By applying CPC to musical spectrograms, the model learns to distinguish the characteristic patterns of each musical genre without knowing the genre labels during training.

Concrete example: encode 30-second music excerpts with a pre-trained CPC, then use these representations to feed a k-NN classifier or a random forest. This approach can reach accuracies in the range of 85–90% on multi-genre classification tasks, depending on the dataset and the number of genres, and often surpasses methods based on hand-crafted features such as MFCCs.
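
The classification step of this pipeline can be sketched in a few lines. The example below assumes the CPC encoder has already produced one vector per frame, averages the frames of a clip into a single vector, and classifies it with a small k-NN. It is plain Python, and the two-dimensional vectors and genre labels are invented for the illustration.

```python
import math
from collections import Counter

def mean_pool(frames):
    """Average a sequence of frame vectors into one clip-level vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority label among the k nearest training vectors (Euclidean distance)."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: math.dist(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Invented clip-level embeddings and labels
train_vectors = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.0]]
train_labels = ["jazz", "jazz", "rock", "rock"]

clip_frames = [[0.0, 0.2], [0.2, 0.0]]      # frame-level features of a new clip
clip_vector = mean_pool(clip_frames)
prediction = knn_predict(clip_vector, train_vectors, train_labels, k=3)
print(prediction)   # jazz
```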

Use Case 3: Anomaly Detection in Industrial Time Series

In industrial environments, sensors produce continuous streams of data (temperature, pressure, vibration, current). CPC can learn the normal dynamics of the system during a pre-training phase. During deployment, any significant deviation between the context prediction and the actual observation signals a potential anomaly.

Concrete example: on the Numenta Anomaly Benchmark (NAB) dataset, a CPC-based detector can flag anomalies earlier than classical statistical methods (CUSUM, EWMA) and with fewer false alarms, although the margin depends on the stream and on how the detection threshold is calibrated.
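
One possible way to turn the deviation between prediction and observation into an alarm is sketched below. It assumes the model supplies a predicted latent vector and the observed latent vector for each time step, scores their mismatch with a cosine distance, and calibrates a threshold (mean plus three standard deviations) on normal data. The scoring rule, the threshold rule, and all vectors are assumptions made for the example, in plain Python.

```python
import math
import statistics

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the prediction points the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm + 1e-12)

def fit_threshold(normal_scores, num_std=3.0):
    """Threshold = mean + num_std * std of scores measured on normal data."""
    return statistics.mean(normal_scores) + num_std * statistics.pstdev(normal_scores)

# Hypothetical (prediction, observation) pairs recorded during normal operation
normal_pairs = [([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
                ([0.0, 1.0, 0.0], [0.1, 0.9, 0.1]),
                ([0.0, 0.0, 1.0], [0.0, 0.1, 0.9])]
threshold = fit_threshold([cosine_distance(p, z) for p, z in normal_pairs])

# At deployment: the observation is far from what the context predicted
predicted, observed = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
is_anomaly = cosine_distance(predicted, observed) > threshold
print(is_anomaly)   # True
```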

Use Case 4: Self-Supervised Visual Representations

Although CPC was designed for sequences, it also applies to images by treating them as sequences of patches. The image is divided into a grid of regions (e.g., 7×7), each region is encoded with a small CNN, and the autoregressive model processes these patches in a scanning order (left to right, top to bottom). Predicting future patches forces the model to learn rich visual features.

Concrete example: CPC-v2, trained on ImageNet without labels, produces representations that, when fine-tuned with only a small fraction of the labels, outperform supervised networks trained on the same limited labels. The InfoNCE objective introduced with CPC was later adopted, in modified forms, by contrastive methods such as MoCo and SimCLR.
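
The patch grid itself is straightforward to compute. The sketch below lists the top-left corners of overlapping square patches in raster order (left to right, top to bottom). The image size, patch size, and stride are assumed values chosen so that the grid comes out as 7×7, and the function name is illustrative.

```python
def patch_grid(image_size=256, patch_size=64, stride=32):
    """Top-left corners of overlapping square patches, in raster order."""
    per_side = (image_size - patch_size) // stride + 1
    corners = [(row * stride, col * stride)
               for row in range(per_side)
               for col in range(per_side)]
    return per_side, corners

per_side, corners = patch_grid()
print(per_side, len(corners))      # 7 49
print(corners[0], corners[1], corners[-1])
```

Each corner identifies one region to pass through the patch encoder; the resulting 49 vectors form the sequence (or grid) over which the context model predicts.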


See Also