VAE (Variational Autoencoder): Complete Guide


Summary — The VAE (Variational Autoencoder) is a generative model introduced by Kingma and Welling in 2013. Unlike the classical autoencoder which learns a deterministic representation, the VAE learns a probability distribution in the latent space. This probabilistic approach makes it possible not only to compress data but also to generate new data by sampling in the latent space. The VAE is one of the pillars of deep generative learning, alongside GANs and diffusion models.


Mathematical Principle

1. From the deterministic autoencoder to the probabilistic VAE

A classical autoencoder learns two functions:
Encoder: h = f_φ(x) → a fixed point in the latent space
Decoder: r = g_θ(h) → reconstruction

The problem: this latent space is discontinuous and unstructured. Two close points in the latent space can decode into completely different images. You cannot reliably generate new data.

The VAE solves this problem by making the encoder a distribution generator: instead of producing a single vector h, it produces a Gaussian distribution.

2. The variational encoder

The VAE encoder produces two vectors for each input x:
μ(x): mean of the latent distribution
σ(x): standard deviation of the latent distribution

The latent representation is then a random sample:

$$z \sim \mathcal{N}(\mu(x), \sigma^2(x) \cdot I)$$

This means that z is no longer a fixed point, but a point drawn at random from a Gaussian centered on μ(x), with a spread in each dimension given by σ(x).

3. Reparameterization trick

The problem is that drawing z ~ N(μ, σ²I) directly is a stochastic operation with no well-defined gradient with respect to μ and σ, so backpropagation cannot pass through it to reach the encoder.

The solution is the reparameterization trick:

$$z = \mu(x) + \sigma(x) \odot \varepsilon \quad \text{where} \quad \varepsilon \sim \mathcal{N}(0, I)$$

Now, the randomness is isolated in ε (which does not depend on the model parameters), and the transformation z = μ + σ·ε is differentiable with respect to μ and σ. The gradient can therefore flow through.
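
The trick can be checked numerically. The following is a minimal NumPy sketch, not the training code itself: the helper name `reparameterize` and the values of `mu` and `sigma` are illustrative assumptions. It verifies that the samples follow N(μ, σ²) and that, for a fixed ε, z varies smoothly with μ and σ.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """Deterministic map from noise eps ~ N(0, I) to z ~ N(mu, sigma^2)."""
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])     # illustrative encoder outputs (assumed values)
sigma = np.array([0.5, 2.0])

# Many noise draws: the resulting z should have mean mu and std sigma.
eps = rng.standard_normal(size=(20000, 2))
samples = reparameterize(mu, sigma, eps)
print(samples.mean(axis=0))  # close to mu
print(samples.std(axis=0))   # close to sigma

# For one fixed noise draw, z is differentiable in mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps (checked here with finite differences).
eps0 = rng.standard_normal(size=2)
h = 1e-6
dz_dmu = (reparameterize(mu + h, sigma, eps0) - reparameterize(mu, sigma, eps0)) / h
dz_dsigma = (reparameterize(mu, sigma + h, eps0) - reparameterize(mu, sigma, eps0)) / h
```

The finite-difference values match the analytic derivatives, which is what lets an automatic differentiation framework send gradients back to μ and σ.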

4. Loss function: ELBO

The VAE maximizes the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,\|\, p(z))$$

This formula breaks down into two terms:

Reconstruction term: E[log p_θ(x|z)]
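This measures the fidelity of the reconstruction: the better the decoder can reconstruct x from z, the larger this term. In practice its negative is implemented as a reconstruction loss: binary cross-entropy (which corresponds to a Bernoulli likelihood) or MSE (which corresponds to a Gaussian likelihood with fixed variance).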

KL divergence term: KL(q_φ(z|x) || p(z))
This is a regularization that forces the latent distribution q(z|x) to be close to a prior distribution p(z) = N(0, I).

When q_φ(z|x) is a diagonal Gaussian N(μ, diag(σ²)) and the prior is N(0, I), the KL divergence has a closed form (d is the latent dimension):

$$\text{KL} = \frac{1}{2} \sum_{j=1}^{d} (\sigma_j^2 + \mu_j^2 - \log(\sigma_j^2) - 1)$$
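
As a sanity check, the closed form can be compared against a Monte Carlo estimate of E_q[log q(z) - log p(z)]. This is a sketch under assumed values: the function name `kl_to_standard_normal` and the particular `mu` and `sigma` are illustrative, not part of the original guide.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    var = sigma ** 2
    return 0.5 * np.sum(var + mu ** 2 - np.log(var) - 1.0)

mu = np.array([0.5, -1.0])      # assumed example values
sigma = np.array([0.8, 1.5])
kl_closed = kl_to_standard_normal(mu, sigma)

# Monte Carlo estimate: average of log q(z) - log p(z) over samples z ~ q
rng = np.random.default_rng(0)
z = mu + sigma * rng.standard_normal(size=(200000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates should agree closely

# When q equals the prior (mu = 0, sigma = 1), the divergence vanishes.
kl_at_prior = kl_to_standard_normal(np.zeros(2), np.ones(2))
```

The closed form is exact and costs almost nothing to compute, which is why implementations use it instead of sampling.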

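Since training frameworks minimize rather than maximize, the loss is the negative ELBO, with the reconstruction term expressed as an error (a negative log-likelihood):

$$\text{Loss} = \text{Reconstruction} + \beta \cdot \text{KL}$$

where β is a hyperparameter that weights the KL term. β = 1 gives the standard VAE (the exact negative ELBO); other values give the β-VAE variant.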
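
The loss can be written out in a few lines of NumPy. This is a minimal sketch, assuming a Bernoulli (binary cross-entropy) reconstruction term summed over pixels and an encoder that outputs the log-variance, as the Keras example in this guide does. The function name `vae_loss` and the random toy inputs are illustrative assumptions.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0, eps=1e-7):
    """Return (total, reconstruction, kl), each averaged over the batch.

    x, x_recon : arrays of shape (batch, n_pixels), values in [0, 1]
    mu, log_var: arrays of shape (batch, latent_dim)
    """
    x_recon = np.clip(x_recon, eps, 1.0 - eps)  # avoid log(0)
    # Binary cross-entropy summed over pixels (negative Bernoulli log-likelihood)
    bce = -np.sum(x * np.log(x_recon) + (1.0 - x) * np.log(1.0 - x_recon), axis=1)
    # Closed-form KL to N(0, I), with sigma^2 = exp(log_var)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - log_var - 1.0, axis=1)
    return np.mean(bce + beta * kl), np.mean(bce), np.mean(kl)

# Toy batch standing in for real data and real network outputs
rng = np.random.default_rng(0)
x = (rng.random((4, 784)) > 0.5).astype(float)
x_recon = rng.random((4, 784))
mu = rng.normal(size=(4, 2))
log_var = rng.normal(size=(4, 2))

total, rec, kl = vae_loss(x, x_recon, mu, log_var, beta=1.0)
total_b4, _, _ = vae_loss(x, x_recon, mu, log_var, beta=4.0)
print(total, rec, kl, total_b4)
```

Raising β leaves the reconstruction term unchanged and only scales the KL penalty, so the trade-off between fidelity and regularization is controlled by a single number.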

5. Generating new data

Once trained, you can generate new data by sampling directly from the prior distribution:

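$$z_{new} \sim \mathcal{N}(0, I) \quad \rightarrow \quad x_{new} = g_\theta(z_{new})$$

Here g_θ denotes the decoder network, which outputs the mean of p_θ(x|z). Because the KL term has pushed the encoder's distributions toward N(0, I), the decoder has seen samples from most of this region during training and can decode points drawn from the prior into plausible data.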


Intuition

Imagine a cartographer who needs to draw a map of a continent.

The classical autoencoder places each city at a precise point on the map. If you ask “what is 0.1 km northeast of Lyon?” it has no answer — it only learned discrete points. The map has gaps between known cities.

The VAE, on the other hand, does not place a point but draws a circle of uncertainty around each city. The circles overlap, creating a continuous map where every point on the map corresponds to something plausible. If you point to a location between Lyon and Saint-Étienne, the VAE can describe a plausible city that could exist there.

It is the difference between a phone book (autoencoder: one fixed entry per person) and a road map (VAE: a continuous space where every point has meaning).


Python Implementation

Example 1: Basic VAE with Keras

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models

# Parameters
latent_dim = 2
input_shape = (28, 28, 1)  # MNIST

# Encoder
inputs = keras.Input(shape=input_shape)
x = layers.Flatten()(inputs)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dense(128, activation='relu')(x)
z_mean = layers.Dense(latent_dim, name='z_mean')(x)
z_log_var = layers.Dense(latent_dim, name='z_log_var')(x)

# Reparameterization trick: z = mean + std * epsilon, with epsilon ~ N(0, I)
class Sampling(layers.Layer):
    def call(self, inputs):
        z_mean, z_log_var = inputs
        # Noise with the same (batch, latent_dim) shape as z_mean
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        # exp(0.5 * log_var) converts the log-variance into a standard deviation
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

z = Sampling()([z_mean, z_log_var])
encoder = models.Model(inputs, [z_mean, z_log_var, z], name='encoder')

# Decoder
latent_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(128, activation='relu')(latent_inputs)
x = layers.Dense(256, activation='relu')(x)
outputs = layers.Dense(784, activation='sigmoid')(x)
outputs = layers.Reshape((28, 28, 1))(outputs)
decoder = models.Model(latent_inputs, outputs, name='decoder')

# Full VAE model
outputs_comb = decoder(z)
vae = models.Model(inputs, outputs_comb, name='vae')

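# Custom loss: summed binary cross-entropy + KL divergence, averaged over the batch.
# Note: calling add_loss on symbolic tensors as done below relies on the Keras 2
# behaviour of tf.keras (TensorFlow 2.x releases before 2.16). Under Keras 3 this
# pattern is generally replaced by a custom train_step or by calling add_loss
# inside a layer's call().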
reconstruction_loss = tf.reduce_mean(
    keras.losses.binary_crossentropy(
        tf.reshape(inputs, [-1, 784]),
        tf.reshape(outputs_comb, [-1, 784])
    )
) * 784
kl_loss = -0.5 * tf.reduce_mean(
    tf.reduce_sum(
        1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
        axis=1
    )
)
total_loss = reconstruction_loss + kl_loss
vae.add_loss(total_loss)

# Compilation
vae.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3))
vae.summary()

Example 2: Training on MNIST

# Loading data
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1).astype('float32') / 255.0
x_test = np.expand_dims(x_test, -1).astype('float32') / 255.0

# Training
history = vae.fit(
    x_train,
    epochs=20,
    batch_size=128,
    validation_data=(x_test, None)
)

# Image generation
# Sampling in the latent space
n_samples = 10
z_samples = np.random.normal(size=(n_samples, latent_dim))
generated_imgs = decoder.predict(z_samples)

# Display generated images
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, n_samples, figsize=(15, 3))
for i, ax in enumerate(axes):
    ax.imshow(generated_imgs[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.suptitle('Images generated by the VAE')
plt.tight_layout()
plt.savefig('vae_generated.png', dpi=150)

Example 3: 2D Latent Space Visualization

# Encode the entire test set
encoded = encoder.predict(x_test)
z_means = encoded[0]  # (10000, 2)

fig, ax = plt.subplots(figsize=(10, 10))
# Load the test labels for coloring.
# load_data() returns (x_train, y_train), (x_test, y_test).
_, (_, y_test) = keras.datasets.mnist.load_data()
scatter = ax.scatter(z_means[:, 0], z_means[:, 1], c=y_test,
    cmap='tab10', alpha=0.6, s=5)
plt.colorbar(scatter, ticks=range(10), label='Digit')
ax.set_title('VAE Latent Space - MNIST (2D)')
ax.set_xlabel('z1')
ax.set_ylabel('z2')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('vae_latent_space.png', dpi=150)

Example 4: Interpolation in Latent Space

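# Take two encoded images and interpolate between them
z1 = z_means[0]  # First latent point, shape (latent_dim,)
z2 = z_means[1]  # Second latent point, shape (latent_dim,)

n_steps = 20
# Linear interpolation; the result has shape (n_steps, latent_dim)
interp_z = np.linspace(z1, z2, n_steps)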
interp_imgs = decoder.predict(interp_z)

# Display the interpolation
fig, axes = plt.subplots(2, 10, figsize=(15, 4))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(interp_imgs[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.suptitle('Interpolation in the VAE latent space')
plt.tight_layout()
plt.savefig('vae_interpolation.png', dpi=150)

Hyperparameters

| Hyperparameter | Typical value | Description |
|---|---|---|
| latent_dim | 2-512 | Dimension of the latent space. 2 for visualization, 32-128 for generation |
| beta (β) | 0.1-4 | Weight of the KL divergence. Larger = more regularized latent space but less accurate reconstruction |
| architecture | Dense or Conv | Dense for flat data, Conv2D for images |
| optimizer | Adam | Adam is the standard choice. lr = 1e-3 |
| epochs | 50-200 | More epochs = better reconstruction, but risk of overfitting |
| batch_size | 64-256 | Standard for stable convergence |

Advantages of VAE

  1. Generating new data: The VAE can create realistic samples by sampling in the latent space. It is a generative model, not just a compressor.
  2. Continuous and structured latent space: Unlike the classical autoencoder, the VAE’s latent space is continuous: you can navigate, interpolate, and explore meaningfully.
  3. Analytical loss function: The KL divergence between Gaussians has a closed form, making training stable and fast, without the unstable adversarial terms of GANs.
  4. Interpretability of the latent space: With a latent_dim = 2, you can directly visualize the latent space. For higher dimensions, correlation analyses reveal which dimensions encode specific features.
  5. Varied applications: Beyond image generation, VAEs are used for anomaly detection, drug discovery, and image compression.

Limitations of VAE

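  1. Blurry generated images: Pixel-wise reconstruction losses such as MSE reward the average of all plausible outputs, which yields smooth, averaged images rather than sharp ones. VAE generations are often blurrier than GAN outputs.
  2. Posterior collapse: Sometimes the decoder learns to ignore the latent code z. The encoder's distribution q(z|x) then collapses onto the prior and the KL term drops to zero. The VAE is reduced to a decoder that generates images without using any information from the input.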
  3. Choice of prior distribution: The choice of N(0, I) as a prior distribution is convenient but can be too restrictive for complex data with multimodal structures.
  4. Difficult to evaluate: Unlike a classifier where accuracy is clear, evaluating the quality of a VAE’s generations requires complex metrics like Inception Score or FID.
  5. Less performant than GANs in visual quality: GANs produce sharper and more realistic images, at the cost of more unstable training.
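
Posterior collapse (limitation 2) can be monitored by looking at the KL contribution of each latent dimension. This is a diagnostic sketch, not a standard API: the helper names and the 0.01 cutoff are assumptions chosen for illustration, and the synthetic arrays stand in for real encoder outputs.

```python
import numpy as np

def kl_per_dimension(mu, log_var):
    """Average KL contribution of each latent dimension over a batch."""
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - log_var - 1.0)  # (batch, latent_dim)
    return kl.mean(axis=0)

def collapsed_dimensions(mu, log_var, threshold=0.01):
    """Indices of dimensions whose average KL is below `threshold` (assumed cutoff)."""
    return np.where(kl_per_dimension(mu, log_var) < threshold)[0]

# Synthetic encoder outputs: dimension 0 carries information,
# dimension 1 has collapsed onto the prior (mean ~ 0, variance ~ 1).
rng = np.random.default_rng(0)
mu = np.column_stack([rng.normal(0.0, 1.5, 500), rng.normal(0.0, 0.01, 500)])
log_var = np.column_stack([np.full(500, -2.0), np.zeros(500)])

per_dim = kl_per_dimension(mu, log_var)
collapsed = collapsed_dimensions(mu, log_var)
print(per_dim, collapsed)
```

With the encoder from the implementation section, `z_mean, z_log_var, _ = encoder.predict(x_test)` would supply the inputs. If every dimension is flagged, the decoder is likely ignoring the latent code.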

4 Concrete Use Cases

1. Human Face Generation

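VAEs trained on face datasets (CelebA, FFHQ) learn a structured latent space in which some directions tend to correlate with attributes such as smile, orientation, age, or lighting. A plain VAE does not guarantee one attribute per dimension (variants such as β-VAE encourage it), but moving along these directions lets you generate faces with partially controllable attributes.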
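
A common way to explore such a space is a latent traversal: keep one latent vector fixed and sweep a single coordinate. This is a minimal sketch with a hypothetical helper `latent_traversal`; the starting vector and the sweep range are assumed values.

```python
import numpy as np

def latent_traversal(z, dim, values):
    """Return copies of z in which coordinate `dim` takes each value in `values`."""
    z = np.asarray(z, dtype=float)
    grid = np.repeat(z[None, :], len(values), axis=0)  # (len(values), latent_dim)
    grid[:, dim] = values
    return grid

z = np.array([0.3, -1.2])            # assumed starting point in a 2D latent space
values = np.linspace(-3.0, 3.0, 7)   # sweep roughly +/- 3 standard deviations of the prior
grid = latent_traversal(z, dim=0, values=values)
print(grid)
# The rows can then be decoded, for example with decoder.predict(grid)
# using the decoder defined in the implementation section.
```

Looking at the decoded rows side by side shows which visual attribute, if any, a given coordinate controls.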

2. Anomaly Detection by Reconstruction

A VAE trained only on normal data learns to reconstruct normal patterns well. When an anomaly (a machine failure, a fraud, an abnormal medical image) is presented, the reconstruction error is high — signaling the anomaly.
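
The scoring step can be sketched as follows. The reconstructions here are synthetic stand-ins (inputs plus noise) rather than outputs of a trained VAE. The helper names and the 99th-percentile threshold are assumptions for illustration, and a real threshold would need tuning on held-out normal data.

```python
import numpy as np

def reconstruction_error(x, x_recon):
    """Per-sample mean squared error between inputs and reconstructions."""
    diff = x.reshape(len(x), -1) - x_recon.reshape(len(x), -1)
    return np.mean(diff ** 2, axis=1)

def fit_threshold(errors_normal, quantile=0.99):
    """Threshold chosen as a high quantile of the errors on normal data."""
    return np.quantile(errors_normal, quantile)

def flag_anomalies(x, x_recon, threshold):
    """Boolean mask: True where the reconstruction error exceeds the threshold."""
    return reconstruction_error(x, x_recon) > threshold

rng = np.random.default_rng(0)
# Stand-in for normal data: reconstructions are close to the inputs.
x_normal = rng.random((1000, 784))
recon_normal = x_normal + rng.normal(0.0, 0.05, x_normal.shape)
# Stand-in for anomalies: reconstructions are far from the inputs.
x_anom = rng.random((20, 784))
recon_anom = x_anom + rng.normal(0.0, 0.3, x_anom.shape)

threshold = fit_threshold(reconstruction_error(x_normal, recon_normal))
flags_anom = flag_anomalies(x_anom, recon_anom, threshold)
flags_normal = flag_anomalies(x_normal, recon_normal, threshold)
print(flags_anom.mean(), flags_normal.mean())
```

In a real pipeline, `x_recon` would come from passing `x` through the trained VAE, and the quantile sets the false-positive rate accepted on normal data.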

3. Drug Discovery (Molecule Design)

In computational chemistry, VAEs encode molecular representations (SMILES strings) in a continuous latent space. You can then navigate this space to find new molecules with desired properties (solubility, efficacy, reduced toxicity).

4. Image Compression with Generation

Although less efficient than JPEG for pure compression, a VAE offers a form of “intelligent” compression: instead of storing pixels, you store the latent code (typically the mean μ). Decompression runs the decoder and produces a plausible approximation of the original image, which can sometimes look more natural than a heavily compressed JPEG.


Conclusion

The VAE is a major contribution to generative learning: it showed how a deep model can learn a continuous, structured latent space from which new data can be generated. The reparameterization trick is a simple but profound innovation that opened the way to an entire family of variational models.

Even though GANs generally surpass VAEs in pure visual quality, VAEs remain a strong choice when latent space interpretability and training stability matter. VAEs also saw renewed interest with DeepMind's VQ-VAE (2017) and VQ-VAE-2 (2019), which combine a discrete latent codebook with generation and produce images competitive with GANs.

