Variational Autoencoder: Complete Guide
Summary — The VAE (Variational Autoencoder) is a generative model introduced by Kingma and Welling in 2013. Unlike the classical autoencoder which learns a deterministic representation, the VAE learns a probability distribution in the latent space. This probabilistic approach makes it possible not only to compress data but also to generate new data by sampling in the latent space. The VAE is one of the pillars of deep generative learning, alongside GANs and diffusion models.
Mathematical Principle
1. From the deterministic autoencoder to the probabilistic VAE
A classical autoencoder learns two functions:
– Encoder: h = f_θ(x) → a fixed point in the latent space
– Decoder: r = g_φ(h) → reconstruction
The problem: this latent space is discontinuous and unstructured. Two close points in the latent space can decode into completely different images. You cannot reliably generate new data.
The VAE solves this problem by making the encoder a distribution generator: instead of producing a single vector h, it produces a Gaussian distribution.
2. The variational encoder
The VAE encoder produces two vectors for each input x:
– μ(x): mean of the latent distribution
– σ(x): standard deviation of the latent distribution
The latent representation is then a random sample:
$$z \sim \mathcal{N}(\mu(x), \sigma^2(x) \cdot I)$$
This means that z is no longer a fixed point but a random draw from a Gaussian centered on μ(x), whose spread along each latent dimension is set by σ(x).
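As a minimal sketch of this sampling step, using NumPy only: the values of `mu` and `sigma` below are made-up stand-ins for encoder outputs, not results from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input x, with a 2-D latent space
mu = np.array([0.5, -1.0])     # mean of the latent distribution
sigma = np.array([0.1, 0.3])   # per-dimension standard deviation

# Draw several latent codes z ~ N(mu, diag(sigma^2)) for the same input
z_samples = rng.normal(loc=mu, scale=sigma, size=(5, 2))

print(z_samples)  # five different points scattered around mu
```

The same input therefore maps to a small cloud of latent points rather than to a single one.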
3. Reparameterization trick
The problem is that the sampling operation z ~ N(μ, σ²) is not differentiable with respect to μ and σ, which prevents gradient backpropagation through the encoder.
The solution is the reparameterization trick:
$$z = \mu(x) + \sigma(x) \odot \varepsilon \quad \text{where} \quad \varepsilon \sim \mathcal{N}(0, I)$$
Now the randomness is isolated in ε (which does not depend on the model parameters), and the transformation z = μ + σ·ε, applied element-wise, is differentiable with respect to μ and σ. The gradient can therefore flow through.
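To make the gradient claim concrete, here is a small numerical sketch (NumPy only, scalar case). The test function E[z²] and the values of `mu` and `sigma` are illustrative choices, picked because the exact gradients are known: E[z²] = μ² + σ², so the gradients are 2μ and 2σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5                 # illustrative values, not from a trained encoder
eps = rng.standard_normal(200_000)   # noise, independent of mu and sigma

z = mu + sigma * eps                 # reparameterized samples of z ~ N(mu, sigma^2)

# Gradients of E[z^2], obtained by differentiating z^2 through z = mu + sigma * eps
grad_mu = np.mean(2 * z * 1.0)       # chain rule with dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)    # chain rule with dz/dsigma = eps

# Analytic reference: E[z^2] = mu^2 + sigma^2
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

The Monte Carlo estimates should land close to the analytic values, which is what lets an automatic differentiation framework train μ and σ through the sampling step.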
4. Loss function: ELBO
The VAE maximizes the Evidence Lower Bound (ELBO):
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,||\, p(z))$$
This formula breaks down into two terms:
Reconstruction term: E[log p_θ(x|z)]
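This is the fidelity of the reconstruction. The better the decoder can reconstruct x from z, the larger this term. In practice it is implemented as a reconstruction error: MSE (which corresponds to a Gaussian decoder) or binary cross-entropy (a Bernoulli decoder, suited to pixel values in [0, 1]).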
KL divergence term: KL(q_φ(z|x) || p(z))
This is a regularization that forces the latent distribution q(z|x) to be close to a prior distribution p(z) = N(0, I).
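Why the ELBO is a lower bound follows from a standard identity, stated here for completeness rather than derived. For any encoder distribution q_φ(z|x):

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + \text{KL}(q_\phi(z|x) \,||\, p_\theta(z|x))$$

Because a KL divergence is never negative, this gives

$$\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)$$

with equality when q_φ(z|x) matches the true posterior p_θ(z|x). This gap term involves the intractable true posterior and is never computed during training; only the KL term to the prior p(z), which appears inside the ELBO, is evaluated.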
For Gaussians, the KL divergence has a closed form:
$$\text{KL} = \frac{1}{2} \sum_{j=1}^{d} (\sigma_j^2 + \mu_j^2 - \log(\sigma_j^2) - 1)$$
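The closed form can be checked on a couple of hand-computable cases. This is a minimal NumPy sketch; the function name `gaussian_kl` is a hypothetical helper, and it takes the log-variance log(σ²) as input, which is the quantity the Keras encoder in this guide predicts.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - log_var - 1.0, axis=-1)

# Posterior equal to the prior: the KL is exactly 0
print(gaussian_kl(np.zeros(2), np.zeros(2)))            # 0.0

# Shifted mean, unit variance: each dimension contributes mu_j^2 / 2
print(gaussian_kl(np.array([1.0, 2.0]), np.zeros(2)))   # 2.5
```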
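Maximizing the ELBO is equivalent to minimizing its negative, so the loss function to minimize is:
$$\text{Loss} = \text{Reconstruction} + \beta \cdot \text{KL}$$
where Reconstruction is the reconstruction error (the negative of the reconstruction term above) and β is a hyperparameter that weights the KL term (β = 1 gives the standard VAE).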
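The following NumPy sketch computes this loss on a tiny made-up batch. It assumes a Bernoulli decoder (binary cross-entropy summed over pixels); the function name `vae_loss` and all input values are illustrative.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0, eps=1e-7):
    """Negative ELBO averaged over a batch (Bernoulli decoder assumed).

    x, x_recon: arrays of shape (batch, n_pixels) with values in [0, 1]
    mu, log_var: arrays of shape (batch, latent_dim)
    """
    x_recon = np.clip(x_recon, eps, 1.0 - eps)  # avoid log(0)
    # Reconstruction error: binary cross-entropy summed over pixels
    recon = -np.sum(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon), axis=1)
    # Closed-form KL to the N(0, I) prior, summed over latent dimensions
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - log_var - 1.0, axis=1)
    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)

# Tiny made-up batch: 2 "images" of 4 pixels, 2 latent dimensions
x = np.array([[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]])
x_recon = np.array([[0.1, 0.9, 0.8, 0.2], [0.7, 0.3, 0.2, 0.9]])
mu = np.array([[0.5, -0.5], [0.0, 1.0]])
log_var = np.zeros((2, 2))

total, recon, kl = vae_loss(x, x_recon, mu, log_var, beta=1.0)
print(total, recon, kl)
```

Raising `beta` increases only the weight of the KL part, which is the trade-off between regularization and reconstruction accuracy.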
5. Generating new data
Once trained, you can generate new data by sampling directly from the prior distribution:
$$z_{new} \sim \mathcal{N}(0, I) \quad \rightarrow \quad x_{new} = g_\phi(z_{new})$$
Since the KL term has pulled the encoded distributions toward N(0, I) during training, the decoder has seen latent codes from most of this region and can turn points sampled from the prior into plausible outputs.
Intuition
Imagine a cartographer who needs to draw a map of a continent.
The classical autoencoder places each city at a precise point on the map. If you ask “what is 0.1 km northeast of Lyon?” it has no answer — it only learned discrete points. The map has gaps between known cities.
The VAE, on the other hand, does not place a point but draws a circle of uncertainty around each city. The circles overlap, creating a continuous map where every point on the map corresponds to something plausible. If you point to a location between Lyon and Saint-Étienne, the VAE can describe a plausible city that could exist there.
It is the difference between a phone book (autoencoder: one fixed entry per person) and a road map (VAE: a continuous space where every point has meaning).
Python Implementation
Example 1: Basic VAE with Keras
```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models

# Note: written for tf.keras in TensorFlow 2.x (Keras 2). Calling tf ops and
# add_loss on symbolic tensors, as done below, may not work under Keras 3.

# Parameters
latent_dim = 2
input_shape = (28, 28, 1)  # MNIST

# Encoder: maps an image to the mean and log-variance of q(z|x).
# Predicting log(sigma^2) instead of sigma keeps the output unconstrained.
inputs = keras.Input(shape=input_shape)
x = layers.Flatten()(inputs)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dense(128, activation='relu')(x)
z_mean = layers.Dense(latent_dim, name='z_mean')(x)
z_log_var = layers.Dense(latent_dim, name='z_log_var')(x)


# Reparameterization trick: z = mu + sigma * epsilon, sigma = exp(0.5 * log_var)
class Sampling(layers.Layer):
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.random.normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon


z = Sampling()([z_mean, z_log_var])
encoder = models.Model(inputs, [z_mean, z_log_var, z], name='encoder')

# Decoder: maps a latent code back to a 28x28 image with pixels in [0, 1]
latent_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(128, activation='relu')(latent_inputs)
x = layers.Dense(256, activation='relu')(x)
outputs = layers.Dense(784, activation='sigmoid')(x)
outputs = layers.Reshape((28, 28, 1))(outputs)
decoder = models.Model(latent_inputs, outputs, name='decoder')

# Full VAE model
outputs_comb = decoder(z)
vae = models.Model(inputs, outputs_comb, name='vae')

# Reconstruction loss: binary cross-entropy summed over the 784 pixels,
# averaged over the batch (binary_crossentropy returns a per-pixel mean)
reconstruction_loss = tf.reduce_mean(
    keras.losses.binary_crossentropy(
        tf.reshape(inputs, [-1, 784]),
        tf.reshape(outputs_comb, [-1, 784])
    )
) * 784

# KL divergence to N(0, I): summed over latent dimensions, averaged over the batch
kl_loss = -0.5 * tf.reduce_mean(
    tf.reduce_sum(
        1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
        axis=1
    )
)

total_loss = reconstruction_loss + kl_loss  # beta = 1
vae.add_loss(total_loss)

# Compilation: no loss argument, the loss was attached with add_loss
vae.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3))
vae.summary()
```
Example 2: Training on MNIST
```python
import matplotlib.pyplot as plt

# Loading data: scale pixels to [0, 1] and add a channel axis
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1).astype('float32') / 255.0
x_test = np.expand_dims(x_test, -1).astype('float32') / 255.0

# Training (no targets: the loss is computed inside the model)
history = vae.fit(
    x_train,
    epochs=20,
    batch_size=128,
    validation_data=(x_test, None)
)

# Image generation: sample latent codes from the prior N(0, I) and decode them
n_samples = 10
z_samples = np.random.normal(size=(n_samples, latent_dim))
generated_imgs = decoder.predict(z_samples)

# Display generated images
fig, axes = plt.subplots(1, n_samples, figsize=(15, 3))
for i, ax in enumerate(axes):
    ax.imshow(generated_imgs[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.suptitle('Images generated by the VAE')
plt.tight_layout()
plt.savefig('vae_generated.png', dpi=150)
```
Example 3: 2D Latent Space Visualization
```python
# Encode the entire test set
encoded = encoder.predict(x_test)
z_means = encoded[0]  # (10000, 2)

# Load the test labels for coloring.
# load_data() returns ((x_train, y_train), (x_test, y_test)).
_, (_, y_test) = keras.datasets.mnist.load_data()

fig, ax = plt.subplots(figsize=(10, 10))
scatter = ax.scatter(z_means[:, 0], z_means[:, 1], c=y_test,
                     cmap='tab10', alpha=0.6, s=5)
plt.colorbar(scatter, ticks=range(10), label='Digit')
ax.set_title('VAE Latent Space - MNIST (2D)')
ax.set_xlabel('z1')
ax.set_ylabel('z2')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('vae_latent_space.png', dpi=150)
```
Example 4: Interpolation in Latent Space
```python
# Take two encoded images and interpolate between them
z1 = z_means[0]  # first latent point, shape (latent_dim,)
z2 = z_means[1]  # second latent point
n_steps = 20
# Linear interpolation; 1-D endpoints give an array of shape (n_steps, latent_dim)
interp_z = np.linspace(z1, z2, n_steps)
interp_imgs = decoder.predict(interp_z)

# Display the interpolation (2 rows of 10 images = n_steps)
fig, axes = plt.subplots(2, 10, figsize=(15, 4))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(interp_imgs[i].squeeze(), cmap='gray')
    ax.axis('off')
plt.suptitle('Interpolation in the VAE latent space')
plt.tight_layout()
plt.savefig('vae_interpolation.png', dpi=150)
```
Hyperparameters
| Hyperparameter | Typical value | Description |
|---|---|---|
| `latent_dim` | 2-512 | Dimension of the latent space. 2 for visualization, 32-128 for generation |
| `beta` (β) | 0.1-4 | Weight of the KL divergence. Larger = more regularized latent space but less accurate reconstruction |
| architecture | Dense or Conv | Dense for flat data, Conv2D for images |
| optimizer | Adam | Adam is the standard choice. lr = 1e-3 |
| `epochs` | 50-200 | More epochs = better reconstruction, but risk of overfitting |
| `batch_size` | 64-256 | Standard for stable convergence |
Advantages of VAE
- Generating new data: The VAE can create realistic samples by sampling in the latent space. It is a generative model, not just a compressor.
- Continuous and structured latent space: Unlike the classical autoencoder, the VAE’s latent space is continuous: you can navigate, interpolate, and explore meaningfully.
- Analytical loss function: The KL divergence between Gaussians has a closed form, making training stable and fast, without the unstable adversarial terms of GANs.
- Interpretability of the latent space: With latent_dim = 2, you can directly visualize the latent space. For higher dimensions, correlation analyses can reveal which dimensions encode specific features.
- Varied applications: Beyond image generation, VAEs are used for anomaly detection, drug discovery, and image compression.
Limitations of VAE
- Blurry generated images: Pixel-wise reconstruction losses such as MSE favor an average over the plausible outputs rather than a single sharp one. VAE generations are therefore often blurrier than GAN outputs.
- Posterior collapse: Sometimes a powerful decoder learns to ignore z. The encoder's q(z|x) then matches the prior for every input and the KL drops to zero. The VAE is then reduced to a decoder that generates images without using the latent space.
- Choice of prior distribution: The choice of N(0, I) as a prior distribution is convenient but can be too restrictive for complex data with multimodal structures.
- Difficult to evaluate: Unlike a classifier where accuracy is clear, evaluating the quality of a VAE’s generations requires complex metrics like Inception Score or FID.
- Less performant than GANs in visual quality: GANs produce sharper and more realistic images, at the cost of more unstable training.
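One simple way to check for posterior collapse is to look at how much KL each latent dimension contributes on average. The sketch below uses NumPy and made-up encoder outputs; the helper name `kl_per_dimension` and the threshold are assumptions chosen for illustration.

```python
import numpy as np

def kl_per_dimension(mu, log_var):
    """Average KL to N(0, I) contributed by each latent dimension over a batch."""
    kl = 0.5 * (np.exp(log_var) + mu**2 - log_var - 1.0)  # shape (batch, latent_dim)
    return kl.mean(axis=0)

# Made-up encoder outputs: dimension 0 carries information, dimension 1 has
# collapsed onto the prior (mean 0, variance 1 for every input)
rng = np.random.default_rng(0)
mu = np.stack([rng.normal(0.0, 2.0, size=1000), np.zeros(1000)], axis=1)
log_var = np.stack([np.full(1000, -2.0), np.zeros(1000)], axis=1)

kl_dims = kl_per_dimension(mu, log_var)
threshold = 0.01  # arbitrary cut-off chosen for this illustration
print(kl_dims)
print("collapsed dimensions:", np.where(kl_dims < threshold)[0])
```

With the encoder from Example 1, the two arrays would come from `z_mean, z_log_var, _ = encoder.predict(x_test)`.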
4 Concrete Use Cases
1. Human Face Generation
VAEs trained on face datasets (CelebA, FFHQ) learn a structured latent space in which some dimensions or directions tend to correlate with features such as smile, orientation, age, or lighting. A plain VAE does not guarantee one feature per dimension, but by moving along these directions you can generate faces with partly controllable attributes.
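A common way to find such a direction is to compare the average latent code of examples with and without an attribute. The sketch below uses synthetic latent codes in NumPy; the attribute labels, the helper `attribute_direction`, and the step size 1.5 are all hypothetical, and a real setup would use encoder outputs and labeled images.

```python
import numpy as np

def attribute_direction(z, has_attribute):
    """Difference between the mean latent code with and without an attribute."""
    return z[has_attribute].mean(axis=0) - z[~has_attribute].mean(axis=0)

# Synthetic latent codes: the (hypothetical) attribute shifts dimension 0 by +2
rng = np.random.default_rng(0)
has_smile = rng.random(1000) < 0.5
z = rng.normal(size=(1000, 4))
z[has_smile, 0] += 2.0

direction = attribute_direction(z, has_smile)

# Move one latent code along the direction; decoding z_edited with a trained
# decoder would then be expected to strengthen the attribute
z_original = z[0]
z_edited = z_original + 1.5 * direction
print(direction)
```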
2. Anomaly Detection by Reconstruction
A VAE trained only on normal data learns to reconstruct normal patterns well. When an anomaly (a machine failure, a fraud, an abnormal medical image) is presented, the reconstruction error is high — signaling the anomaly.
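The decision rule can be sketched as follows, using NumPy and synthetic data in place of real reconstructions. The helper names and the 99% quantile are assumptions for illustration; in practice `x_recon` would come from passing `x` through the trained VAE, and the threshold would be tuned on held-out normal data.

```python
import numpy as np

def reconstruction_errors(x, x_recon):
    """Mean squared error per sample, over all non-batch axes."""
    diff = (x - x_recon).reshape(len(x), -1)
    return np.mean(diff**2, axis=1)

def fit_threshold(normal_errors, quantile=0.99):
    """Pick a threshold so that about 1% of normal samples would be flagged."""
    return np.quantile(normal_errors, quantile)

# Synthetic stand-ins for real data and reconstructions
rng = np.random.default_rng(0)
x_normal = rng.random((500, 16))
recon_normal = x_normal + rng.normal(0.0, 0.02, size=x_normal.shape)  # small error
x_odd = rng.random((5, 16))
recon_odd = x_odd + rng.normal(0.0, 0.5, size=x_odd.shape)            # large error

threshold = fit_threshold(reconstruction_errors(x_normal, recon_normal))
is_anomaly = reconstruction_errors(x_odd, recon_odd) > threshold
print(is_anomaly)
```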
3. Drug Discovery (Molecule Design)
In computational chemistry, VAEs encode molecular representations (SMILES strings) in a continuous latent space. You can then navigate this space to find new molecules with desired properties (solubility, efficacy, reduced toxicity).
4. Image Compression with Generation
Although less efficient than JPEG for pure compression, a VAE offers “intelligent” compression: instead of storing pixels, you store a compact latent code (for example the mean μ). Decompression generates a plausible approximation of the original image, sometimes more natural-looking than a heavily compressed JPEG.
Conclusion
The VAE is a major contribution to generative learning: it showed how a deep learning model can learn a continuous, structured, and generative latent space with ordinary gradient-based training. The reparameterization trick is a simple but profound innovation that opened the way to an entire family of variational models.
Even though GANs generally surpass VAEs in pure visual quality, VAEs remain attractive for latent space interpretability and training stability. Discrete variants such as DeepMind’s VQ-VAE (2017) and VQ-VAE-2 (2019), which combine a quantized latent space with a learned prior, narrowed that gap and produce images competitive with GANs.

