Vanishing Gradient Problem
A training difficulty in deep networks where gradients become exponentially smaller as they are propagated back through many layers, making it nearly impossible for early layers to learn.
Why It Matters
The vanishing gradient problem limited neural networks to shallow architectures for decades. Solutions like ReLU, LSTMs, and residual connections unlocked deep learning.
Example
In a 50-layer network, the gradient signal reaching the first layer might be 0.0000001 — effectively zero — meaning those early layers never update their weights.
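The arithmetic behind this can be sketched in plain Python. By the chain rule, the gradient reaching the first layer is a product of one local derivative per layer; with a sigmoid activation (an illustrative assumption here, not the only case) that derivative is at most 0.25, so the product collapses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

# Hypothetical 50-layer chain: multiply one local derivative per layer.
# Even at the sigmoid's steepest point (x = 0, slope 0.25), the product
# shrinks exponentially with depth.
grad = 1.0
for _ in range(50):
    grad *= sigmoid_prime(0.0)  # 0.25, the sigmoid's maximum slope

print(grad)  # ≈ 7.9e-31: effectively zero
```

In a real network the per-layer factors also include weight terms, but the multiplicative structure, and therefore the exponential decay, is the same.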
Think of it like...
Like playing telephone with 50 people — the message gets so distorted by the end that the first person cannot effectively adjust what they said based on the final output.
Related Terms
Backpropagation
The primary algorithm used to train neural networks. It computes how much each weight contributed to the error (the gradient), working backward from the output layer toward the input; an optimizer such as gradient descent then uses those gradients to adjust the weights and reduce future error.
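A minimal sketch of the idea on a hypothetical one-weight network (prediction = w * x, squared-error loss is an assumption for illustration):

```python
def backprop_step(w, x, target, lr=0.1):
    # forward pass: prediction and error
    pred = w * x
    # backward pass (chain rule): dloss/dw = 2 * (pred - target) * x
    grad = 2.0 * (pred - target) * x
    # gradient-descent update: move the weight against the gradient
    return w - lr * grad

w = 0.0
for _ in range(20):
    w = backprop_step(w, x=1.0, target=3.0)
print(round(w, 3))  # close to 3.0, the weight that fits the target
```

Real backpropagation applies the same chain-rule step layer by layer, from the output back to the input.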
ReLU
Rectified Linear Unit — the most commonly used activation function in deep learning. It outputs the input directly if positive, and zero otherwise: f(x) = max(0, x).
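The definition translates directly to code; a one-line sketch:

```python
def relu(x):
    # f(x) = max(0, x): pass positive inputs through, zero out the rest
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5]])  # [0.0, 0.0, 0.0, 1.5]
```

Because the derivative is exactly 1 for positive inputs, ReLU avoids the shrinking chain-rule factors that sigmoids introduce.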
Residual Connection
A shortcut that allows the input to a layer to bypass one or more layers and be added directly to their output. This enables training of much deeper networks by giving gradients a direct path back to earlier layers.
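A minimal sketch of the output-equals-layer-plus-input pattern, with `layer` standing in for an arbitrary transformation (the 0.1 scaling is a placeholder, not a real layer):

```python
def layer(x):
    # placeholder transformation (a real layer would apply weights
    # and an activation function)
    return [0.1 * v for v in x]

def residual_block(x):
    # output = layer(x) + x: the input skips around the layer and is
    # added element-wise to the layer's output
    return [f + xi for f, xi in zip(layer(x), x)]

print(residual_block([1.0, 2.0]))
```

Even if `layer`'s gradient vanishes, the `+ x` term contributes a derivative of 1, so the gradient can always flow through the shortcut.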
Batch Normalization
A technique that normalizes the inputs to each layer across a mini-batch so they have zero mean and unit variance, typically followed by a learned scale and shift. This stabilizes and accelerates the training process.
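The normalization step can be sketched in a few lines (omitting the learned scale and shift; `eps` is the usual small constant that guards against division by zero):

```python
import math

def batch_norm(values, eps=1e-5):
    # normalize a batch of activations to zero mean, unit variance
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print(out)  # mean of the output is ~0, variance ~1
```

Framework implementations add the learnable scale/shift parameters and track running statistics for use at inference time.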