Data Parallelism
A distributed training approach where the training data is split across multiple GPUs, each holding a complete copy of the model. After each batch, gradients are averaged across GPUs so every copy applies the same update and the model replicas stay identical.
Why It Matters
Data parallelism is the simplest and most common way to speed up training. In the best case it scales nearly linearly: 8 GPUs can train roughly 8x faster, minus the communication overhead of synchronizing gradients.
Example
Splitting a batch of 256 images across 8 GPUs (32 per GPU), each computing gradients independently, then averaging to update the shared model.
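The example above can be sketched in plain Python, with lists standing in for per-GPU shards and a single averaging step standing in for the all-reduce. This is a toy model (a single weight w fit to y = 2x with squared-error loss), not any real framework's API; the point is that averaging per-shard gradients reproduces the full-batch gradient.

```python
def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

def shard(data, num_gpus):
    """Split the global batch into equal per-GPU shards."""
    per_gpu = len(data) // num_gpus
    return [data[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]

# Global batch of 256 examples; targets follow y = 2x.
batch = [(float(x), 2.0 * x) for x in range(256)]
w = 0.0
shards = shard(batch, num_gpus=8)   # 32 examples per "GPU"

# Each "GPU" computes the mean gradient over its own shard independently...
local_grads = [sum(gradient(w, x, y) for x, y in s) / len(s) for s in shards]

# ...then the gradients are averaged (the all-reduce step).
global_grad = sum(local_grads) / len(local_grads)

# This matches the gradient a single GPU would compute on the full batch.
full_grad = sum(gradient(w, x, y) for x, y in batch) / len(batch)
assert abs(global_grad - full_grad) < 1e-9
```

Because every shard is the same size, the average of the shard means equals the full-batch mean, which is why all replicas end up applying the same update.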
Think of it like...
Like a teacher giving different practice problems to each student but using the same textbook — everyone learns from different data but contributes to the same understanding.
Related Terms
Distributed Training
Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.
Model Parallelism
A distributed training approach where the model itself is split across multiple GPUs, with each GPU holding and computing a different portion of the model.
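The contrast with data parallelism can be sketched as follows, with two plain Python functions standing in for layer groups placed on different GPUs. The layer math is made up for illustration; the structural point is that activations flow between devices and neither device holds the whole model.

```python
def layers_on_gpu0(x):
    # First half of the model lives on GPU 0 (toy computation).
    return 3 * x + 1

def layers_on_gpu1(h):
    # Second half lives on GPU 1 (toy computation).
    return h * h

def forward(x):
    h = layers_on_gpu0(x)   # in practice, h is copied from GPU 0 to GPU 1 here
    return layers_on_gpu1(h)

result = forward(2)         # (3*2 + 1)**2 = 49
assert result == 49
```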
GPU
Graphics Processing Unit — originally designed for rendering graphics, GPUs excel at the parallel mathematical operations needed for training and running AI models. They are the primary hardware for modern AI.
Batch Size
The number of training examples processed together before the model updates its parameters. Batch size affects training speed, memory usage, and how smoothly the model learns.
Gradient Accumulation
A technique that simulates larger batch sizes by accumulating gradients over multiple forward passes before performing a single weight update. This enables large effective batch sizes on limited hardware.
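A minimal sketch of gradient accumulation, reusing the same toy single-weight model with loss (w*x - y)**2 (an assumption for illustration, not a framework API). The hardware here fits micro-batches of 8, but we want an effective batch of 32, so gradients from 4 micro-batches are summed before one weight update.

```python
def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

batch = [(float(x), 2.0 * x) for x in range(32)]       # effective batch of 32
micro_batches = [batch[i:i + 8] for i in range(0, 32, 8)]

w, lr = 0.0, 0.001
accumulated = 0.0
for micro in micro_batches:
    # Forward/backward on one micro-batch; accumulate its summed gradient
    # WITHOUT updating w yet.
    accumulated += sum(gradient(w, x, y) for x, y in micro)

# One weight update using the mean gradient over all 32 examples.
w -= lr * accumulated / len(batch)

# Identical to a single update computed on the full batch of 32 at once.
w_full = 0.0 - lr * sum(gradient(0.0, x, y) for x, y in batch) / len(batch)
assert abs(w - w_full) < 1e-9
```

The equivalence holds because the weight is frozen during accumulation, so every micro-batch gradient is taken at the same w; deferring the update is what makes the small batches add up to one large one.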