Data Parallelism
A distributed training approach where the training data is split across multiple GPUs, each holding a complete copy of the model. After each batch, gradients are averaged across GPUs so every copy applies the same update and the model replicas stay identical.
Why It Matters
Data parallelism is the simplest and most common way to speed up training. In the best case it scales nearly linearly: 8 GPUs can train roughly 8x faster, minus the communication overhead of synchronizing gradients.
Example
Splitting a batch of 256 images across 8 GPUs (32 per GPU), each computing gradients independently, then averaging to update the shared model.
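The example above can be sketched in plain Python, with lists standing in for per-GPU shards and a single averaging step standing in for the all-reduce. This is a toy model (a single weight w fit to y = 2x with squared-error loss), not any real framework's API; the point is that averaging per-shard gradients reproduces the full-batch gradient.

```python
def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

def shard(data, num_gpus):
    """Split the global batch into equal per-GPU shards."""
    per_gpu = len(data) // num_gpus
    return [data[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]

# Global batch of 256 examples; targets follow y = 2x.
batch = [(float(x), 2.0 * x) for x in range(256)]
w = 0.0
shards = shard(batch, num_gpus=8)   # 32 examples per "GPU"

# Each "GPU" computes the mean gradient over its own shard independently...
local_grads = [sum(gradient(w, x, y) for x, y in s) / len(s) for s in shards]

# ...then the gradients are averaged (the all-reduce step).
global_grad = sum(local_grads) / len(local_grads)

# This matches the gradient a single GPU would compute on the full batch.
full_grad = sum(gradient(w, x, y) for x, y in batch) / len(batch)
assert abs(global_grad - full_grad) < 1e-9
```

Because every shard is the same size, the average of the shard means equals the full-batch mean, which is why all replicas end up applying the same update.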
Think of it like...
Like a teacher giving different practice problems to each student but using the same textbook — everyone learns from different data but contributes to the same understanding.
Related Terms
Distributed Training
Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.
Model Parallelism
A distributed training approach where the model itself is split across multiple GPUs, with each GPU holding and computing a different portion of the model.
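The contrast with data parallelism can be sketched as follows, with two plain Python functions standing in for layer groups placed on different GPUs. The layer math is made up for illustration; the structural point is that activations flow between devices and neither device holds the whole model.

```python
def layers_on_gpu0(x):
    # First half of the model lives on GPU 0 (toy computation).
    return 3 * x + 1

def layers_on_gpu1(h):
    # Second half lives on GPU 1 (toy computation).
    return h * h

def forward(x):
    h = layers_on_gpu0(x)   # in practice, h is copied from GPU 0 to GPU 1 here
    return layers_on_gpu1(h)

result = forward(2)         # (3*2 + 1)**2 = 49
assert result == 49
```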
GPU
Graphics Processing Unit — originally designed for rendering graphics, GPUs excel at the parallel mathematical operations needed for training and running AI models. They are the primary hardware for modern AI.
Batch Size
The number of training examples processed together before the model updates its parameters. Batch size affects training speed, memory usage, and how smoothly the model learns.
Gradient Accumulation
A technique that simulates larger batch sizes by accumulating gradients over multiple forward passes before performing a single weight update. This enables large effective batch sizes on limited hardware.
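A minimal sketch of gradient accumulation, reusing the same toy single-weight model with loss (w*x - y)**2 (an assumption for illustration, not a framework API). The hardware here fits micro-batches of 8, but we want an effective batch of 32, so gradients from 4 micro-batches are summed before one weight update.

```python
def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

batch = [(float(x), 2.0 * x) for x in range(32)]       # effective batch of 32
micro_batches = [batch[i:i + 8] for i in range(0, 32, 8)]

w, lr = 0.0, 0.001
accumulated = 0.0
for micro in micro_batches:
    # Forward/backward on one micro-batch; accumulate its summed gradient
    # WITHOUT updating w yet.
    accumulated += sum(gradient(w, x, y) for x, y in micro)

# One weight update using the mean gradient over all 32 examples.
w -= lr * accumulated / len(batch)

# Identical to a single update computed on the full batch of 32 at once.
w_full = 0.0 - lr * sum(gradient(0.0, x, y) for x, y in batch) / len(batch)
assert abs(w - w_full) < 1e-9
```

The equivalence holds because the weight is frozen during accumulation, so every micro-batch gradient is taken at the same w; deferring the update is what makes the small batches add up to one large one.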