Model Parallelism
A distributed training approach where the model itself is split across multiple GPUs, with each GPU holding and computing a different portion of the model.
Why It Matters
Model parallelism enables training models too large to fit on a single GPU. Without it, models with hundreds of billions or trillions of parameters could not be trained at all, since their weights alone exceed the memory of any single device.
Example
Splitting a 175B-parameter model across 8 GPUs leaves each GPU holding ~22B parameters. Different layers or components of the model run on different devices, with activations passed between them.
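The layer-splitting pattern above can be sketched in PyTorch. This is a minimal two-device illustration, not a production setup: the stage sizes, layer shapes, and fallback-to-CPU logic are assumptions for the example.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Layer-wise model parallelism: each stage lives on its own device."""
    def __init__(self):
        super().__init__()
        # Use two GPUs when available; otherwise fall back to CPU so the
        # sketch still runs (on CPU both "devices" are the same memory).
        two_gpus = torch.cuda.device_count() >= 2
        self.dev0 = torch.device("cuda:0" if two_gpus else "cpu")
        self.dev1 = torch.device("cuda:1" if two_gpus else "cpu")
        self.stage0 = nn.Linear(512, 1024).to(self.dev0)  # held only on device 0
        self.stage1 = nn.Linear(1024, 10).to(self.dev1)   # held only on device 1

    def forward(self, x):
        h = torch.relu(self.stage0(x.to(self.dev0)))
        # The activation tensor, not the weights, crosses the device boundary.
        return self.stage1(h.to(self.dev1))

model = TwoStageModel()
out = model(torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 10])
```

Note that each GPU stores only its own stage's weights; the cost of this layout is the device-to-device transfer of activations on every forward and backward pass.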
Think of it like...
Like a factory assembly line where different workers handle different stages of production — the product (data) moves through workers (GPUs), each performing their specialized step.
Related Terms
Distributed Training
Splitting model training across multiple GPUs or machines to handle larger models or datasets and reduce training time. Techniques include data parallelism and model parallelism.
Data Parallelism
A distributed training approach where the training data is split across multiple GPUs, each holding a complete copy of the model. Gradients are averaged across GPUs after each batch.
GPU
Graphics Processing Unit — originally designed for rendering graphics, GPUs excel at the parallel mathematical operations needed for training and running AI models. They are the primary hardware for modern AI.