Mixture of Depths
A transformer architecture in which tokens can dynamically skip individual layers: a learned router at each layer decides, per token, whether to run the block or bypass it through the residual connection, so the model spends more computation on complex tokens and less on simple ones.
Why It Matters
MoD makes transformers more efficient by routing easy tokens around layers they don't need, cutting average inference cost per token while maintaining output quality.
Example
The word 'the' might skip most transformer layers (it is simple and predictable), while a word like 'paradoxically' passes through all of them (it requires more processing).
Think of it like...
Like an express lane at airport security: travelers with simple cases move through quickly, while complex cases get more thorough screening.
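The per-layer routing described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function names, linear router, and score-weighted output are assumptions for clarity, not the exact published implementation): a router scores every token, the top-k tokens run through the block, and the rest skip it via the residual stream.

```python
import numpy as np

def mod_layer(x, router_w, block_fn, capacity=0.5):
    """Illustrative Mixture-of-Depths routing for one transformer block.

    x:         (seq_len, d_model) token representations
    router_w:  (d_model,) weights of a hypothetical linear router
    block_fn:  the block's computation (attention + MLP), run only on routed tokens
    capacity:  fraction of tokens allowed through the block
    """
    seq_len = x.shape[0]
    k = max(1, int(seq_len * capacity))
    scores = x @ router_w                 # one scalar routing score per token
    top_idx = np.argsort(scores)[-k:]     # top-k tokens get full processing
    out = x.copy()                        # all other tokens skip via the residual path
    # weighting the block output by the router score keeps routing differentiable
    out[top_idx] = x[top_idx] + scores[top_idx, None] * block_fn(x[top_idx])
    return out

# Toy usage: a trivial stand-in block on random tokens
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
w = rng.normal(size=4)
y = mod_layer(x, w, block_fn=lambda t: t * 0.1, capacity=0.25)
print(y.shape)  # (8, 4): 2 of 8 tokens ran the block, 6 skipped it
```

Because the number of routed tokens is fixed by `capacity`, the compute cost per layer is known in advance, which is what makes the savings predictable.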
Related Terms
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Mixture of Experts
An architecture where a model consists of multiple specialized sub-networks (experts) and a gating mechanism that routes each input to only the most relevant experts. Only a fraction of the total parameters are active per input.
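To contrast with MoD's layer-skipping, a minimal sketch of MoE-style gating (simplified and with hypothetical names; real MoE layers add load balancing and batched expert dispatch): a gate scores the experts, only the top-k run, and their outputs are combined by softmax weights.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Illustrative token-level Mixture-of-Experts routing.

    x:       (d_model,) one token's representation
    gate_w:  (d_model, n_experts) gating weights
    experts: list of callables, each a small feed-forward network
    """
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]     # select the top_k experts for this token
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the selected experts only
    # only the chosen experts execute; the rest stay inactive for this token
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: three stand-in "experts" that just scale the input
rng = np.random.default_rng(1)
x = rng.normal(size=4)
gate = rng.normal(size=(4, 3))
experts = [lambda t, s=s: t * s for s in (0.5, 1.0, 2.0)]
y = moe_layer(x, gate, experts, top_k=2)
print(y.shape)  # (4,)
```

The key difference: MoE routes each token to *which* sub-network processes it, while MoD routes each token to *whether* a given layer processes it at all.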
Sparse Model
A neural network where most parameters are zero or inactive for any given input. Sparse models achieve high capacity with lower computational cost by only using relevant parameters.