Mixture of Depths
A transformer architecture in which tokens can dynamically skip individual layers: a learned router at each layer decides, per token, whether to run the block or bypass it through the residual connection, so the model spends more computation on complex tokens and less on simple ones.
Why It Matters
MoD makes transformers more efficient by routing easy tokens around layers they don't need, cutting average inference cost per token while maintaining output quality.
Example
The word 'the' might skip most transformer layers (it is simple and predictable), while a word like 'paradoxically' passes through all of them (it requires more processing).
Think of it like...
Like an express lane at airport security: travelers with simple cases move through quickly, while complex cases get more thorough screening.
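The per-layer routing described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function names, linear router, and score-weighted output are assumptions for clarity, not the exact published implementation): a router scores every token, the top-k tokens run through the block, and the rest skip it via the residual stream.

```python
import numpy as np

def mod_layer(x, router_w, block_fn, capacity=0.5):
    """Illustrative Mixture-of-Depths routing for one transformer block.

    x:         (seq_len, d_model) token representations
    router_w:  (d_model,) weights of a hypothetical linear router
    block_fn:  the block's computation (attention + MLP), run only on routed tokens
    capacity:  fraction of tokens allowed through the block
    """
    seq_len = x.shape[0]
    k = max(1, int(seq_len * capacity))
    scores = x @ router_w                 # one scalar routing score per token
    top_idx = np.argsort(scores)[-k:]     # top-k tokens get full processing
    out = x.copy()                        # all other tokens skip via the residual path
    # weighting the block output by the router score keeps routing differentiable
    out[top_idx] = x[top_idx] + scores[top_idx, None] * block_fn(x[top_idx])
    return out

# Toy usage: a trivial stand-in block on random tokens
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
w = rng.normal(size=4)
y = mod_layer(x, w, block_fn=lambda t: t * 0.1, capacity=0.25)
print(y.shape)  # (8, 4): 2 of 8 tokens ran the block, 6 skipped it
```

Because the number of routed tokens is fixed by `capacity`, the compute cost per layer is known in advance, which is what makes the savings predictable.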
Related Terms
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Mixture of Experts
An architecture where a model consists of multiple specialized sub-networks (experts) and a gating mechanism that routes each input to only the most relevant experts. Only a fraction of the total parameters are active per input.
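To contrast with MoD's layer-skipping, a minimal sketch of MoE-style gating (simplified and with hypothetical names; real MoE layers add load balancing and batched expert dispatch): a gate scores the experts, only the top-k run, and their outputs are combined by softmax weights.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Illustrative token-level Mixture-of-Experts routing.

    x:       (d_model,) one token's representation
    gate_w:  (d_model, n_experts) gating weights
    experts: list of callables, each a small feed-forward network
    """
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]     # select the top_k experts for this token
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the selected experts only
    # only the chosen experts execute; the rest stay inactive for this token
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: three stand-in "experts" that just scale the input
rng = np.random.default_rng(1)
x = rng.normal(size=4)
gate = rng.normal(size=(4, 3))
experts = [lambda t, s=s: t * s for s in (0.5, 1.0, 2.0)]
y = moe_layer(x, gate, experts, top_k=2)
print(y.shape)  # (4,)
```

The key difference: MoE routes each token to *which* sub-network processes it, while MoD routes each token to *whether* a given layer processes it at all.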
Sparse Model
A neural network where most parameters are zero or inactive for any given input. Sparse models achieve high capacity with lower computational cost by only using relevant parameters.