Artificial Intelligence

Attention Window

The range of tokens that an attention mechanism can attend to in a single computation. Different attention patterns (local, global, sliding) use different window sizes.

Why It Matters

Attention window design determines the tradeoff between context length and computational efficiency — a key architectural decision for long-context models.

Example

With sliding-window attention and a window of 4,096 tokens, each token attends only to the 4,096 tokens nearest to it (in causal models, the 4,096 preceding tokens). A small set of designated global-attention tokens attend to, and are attended by, every position, providing coverage of the full sequence.
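The pattern above can be sketched as a boolean attention mask. This is a minimal illustration, not any specific model's implementation: it assumes a symmetric (bidirectional) window via `abs(i - j) < window`, whereas a causal model would instead require `0 <= i - j < window`. The window size of 4 and the `global_tokens` parameter are chosen here for readability.

```python
# Sketch: build a sliding-window attention mask with optional global tokens.
# mask[i][j] is True when token i may attend to token j.
def sliding_window_mask(seq_len, window, global_tokens=()):
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            in_window = abs(i - j) < window  # local neighborhood (symmetric)
            # Global tokens attend to, and are attended by, every position.
            is_global = i in global_tokens or j in global_tokens
            mask[i][j] = in_window or is_global
    return mask

# Example: 8 tokens, window of 4, token 0 acting as a global attention token.
mask = sliding_window_mask(8, 4, global_tokens={0})
print(mask[7])  # token 7 sees token 0 (global) plus its local window
```

Note how token 7 reaches token 0 only through the global-attention path; without global tokens, information would have to propagate window by window across layers.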

Think of it like...

Like the field of vision when driving — you focus on what is nearby (local attention) while periodically checking mirrors for the broader picture (global attention).

Related Terms