Attention Sink
A phenomenon in transformers where the first few tokens in a sequence receive disproportionately high attention scores regardless of their content. Because softmax attention weights must sum to one, the model needs somewhere to deposit attention it has no better use for, and these early tokens act as 'sinks' for that excess.
Why It Matters
Understanding attention sinks helps improve model efficiency and enables techniques like StreamingLLM, which maintains performance over very long sequences by keeping the first few (sink) tokens in the KV cache alongside a sliding window of recent tokens.
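A minimal sketch of the cache-retention idea behind StreamingLLM: keep the earliest sink positions plus a sliding window of recent positions, evicting everything in between. The function name and parameters here are illustrative, not the paper's API; the real method operates on per-layer key/value caches.

```python
def streaming_cache_positions(seq_len, n_sink=4, window=8):
    """Token positions retained in a StreamingLLM-style KV cache (sketch).

    Keeps the first `n_sink` positions (the attention sinks) plus the
    `window` most recent positions; everything in between is evicted.
    """
    if seq_len <= n_sink + window:
        return list(range(seq_len))          # nothing to evict yet
    sinks = list(range(n_sink))              # always-kept sink tokens
    recent = list(range(seq_len - window, seq_len))  # sliding window
    return sinks + recent

# After 20 generated tokens, only the 4 sinks and the last 8 tokens remain:
# streaming_cache_positions(20) -> [0, 1, 2, 3, 12, 13, ..., 19]
```

The key design point is that evicting the sink tokens, even though they carry little semantic content, is what degrades generation quality; keeping them cheaply stabilizes the attention distribution.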
Example
The beginning-of-sequence token receiving high attention scores in every layer even though it carries no semantic information — it acts as a default attention target.
Think of it like...
Like a default option in a survey that people select when they are not sure — it absorbs attention that does not have a better target.
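The mechanic behind the analogy is that softmax weights always sum to one, so "no attention" is not an option; excess attention must land somewhere. A toy example with made-up scores, where position 0 plays the learned sink:

```python
import math

def softmax(scores):
    """Convert raw attention scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one query over five keys: the query matches
# none of the content tokens (scores near zero), but the model has
# learned a high score for position 0, the sink.
scores = [4.0, 0.1, 0.0, -0.1, 0.05]
weights = softmax(scores)
# The weights sum to 1, and roughly 93% of the mass lands on the sink.
```

Remove the sink and the same near-zero scores would smear attention almost uniformly over tokens the model does not actually want to attend to.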
Related Terms
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
Self-Attention
A mechanism where each element in a sequence attends to all other elements to compute a representation, determining how much focus to place on each part of the input. It is the core innovation of the transformer.
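A compact sketch of scaled dot-product self-attention, the computation inside which sinks arise. This is a single-head, unmasked version with randomly initialized projection matrices, purely for illustration:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (no masking)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq, seq) relevance
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                # 5 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)        # shape (5, 8): one vector per token
```

Each row of `weights` is one token's attention distribution over all five tokens; in a trained decoder, those rows are where disproportionate mass on position 0 shows up.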
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Long Context
The ability of AI models to process very large amounts of input text — typically 100K tokens or more — enabling analysis of entire books, codebases, or document collections.