Self-Attention
A mechanism where each element in a sequence attends to every element, including itself, to compute a new representation, determining how much focus to place on each part of the input. It is the core innovation of the transformer.
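The computation above can be sketched in a few lines of NumPy. This is a simplified illustration: the learned query/key/value projection matrices found in real transformers are omitted here, so each token's raw embedding plays all three roles.

```python
import numpy as np

def self_attention(X):
    """Toy scaled dot-product self-attention (no learned Q/K/V projections)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise similarity between every token pair
    # Softmax over each row turns similarities into attention weights that sum to 1
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ X  # each output is a weighted mix of ALL token embeddings

X = np.random.rand(4, 8)  # 4 tokens, each an 8-dimensional embedding
out = self_attention(X)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Because every output row is a weighted average over all input rows, a token like "it" can draw most of its weight from "the animal" even when the two words are far apart.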
Why It Matters
Self-attention is what allows transformers to understand context and relationships across an entire document, not just nearby words. It is arguably the most important mechanism in modern AI.
Example
In 'The animal didn't cross the street because it was too tired,' self-attention helps the model understand that 'it' refers to 'the animal' (not the street).
Think of it like...
Like a group discussion where each person considers what everyone else said before forming their own contribution — everyone's input influences everyone else.
Related Terms
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.
Multi-Head Attention
An extension of attention where multiple attention mechanisms (heads) run in parallel, each learning to focus on different types of relationships in the data. The outputs are then combined.
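The split-attend-concatenate pattern can be sketched as follows. This is a toy version: real implementations apply learned per-head query/key/value projections and a final output projection, which are omitted here, and the simple feature-slicing shown is just one illustrative way to form heads.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads):
    """Toy multi-head attention: slice features, attend per head, concatenate."""
    seq_len, d = X.shape
    assert d % num_heads == 0, "feature dim must divide evenly across heads"
    dh = d // num_heads
    heads = []
    for h in range(num_heads):
        Xh = X[:, h * dh:(h + 1) * dh]            # this head's slice of features
        w = softmax(Xh @ Xh.T / np.sqrt(dh))      # per-head attention weights
        heads.append(w @ Xh)                      # each head mixes tokens independently
    return np.concatenate(heads, axis=-1)         # combine head outputs

X = np.random.rand(4, 8)  # 4 tokens, 8-dim embeddings
out = multi_head_attention(X, num_heads=2)
print(out.shape)  # (4, 8)
```

Because each head attends over a different subspace, one head might track syntactic links (subject to verb) while another tracks coreference (pronoun to noun).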
Positional Encoding
A technique used in transformers to inject information about the position of each token in a sequence. Since transformers process all tokens in parallel, they need explicit position information.
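One standard scheme is the sinusoidal encoding from the original 2017 transformer paper, where each position is mapped to a pattern of sines and cosines at different frequencies. A sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]          # column of token positions
    i = np.arange(d_model // 2)[None, :]       # row of dimension-pair indices
    angles = pos / (10000 ** (2 * i / d_model))  # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(10, 16)  # 10 positions, 16-dim model
print(pe.shape)  # (10, 16)
```

The resulting matrix is added to the token embeddings before the first attention layer, so otherwise identical tokens at different positions get distinct inputs.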