Masked Language Model
A pre-training objective in which randomly selected tokens in the input are replaced with a special [MASK] token and the model learns to predict the original tokens from the surrounding context; the training loss is computed only at the masked positions. This is how BERT was pre-trained.
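The masking step can be sketched in a few lines of Python. This follows BERT's published recipe (select roughly 15% of tokens as prediction targets; of those, replace 80% with [MASK], 10% with a random vocabulary token, and leave 10% unchanged). The function and variable names here are illustrative, not taken from any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: pick ~mask_prob of tokens as prediction targets.
    Of those, 80% become [MASK], 10% a random vocab token, 10% unchanged.
    Returns (masked, labels): labels holds the original token at each
    target position and None elsewhere (no loss at those positions)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # the model must recover this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)     # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)      # 10%: keep the original token
        else:
            labels.append(None)         # not a prediction target
            masked.append(tok)
    return masked, labels

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=42)
```

The 10% random-replacement and 10% keep-as-is cases matter because [MASK] never appears at inference time; mixing in unmasked targets keeps the model's representations useful for real text.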
Why It Matters
Masked language modeling enables bidirectional understanding — the model learns from both left and right context simultaneously, producing richer representations than causal (left-to-right) language models, which can only attend to preceding tokens.
Example
Input: 'The [MASK] sat on the mat.' The model learns to predict 'cat' by understanding the surrounding context from both directions.
Think of it like...
Like a fill-in-the-blank test where you use surrounding clues to figure out the missing word — the more context you consider, the better your guess.
Related Terms
BERT
Bidirectional Encoder Representations from Transformers — a language model developed by Google that reads text in both directions simultaneously. BERT excels at understanding language rather than generating it.
Self-Supervised Learning
A training approach where the model generates its own labels from the data, typically by masking or hiding parts of the input and learning to predict them. No human-annotated labels are needed.
Pre-training
The initial phase of training a model on a large, general-purpose dataset before specializing it for specific tasks. Pre-training gives the model broad knowledge and capabilities.