Perplexity
A metric that measures how well a language model predicts text. Lower perplexity indicates the model is less 'surprised' by the text, meaning it can predict the next token more accurately.
Why It Matters
Perplexity is the standard intrinsic evaluation metric for language models. It enables quick comparison of model quality during development without expensive human evaluation.
Example
A model with a perplexity of 10 on a text is, on average, as uncertain as if it were choosing among 10 equally likely next tokens at each step — lower is better.
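Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch, assuming hypothetical per-token probabilities from a model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    the model assigned to each token in the sequence."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# If the model assigns every token a probability of 0.1,
# it is as uncertain as choosing among 10 equally likely options:
print(perplexity([0.1, 0.1, 0.1, 0.1]))  # ≈ 10.0
```

In practice the per-token probabilities come from the model's output distribution over its vocabulary; the formula is the same.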
Think of it like...
Like a guessing game where you predict the next word in a sentence — a perplexity of 5 means you are as confused as if you had to choose between 5 equally likely options.
Related Terms
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Cross-Entropy
A loss function commonly used in classification tasks that measures the difference between the predicted probability distribution and the actual distribution. Lower cross-entropy means better predictions.
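Cross-entropy is directly tied to perplexity: perplexity is the exponential of the average cross-entropy loss per token. A minimal sketch with a made-up predicted distribution and a one-hot target:

```python
import math

def cross_entropy(predicted, actual):
    """Cross-entropy H(p, q) = -sum(p * log(q)) between the actual
    distribution p and the predicted distribution q."""
    return -sum(p * math.log(q) for p, q in zip(actual, predicted) if p > 0)

# One-hot target: the true next token is index 1.
predicted = [0.1, 0.7, 0.2]
actual = [0.0, 1.0, 0.0]
print(cross_entropy(predicted, actual))  # = -log(0.7) ≈ 0.357
```

With a one-hot target, cross-entropy reduces to the negative log-probability of the correct class — the same quantity averaged and exponentiated to get perplexity.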
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
Token
The basic unit of text that language models process. A token can be a word, part of a word, or a punctuation mark. Text is broken into tokens before being fed into an LLM, and the model generates output one token at a time.