Artificial Intelligence

Benchmark Contamination

When a model's training data inadvertently includes test data from benchmarks, leading to inflated performance scores that do not reflect true capability.

Why It Matters

Benchmark contamination undermines the entire evaluation ecosystem. Models may appear to improve when they have simply memorized the test answers.

Example

A model scores 95% on a coding benchmark because the exact solutions appeared in its training data, but only 70% on truly novel problems; the 25-percentage-point gap is the contamination effect.
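One common heuristic for spotting this kind of leakage is to measure n-gram overlap between benchmark items and the training corpus. The sketch below is a minimal, illustrative version of that idea; the function names, the n-gram size, and the 0.5 flagging threshold are all assumptions, not a standard tool.

```python
# Minimal sketch of an n-gram overlap contamination check.
# Names, n-gram size, and threshold are illustrative assumptions.

def ngrams(text, n=4):
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(benchmark_item, training_corpus, n=4):
    """Fraction of the item's n-grams that also appear in the training corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Toy example: the benchmark solution leaked verbatim into the corpus.
corpus = ["def add(a, b): return a + b  # solution to benchmark problem 7"]
item = "def add(a, b): return a + b"
score = contamination_score(item, corpus)
flagged = score > 0.5  # a high overlap suggests the item leaked into training
```

Real decontamination pipelines work at a much larger scale (hashed n-grams, deduplication indexes), but the underlying signal is the same: verbatim or near-verbatim overlap between test items and training text.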

Think of it like...

Like a student who obtained the exam questions in advance: the high score reflects prior exposure to those exact questions, not actual mastery of the material.

Related Terms