Human Evaluation

Using human judges to assess AI model quality on subjective dimensions like helpfulness, coherence, creativity, and safety that automated metrics cannot fully capture.

Why It Matters

Human evaluation remains the gold standard for assessing LLM quality: a model can score well on automated benchmarks yet still feel unhelpful or unsafe to actual users.

Example

Having 500 raters compare responses from Model A and Model B across 1,000 questions, rating each response for helpfulness, accuracy, and safety on a 5-point scale.
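
To make the aggregation concrete, here is a minimal Python sketch of how raw ratings from such a study might be summarized into per-dimension mean scores and a head-to-head win rate. The record layout and the mean_scores / win_rate helpers are illustrative assumptions, not part of any standard tooling.

from collections import defaultdict
from statistics import mean

# Hypothetical export format: one row per judgment.
# (model, question_id, rater_id, dimension, score on a 1-5 scale)
ratings = [
    ("model_a", "q1", "r1", "helpfulness", 4),
    ("model_a", "q1", "r2", "helpfulness", 5),
    ("model_b", "q1", "r1", "helpfulness", 3),
    ("model_b", "q1", "r2", "helpfulness", 4),
    ("model_a", "q2", "r1", "helpfulness", 3),
    ("model_b", "q2", "r1", "helpfulness", 3),
]

def mean_scores(records):
    """Average rating per (model, dimension), pooled over raters and questions."""
    buckets = defaultdict(list)
    for model, _q, _rater, dimension, score in records:
        buckets[(model, dimension)].append(score)
    return {key: mean(vals) for key, vals in buckets.items()}

def win_rate(records, dimension, x="model_a", y="model_b"):
    """Fraction of questions where x's mean rating on `dimension` beats y's.
    Ties count as half a win; questions rated for only one model are skipped."""
    per_q = defaultdict(lambda: defaultdict(list))
    for model, question, _rater, dim, score in records:
        if dim == dimension:
            per_q[question][model].append(score)
    wins, n = 0.0, 0
    for scores in per_q.values():
        if x in scores and y in scores:
            n += 1
            mx, my = mean(scores[x]), mean(scores[y])
            wins += 1.0 if mx > my else 0.5 if mx == my else 0.0
    return wins / n if n else float("nan")

print(mean_scores(ratings))              # e.g. {('model_a', 'helpfulness'): 4.0, ...}
print(win_rate(ratings, "helpfulness"))  # 0.75: one win, one tie over two questions

A full study would also report inter-rater agreement and confidence intervals before drawing conclusions, but the aggregation skeleton is the same.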

Think of it like...

Like restaurant reviews from actual diners versus food safety inspection scores: the inspection numbers tell one story, but the real dining experience tells another.
