Retrieval Evaluation
Methods for measuring how well a retrieval system finds relevant documents. Key metrics include recall at K (recall@K), mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG).
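The three metrics above can be sketched in a few lines of plain Python. This is a minimal illustration with binary relevance and hypothetical document IDs, not a production implementation:

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result; 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Normalized discounted cumulative gain with binary (0/1) relevance."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked results for one query
relevant = {"d1", "d4"}                      # ground-truth docs for that query

print(recall_at_k(retrieved, relevant, 5))   # 1.0: both relevant docs are in the top 5
print(reciprocal_rank(retrieved, relevant))  # first relevant doc at rank 3 -> 1/3
```

MRR for a whole evaluation set is just the mean of `reciprocal_rank` over all queries; NDCG additionally rewards placing relevant documents near the top rather than merely inside the cutoff.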
Why It Matters
Retrieval evaluation is the overlooked half of RAG quality. You can have a perfect LLM, but if retrieval returns the wrong documents, the answers will be wrong.
Example
Testing a RAG system on 500 questions with known correct source documents, and measuring that it retrieves the right document in the top 5 results 85% of the time.
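That kind of measurement reduces to a top-k hit rate over the evaluation set. A minimal sketch, using a hypothetical structure that maps each question to its ranked results and its known correct document (a real set would hold hundreds of questions, e.g. the 500 above):

```python
def top_k_hit_rate(results, k=5):
    """Fraction of questions whose known source doc appears in the top-k results.

    `results` maps question id -> (ranked_doc_ids, correct_doc_id); this
    structure is a hypothetical stand-in for a real evaluation set.
    """
    hits = sum(1 for ranked, correct in results.values() if correct in ranked[:k])
    return hits / len(results)

# Toy evaluation set of 4 questions.
eval_set = {
    "q1": (["a", "b", "c", "d", "e"], "c"),   # hit at rank 3
    "q2": (["f", "g", "h", "i", "j"], "z"),   # miss: correct doc never retrieved
    "q3": (["k", "l", "m", "n", "o"], "k"),   # hit at rank 1
    "q4": (["p", "q", "r", "s", "t"], "s"),   # hit at rank 4
}
print(top_k_hit_rate(eval_set))  # 0.75: 3 of 4 questions hit in the top 5
```

A score of 0.85 on the 500-question set described above would mean 425 of the questions had their correct source document retrieved in the top 5.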
Think of it like...
Like grading a research assistant on whether they pulled the right files from the cabinet, before evaluating what they wrote with those files.
Related Terms
Retrieval-Augmented Generation
A technique that enhances LLM outputs by first retrieving relevant information from external knowledge sources and then using that information as context for generation. RAG combines the power of search with the fluency of language models.
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Precision
Of all the items the model predicted as positive, the proportion that were actually positive. Precision measures how trustworthy the model's positive predictions are.
Recall
Of all the actually positive items in the dataset, the proportion that the model correctly identified. Recall measures how completely the model finds all relevant items.
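The two definitions above differ only in the denominator, which a tiny set-based sketch makes concrete (the document IDs here are hypothetical):

```python
def precision_recall(predicted_positive, actual_positive):
    """Compute precision and recall from two sets of item IDs."""
    true_positives = predicted_positive & actual_positive
    precision = len(true_positives) / len(predicted_positive)  # of what was predicted
    recall = len(true_positives) / len(actual_positive)        # of what truly exists
    return precision, recall

predicted = {"d1", "d2", "d3", "d4"}   # items the model flagged as positive
actual = {"d1", "d3", "d5"}            # ground-truth positives
p, r = precision_recall(predicted, actual)
print(p)  # 0.5: 2 of the 4 predictions were correct
print(r)  # ~0.667: 2 of the 3 true positives were found
```

In retrieval terms, "predicted positive" is the set of returned documents and "actual positive" is the set of relevant ones, which is why recall@K is the recall of the top-K results.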
Semantic Search
Search that understands the meaning and intent behind a query rather than just matching keywords. It uses embeddings to find results that are conceptually related even if they use different words.
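At its core this means ranking documents by vector similarity rather than keyword overlap. A minimal sketch with hand-made 3-dimensional vectors standing in for real embeddings (a real system would produce high-dimensional vectors with an embedding model; the texts and numbers here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": conceptually similar texts get nearby vectors.
docs = {
    "refund policy": [0.90, 0.10, 0.00],
    "shipping times": [0.10, 0.80, 0.20],
    "returning an item": [0.85, 0.15, 0.05],
}
query_vec = [0.88, 0.12, 0.02]  # stands in for embedding "how do I get my money back?"

ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
print(ranked[0])  # "refund policy" ranks first despite sharing no keywords with the query
```

This is exactly the property keyword search lacks: the query and the top document match on meaning, not on shared words.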