How to Evaluate Retrieval Quality in RAG Pipelines
Precision@K, Recall@K, and F1@K Metrics

Introduction
Understanding and evaluating the retrieval mechanism in RAG (Retrieval-Augmented Generation) pipelines is crucial for building effective AI systems. This guide explores key metrics for evaluating retrieval performance.
Why Care About Retrieval Performance?
The goal is to evaluate how well embedding models and vector databases retrieve candidate text chunks. The key question: "Are the right documents in the top-k retrieved set?"
Binary vs Graded Relevance Measures
Binary Measures
- Characterize chunks as relevant or irrelevant
- No middle ground
- Simpler to implement and interpret
Graded Measures
- Assign relevance values on a spectrum
- More nuanced evaluation
- Better for complex retrieval tasks
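As a small illustration (the chunk IDs and labels below are made up, not from the guide's data), the two styles of judgment might be recorded like this:

# Binary judgments: each candidate chunk is simply relevant (1) or not (0)
binary_labels = {"chunk_017": 1, "chunk_042": 0, "chunk_133": 1}

# Graded judgments: relevance on a scale, e.g. 0 = irrelevant ... 3 = highly relevant
graded_labels = {"chunk_017": 3, "chunk_042": 0, "chunk_133": 1}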
Key Evaluation Situations
With binary measures, each candidate chunk falls into one of four categories:
- True Positive (TP): Retrieved and relevant ✓
- False Positive (FP): Retrieved but irrelevant ✗
- True Negative (TN): Not retrieved and not relevant ✓
- False Negative (FN): Not retrieved but relevant ✗
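Treating the top-k results and the ground-truth set as Python sets makes the first three categories concrete (chunk IDs are placeholders):

relevant = {"chunk_017", "chunk_133", "chunk_201"}          # ground truth for the query
retrieved_top_k = {"chunk_017", "chunk_042", "chunk_133"}   # what the retriever returned

true_positives = retrieved_top_k & relevant     # retrieved and relevant
false_positives = retrieved_top_k - relevant    # retrieved but irrelevant
false_negatives = relevant - retrieved_top_k    # relevant but missed
# True negatives are the (usually huge) remainder of the corpus and are rarely tracked.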
Order-Unaware Binary Measures
HitRate@K
Definition: Indicates whether at least one relevant result exists in top-k retrieved chunks.
Formula:
HitRate@K = 1 if (relevant docs in top-k > 0) else 0
Use Case: Simplest metric for basic retrieval validation
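Worked example: with k = 5, if at least one relevant chunk appears anywhere in the top 5 (say at rank 4), HitRate@5 = 1; if the first relevant chunk only shows up at rank 7, HitRate@5 = 0.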
Recall@K
Definition: The fraction of all relevant documents that appear within the top-k retrieved documents.
Formula:
Recall@K = (Relevant docs in top-k) / (Total relevant docs)
Characteristics:
- Ranges from 0 to 1
- Focus on coverage of the relevant results
- Answers: "Out of all relevant items, how many did we find?"
When to Use: When you need to find as many relevant results as possible, even if some irrelevant ones are included.
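Worked example: if 3 chunks are relevant to a query and 2 of them appear in the top 5 results, Recall@5 = 2 / 3 ≈ 0.67.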
Precision@K
Definition: The fraction of the top-k retrieved documents that are actually relevant.
Formula:
Precision@K = (Relevant docs in top-k) / k
Characteristics:
- Ranges from 0 to 1
- Focus on quality of retrieved results
- Answers: "Out of what we retrieved, how many are correct?"
When to Use: When quality matters more than quantity - you want high confidence in each result.
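Worked example (continuing from Recall@K): with 2 relevant chunks among the top 5 results, Precision@5 = 2 / 5 = 0.40.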
F1@K
Definition: Harmonic mean of Precision@K and Recall@K, balancing both metrics.
Formula:
F1@K = 2 * (Precision@K * Recall@K) / (Precision@K + Recall@K)
Characteristics:
- Ranges from 0 to 1
- High only when both precision and recall are high
- Balanced evaluation metric
When to Use: When you need both accurate AND comprehensive results.
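Worked example: combining the values above, F1@5 = 2 * 0.40 * 0.67 / (0.40 + 0.67) ≈ 0.50.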
Practical Example: War and Peace
Example using text embeddings indexed with FAISS for vector search. The retrieved documents are assumed to expose a page_content attribute, as LangChain Document objects do:
# Retrieval evaluation metrics
def hit_rate_at_k(retrieved_docs, ground_truth_texts, k):
    """Return True if at least one ground-truth snippet appears in the top-k docs (1/0 in the formula)."""
    for doc in retrieved_docs[:k]:
        if any(gt in doc.page_content for gt in ground_truth_texts):
            return True
    return False

def precision_at_k(retrieved_docs, ground_truth_texts, k):
    """Fraction of the top-k retrieved docs that contain at least one ground-truth snippet."""
    hits = sum(
        1 for doc in retrieved_docs[:k]
        if any(gt in doc.page_content for gt in ground_truth_texts)
    )
    return hits / k

def recall_at_k(retrieved_docs, ground_truth_texts, k):
    """Fraction of ground-truth snippets covered by at least one of the top-k retrieved docs."""
    matched = set()
    for i, gt in enumerate(ground_truth_texts):
        for doc in retrieved_docs[:k]:
            if gt in doc.page_content:
                matched.add(i)
                break
    return len(matched) / len(ground_truth_texts)

def f1_at_k(precision, recall):
    """Harmonic mean of Precision@K and Recall@K; defined as 0 when both are 0."""
    if (precision + recall) == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
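A minimal usage sketch, assuming the functions above plus a LangChain-style vector_store whose similarity_search(query, k=...) returns Document objects with a page_content attribute; the ground-truth snippets here are illustrative:

query = "Who is Anna Pávlovna?"
ground_truth_texts = [
    "Anna Pávlovna Schérer",   # illustrative ground-truth snippets
    "maid of honour",
]
k = 10

retrieved_docs = vector_store.similarity_search(query, k=k)   # assumed retriever call

precision = precision_at_k(retrieved_docs, ground_truth_texts, k)
recall = recall_at_k(retrieved_docs, ground_truth_texts, k)
print(f"Hit@{k}: {hit_rate_at_k(retrieved_docs, ground_truth_texts, k)}")
print(f"Precision@{k}: {precision:.2f}  Recall@{k}: {recall:.2f}  F1@{k}: {f1_at_k(precision, recall):.2f}")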
Implementation Tips
Ground Truth Creation
- Define test queries
- Identify truly relevant chunks for each query
- Create a ground truth dictionary mapping queries to relevant chunks (a sketch follows below)
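A minimal sketch of such a mapping; the queries and snippet fragments are illustrative, not the guide's actual test set:

# Each query maps to the text snippets that a correct retrieval should cover.
ground_truth = {
    "Who is Anna Pávlovna?": [
        "Anna Pávlovna Schérer",
        "maid of honour",
    ],
    "Where does the opening soirée take place?": [
        "St. Petersburg",
    ],
}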
Evaluation Process
- Run retrieval for test queries
- Calculate metrics for each query
- Average results across all queries (see the loop sketched after this list)
- Analyze patterns and failure modes
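A sketch of that loop, reusing the ground_truth mapping and metric functions above and assuming the same vector_store retriever; metrics are macro-averaged (mean over queries):

import statistics

k = 10
per_query_precision, per_query_recall = [], []

for query, gt_texts in ground_truth.items():
    docs = vector_store.similarity_search(query, k=k)   # assumed retriever call
    per_query_precision.append(precision_at_k(docs, gt_texts, k))
    per_query_recall.append(recall_at_k(docs, gt_texts, k))

mean_p = statistics.mean(per_query_precision)
mean_r = statistics.mean(per_query_recall)
print(f"Mean Precision@{k}: {mean_p:.2f}  Mean Recall@{k}: {mean_r:.2f}  F1@{k}: {f1_at_k(mean_p, mean_r):.2f}")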
Example Results
For query "Who is Anna Pávlovna?" with k=10:
- Hit@10: True (at least one relevant chunk found)
- Precision@10: 0.20 (2 out of 10 retrieved were relevant)
- Recall@10: 0.67 (found 2 out of 3 relevant chunks)
- F1@10: 0.31 (moderate performance due to low precision)
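Plugging the per-metric values into the F1 formula confirms the score: F1@10 = 2 * 0.20 * 0.67 / (0.20 + 0.67) ≈ 0.31.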
Best Practices
- Multiple Queries: Evaluate across diverse query types
- Varied K Values: Test with different k to understand trade-offs (see the sweep sketched after this list)
- Regular Evaluation: Monitor metrics over time as data changes
- Combine Metrics: Use multiple metrics for comprehensive view
- Domain-Specific: Adjust importance based on use case
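One way to act on the Varied K Values practice is a simple sweep over k, reusing the same assumed setup (ground_truth, vector_store, and the metric functions above):

for k in (1, 3, 5, 10, 20):
    precisions, recalls = [], []
    for query, gt_texts in ground_truth.items():
        docs = vector_store.similarity_search(query, k=k)   # assumed retriever call
        precisions.append(precision_at_k(docs, gt_texts, k))
        recalls.append(recall_at_k(docs, gt_texts, k))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    print(f"k={k:>2}  Precision@k={p:.2f}  Recall@k={r:.2f}  F1@k={f1_at_k(p, r):.2f}")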
Common Pitfalls
- Ignoring False Negatives: Missing relevant documents hurts recall
- Too Many False Positives: Retrieving irrelevant documents hurts precision
- Wrong K Value: Using k that's too small or too large for your use case
- Inadequate Ground Truth: Poor quality ground truth leads to misleading metrics
Conclusion
Effective retrieval is the foundation of successful RAG systems. By using metrics like Precision@K, Recall@K, and F1@K, you can:
- Quantify retrieval quality
- Identify areas for improvement
- Make data-driven decisions about system changes
- Ensure your RAG pipeline delivers relevant context to the LLM
Understanding these retrieval metrics is essential for building high-quality RAG systems that deliver accurate and relevant results.