How to Evaluate Retrieval Quality in RAG Pipelines
Precision@K, Recall@K, and F1@K Metrics

Introduction
Understanding and evaluating the retrieval mechanism in RAG (Retrieval-Augmented Generation) pipelines is crucial for building effective AI systems. This guide explores key metrics for evaluating retrieval performance.
Why Care About Retrieval Performance?
The goal is to evaluate how well embedding models and vector databases retrieve candidate text chunks. The key question: "Are the right documents in the top-k retrieved set?"
Binary vs Graded Relevance Measures
Binary Measures
- Characterize chunks as relevant or irrelevant
- No middle ground
- Simpler to implement and interpret
Graded Measures
- Assign relevance values on a spectrum
- More nuanced evaluation
- Better for complex retrieval tasks
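As a small illustration (the chunk IDs and labels below are made up, not from the guide's data), the two styles of judgment might be recorded like this:

# Binary judgments: each candidate chunk is simply relevant (1) or not (0)
binary_labels = {"chunk_017": 1, "chunk_042": 0, "chunk_133": 1}

# Graded judgments: relevance on a scale, e.g. 0 = irrelevant ... 3 = highly relevant
graded_labels = {"chunk_017": 3, "chunk_042": 0, "chunk_133": 1}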
Key Evaluation Situations
With binary measures, each candidate chunk falls into one of four categories:
- True Positive (TP): Retrieved and relevant ✓
- False Positive (FP): Retrieved but irrelevant ✗
- True Negative (TN): Not retrieved and not relevant ✓
- False Negative (FN): Not retrieved but relevant ✗
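Treating the top-k results and the ground-truth set as Python sets makes the first three categories concrete (chunk IDs are placeholders):

relevant = {"chunk_017", "chunk_133", "chunk_201"}          # ground truth for the query
retrieved_top_k = {"chunk_017", "chunk_042", "chunk_133"}   # what the retriever returned

true_positives = retrieved_top_k & relevant     # retrieved and relevant
false_positives = retrieved_top_k - relevant    # retrieved but irrelevant
false_negatives = relevant - retrieved_top_k    # relevant but missed
# True negatives are the (usually huge) remainder of the corpus and are rarely tracked.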
Order-Unaware Binary Measures
HitRate@K
Definition: Indicates whether at least one relevant result exists in top-k retrieved chunks.
Formula:
HitRate@K = 1 if (relevant docs in top-k > 0) else 0
Use Case: Simplest metric for basic retrieval validation
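Worked example: with k = 5, if at least one relevant chunk appears anywhere in the top 5 (say at rank 4), HitRate@5 = 1; if the first relevant chunk only shows up at rank 7, HitRate@5 = 0.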
Recall@K
Definition: The fraction of all relevant documents that appear within the top-k retrieved documents.
Formula:
Recall@K = (Relevant docs in top-k) / (Total relevant docs)
Characteristics:
- Ranges from 0 to 1
- Focus on coverage of the relevant results
- Answers: "Out of all relevant items, how many did we find?"
When to Use: When you need to find as many relevant results as possible, even if some irrelevant ones are included.
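Worked example: if 3 chunks are relevant to a query and 2 of them appear in the top 5 results, Recall@5 = 2 / 3 ≈ 0.67.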
Precision@K
Definition: The fraction of the top-k retrieved documents that are actually relevant.
Formula:
Precision@K = (Relevant docs in top-k) / k
Characteristics:
- Ranges from 0 to 1
- Focus on quality of retrieved results
- Answers: "Out of what we retrieved, how many are correct?"
When to Use: When quality matters more than quantity - you want high confidence in each result.
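Worked example (continuing from Recall@K): with 2 relevant chunks among the top 5 results, Precision@5 = 2 / 5 = 0.40.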
F1@K
Definition: Harmonic mean of Precision@K and Recall@K, balancing both metrics.
Formula:
F1@K = 2 * (Precision@K * Recall@K) / (Precision@K + Recall@K)
Characteristics:
- Ranges from 0 to 1
- High only when both precision and recall are high
- Balanced evaluation metric
When to Use: When you need both accurate AND comprehensive results.
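Worked example: combining the values above, F1@5 = 2 * 0.40 * 0.67 / (0.40 + 0.67) ≈ 0.50.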
Practical Example: War and Peace
Example using text embeddings indexed with FAISS for vector search. The retrieved documents are assumed to expose a page_content attribute, as LangChain Document objects do:
# Retrieval evaluation metrics
def hit_rate_at_k(retrieved_docs, ground_truth_texts, k):
    """Return True if at least one ground-truth snippet appears in the top-k docs (1/0 in the formula)."""
    for doc in retrieved_docs[:k]:
        if any(gt in doc.page_content for gt in ground_truth_texts):
            return True
    return False

def precision_at_k(retrieved_docs, ground_truth_texts, k):
    """Fraction of the top-k retrieved docs that contain at least one ground-truth snippet."""
    hits = sum(
        1 for doc in retrieved_docs[:k]
        if any(gt in doc.page_content for gt in ground_truth_texts)
    )
    return hits / k

def recall_at_k(retrieved_docs, ground_truth_texts, k):
    """Fraction of ground-truth snippets covered by at least one of the top-k retrieved docs."""
    matched = set()
    for i, gt in enumerate(ground_truth_texts):
        for doc in retrieved_docs[:k]:
            if gt in doc.page_content:
                matched.add(i)
                break
    return len(matched) / len(ground_truth_texts)

def f1_at_k(precision, recall):
    """Harmonic mean of Precision@K and Recall@K; defined as 0 when both are 0."""
    if (precision + recall) == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
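A minimal usage sketch, assuming the functions above plus a LangChain-style vector_store whose similarity_search(query, k=...) returns Document objects with a page_content attribute; the ground-truth snippets here are illustrative:

query = "Who is Anna Pávlovna?"
ground_truth_texts = [
    "Anna Pávlovna Schérer",   # illustrative ground-truth snippets
    "maid of honour",
]
k = 10

retrieved_docs = vector_store.similarity_search(query, k=k)   # assumed retriever call

precision = precision_at_k(retrieved_docs, ground_truth_texts, k)
recall = recall_at_k(retrieved_docs, ground_truth_texts, k)
print(f"Hit@{k}: {hit_rate_at_k(retrieved_docs, ground_truth_texts, k)}")
print(f"Precision@{k}: {precision:.2f}  Recall@{k}: {recall:.2f}  F1@{k}: {f1_at_k(precision, recall):.2f}")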
Implementation Tips
Ground Truth Creation
- Define test queries
- Identify truly relevant chunks for each query
- Create a ground truth dictionary mapping queries to relevant chunks (a sketch follows below)
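A minimal sketch of such a mapping; the queries and snippet fragments are illustrative, not the guide's actual test set:

# Each query maps to the text snippets that a correct retrieval should cover.
ground_truth = {
    "Who is Anna Pávlovna?": [
        "Anna Pávlovna Schérer",
        "maid of honour",
    ],
    "Where does the opening soirée take place?": [
        "St. Petersburg",
    ],
}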
Evaluation Process
- Run retrieval for test queries
- Calculate metrics for each query
- Average results across all queries (see the loop sketched after this list)
- Analyze patterns and failure modes
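A sketch of that loop, reusing the ground_truth mapping and metric functions above and assuming the same vector_store retriever; metrics are macro-averaged (mean over queries):

import statistics

k = 10
per_query_precision, per_query_recall = [], []

for query, gt_texts in ground_truth.items():
    docs = vector_store.similarity_search(query, k=k)   # assumed retriever call
    per_query_precision.append(precision_at_k(docs, gt_texts, k))
    per_query_recall.append(recall_at_k(docs, gt_texts, k))

mean_p = statistics.mean(per_query_precision)
mean_r = statistics.mean(per_query_recall)
print(f"Mean Precision@{k}: {mean_p:.2f}  Mean Recall@{k}: {mean_r:.2f}  F1@{k}: {f1_at_k(mean_p, mean_r):.2f}")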
Example Results
For query "Who is Anna Pávlovna?" with k=10:
- Hit@10: True (at least one relevant chunk found)
- Precision@10: 0.20 (2 out of 10 retrieved were relevant)
- Recall@10: 0.67 (found 2 out of 3 relevant chunks)
- F1@10: 0.31 (moderate performance due to low precision)
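Plugging the per-metric values into the F1 formula confirms the score: F1@10 = 2 * 0.20 * 0.67 / (0.20 + 0.67) ≈ 0.31.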
Best Practices
- Multiple Queries: Evaluate across diverse query types
- Varied K Values: Test with different k to understand trade-offs (see the sweep sketched after this list)
- Regular Evaluation: Monitor metrics over time as data changes
- Combine Metrics: Use multiple metrics for comprehensive view
- Domain-Specific: Adjust importance based on use case
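One way to act on the Varied K Values practice is a simple sweep over k, reusing the same assumed setup (ground_truth, vector_store, and the metric functions above):

for k in (1, 3, 5, 10, 20):
    precisions, recalls = [], []
    for query, gt_texts in ground_truth.items():
        docs = vector_store.similarity_search(query, k=k)   # assumed retriever call
        precisions.append(precision_at_k(docs, gt_texts, k))
        recalls.append(recall_at_k(docs, gt_texts, k))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    print(f"k={k:>2}  Precision@k={p:.2f}  Recall@k={r:.2f}  F1@k={f1_at_k(p, r):.2f}")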
Common Pitfalls
- Ignoring False Negatives: Missing relevant documents hurts recall
- Too Many False Positives: Retrieving irrelevant documents hurts precision
- Wrong K Value: Using k that's too small or too large for your use case
- Inadequate Ground Truth: Poor quality ground truth leads to misleading metrics
Conclusion
Effective retrieval is the foundation of successful RAG systems. By using metrics like Precision@K, Recall@K, and F1@K, you can:
- Quantify retrieval quality
- Identify areas for improvement
- Make data-driven decisions about system changes
- Ensure your RAG pipeline delivers relevant context to the LLM
Understanding these retrieval metrics is essential for building high-quality RAG systems that deliver accurate and relevant results.