
How to Evaluate Retrieval Quality in RAG Pipelines


Maria Mouschoutzi
October 16, 2025


Precision@k, Recall@k, and F1@k Metrics


Introduction

Understanding and evaluating the retrieval mechanism in RAG (Retrieval-Augmented Generation) pipelines is crucial for building effective AI systems. This guide explores key metrics for evaluating retrieval performance.

Why Care About Retrieval Performance?

The goal is to evaluate how well embedding models and vector databases retrieve candidate text chunks. The key question: "Are the right documents in the top-k retrieved set?"

Binary vs Graded Relevance Measures

Binary Measures

  • Characterize chunks as relevant or irrelevant
  • No middle ground
  • Simpler to implement and interpret

Graded Measures

  • Assign relevance values on a spectrum
  • More nuanced evaluation
  • Better for complex retrieval tasks

Key Evaluation Situations

With binary measures, every chunk falls into one of four cases, depending on whether it was retrieved and whether it is relevant:

  • True Positive (TP): Retrieved and relevant ✓
  • False Positive (FP): Retrieved but irrelevant ✗
  • True Negative (TN): Not retrieved and not relevant ✓
  • False Negative (FN): Not retrieved but relevant ✗
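The four cases can be computed directly with set operations over chunk IDs. A minimal sketch with a made-up corpus:

```python
# Toy chunk IDs: what the retriever returned vs. what is truly relevant.
retrieved = {1, 2, 3}
relevant = {2, 3, 5}
all_chunks = {1, 2, 3, 4, 5}

tp = retrieved & relevant                  # retrieved and relevant -> {2, 3}
fp = retrieved - relevant                  # retrieved but irrelevant -> {1}
fn = relevant - retrieved                  # relevant but not retrieved -> {5}
tn = all_chunks - (retrieved | relevant)   # neither -> {4}
print(tp, fp, fn, tn)
```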

Order-Unaware Binary Measures

HitRate@K

Definition: Indicates whether at least one relevant result appears in the top-k retrieved chunks.

Formula:

HitRate@K = 1 if (relevant docs in top-k > 0) else 0

Use Case: Simplest metric for basic retrieval validation

Recall@K

Definition: How many of the relevant documents appear within the top-k retrieved documents.

Formula:

Recall@K = (Relevant docs in top-k) / (Total relevant docs)

Characteristics:

  • Ranges from 0 to 1
  • Focus on quantity of retrieved results
  • Answers: "Out of all relevant items, how many did we find?"

When to Use: When you need to find as many relevant results as possible, even if some irrelevant ones are included.

Precision@K

Definition: How many of the top-k retrieved documents are actually relevant.

Formula:

Precision@K = (Relevant docs in top-k) / k

Characteristics:

  • Ranges from 0 to 1
  • Focus on quality of retrieved results
  • Answers: "Out of what we retrieved, how many are correct?"

When to Use: When quality matters more than quantity and you want high confidence in each result.

F1@K

Definition: Harmonic mean of Precision@K and Recall@K, balancing both metrics.

Formula:

F1@K = 2 * (Precision@K * Recall@K) / (Precision@K + Recall@K)

Characteristics:

  • Ranges from 0 to 1
  • High only when both precision and recall are high
  • Balanced evaluation metric

When to Use: When you need both accurate AND comprehensive results.
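As a quick end-to-end check of all four formulas, here is a toy computation, assuming 5 retrieved chunks of which 2 are relevant, out of 3 relevant chunks in the whole corpus:

```python
retrieved_relevant = [True, False, True, False, False]  # relevance of each top-k chunk, in rank order
k = 5
total_relevant = 3  # relevant chunks in the whole corpus

hit = 1 if any(retrieved_relevant[:k]) else 0       # HitRate@5 = 1
precision = sum(retrieved_relevant) / k             # 2/5 = 0.4
recall = sum(retrieved_relevant) / total_relevant   # 2/3
f1 = 2 * precision * recall / (precision + recall)  # (8/15) / (16/15) = 0.5
print(hit, precision, recall, f1)
```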

Practical Example: War and Peace

Example using an embedding model and FAISS for vector search (the helpers below assume LangChain-style documents exposing a page_content attribute):

# Retrieval evaluation metrics.
# retrieved_docs: ranked documents, each with a .page_content string.
# ground_truth_texts: snippets that any relevant chunk must contain.

def hit_rate_at_k(retrieved_docs, ground_truth_texts, k):
    """True if at least one of the top-k documents is relevant."""
    for doc in retrieved_docs[:k]:
        if any(gt in doc.page_content for gt in ground_truth_texts):
            return True
    return False

def precision_at_k(retrieved_docs, ground_truth_texts, k):
    """Fraction of the top-k documents that are relevant."""
    hits = sum(1 for doc in retrieved_docs[:k]
               if any(gt in doc.page_content for gt in ground_truth_texts))
    return hits / k

def recall_at_k(retrieved_docs, ground_truth_texts, k):
    """Fraction of the ground-truth snippets found in the top-k documents."""
    matched = set()
    for i, gt in enumerate(ground_truth_texts):
        for doc in retrieved_docs[:k]:
            if gt in doc.page_content:
                matched.add(i)
                break
    return len(matched) / len(ground_truth_texts)

def f1_at_k(precision, recall):
    """Harmonic mean of Precision@K and Recall@K."""
    if (precision + recall) == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

Implementation Tips

Ground Truth Creation

  1. Define test queries
  2. Identify truly relevant chunks for each query
  3. Create ground truth dictionary mapping queries to relevant chunks
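Step 3 might look like the following, a hypothetical mapping for War and Peace (queries and snippets are illustrative):

```python
# Hypothetical ground truth: each test query maps to the text snippets
# that any relevant chunk must contain.
ground_truth = {
    "Who is Anna Pávlovna?": [
        "Anna Pávlovna Schérer",
        "maid of honour",
    ],
    "What language is spoken in the salon?": [
        "French",
    ],
}
print(len(ground_truth), "queries")
```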

Evaluation Process

  1. Run retrieval for test queries
  2. Calculate metrics for each query
  3. Average results across all queries
  4. Analyze patterns and failure modes
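A sketch of that loop, with a stand-in retriever in place of a real FAISS search (corpus, query, and snippets are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    page_content: str

# Stand-in for a real retriever such as a FAISS similarity search.
def retrieve(query, k):
    corpus = [
        Doc("Anna Pávlovna Schérer was a maid of honour of the Empress."),
        Doc("Prince Vasíli was the first to arrive."),
        Doc("The soirée took place in July 1805."),
    ]
    return corpus[:k]

ground_truth = {"Who is Anna Pávlovna?": ["Anna Pávlovna Schérer", "maid of honour"]}

k = 3
precisions, recalls = [], []
for query, gts in ground_truth.items():
    docs = retrieve(query, k)
    hits = sum(1 for d in docs if any(g in d.page_content for g in gts))
    found = sum(1 for g in gts if any(g in d.page_content for d in docs))
    precisions.append(hits / k)       # Precision@k for this query
    recalls.append(found / len(gts))  # Recall@k for this query

print(sum(precisions) / len(precisions), sum(recalls) / len(recalls))
```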

Example Results

For query "Who is Anna Pávlovna?" with k=10:

  • Hit@10: True (at least one relevant chunk found)
  • Precision@10: 0.20 (2 out of 10 retrieved were relevant)
  • Recall@10: 0.67 (found 2 out of 3 relevant chunks)
  • F1@10: 0.31 (moderate performance due to low precision)
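These numbers are internally consistent; the F1@10 value follows directly from the precision and recall above:

```python
precision, recall = 0.20, 2 / 3  # 2 of 10 retrieved are relevant; 2 of 3 relevant found
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.31
```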

Best Practices

  1. Multiple Queries: Evaluate across diverse query types
  2. Varied K Values: Test with different k to understand trade-offs
  3. Regular Evaluation: Monitor metrics over time as data changes
  4. Combine Metrics: Use multiple metrics for comprehensive view
  5. Domain-Specific: Adjust importance based on use case
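Practice #2 can be illustrated with a small sweep over k, assuming fixed per-rank relevance labels for a single query: as k grows, precision tends to fall while recall rises.

```python
# Relevance of each retrieved chunk, in rank order (made-up labels).
relevant_at_rank = [True, False, True, False, False, True, False, False, False, False]
total_relevant = 4  # relevant chunks in the whole corpus

results = {}
for k in (1, 3, 5, 10):
    hits = sum(relevant_at_rank[:k])
    results[k] = (hits / k, hits / total_relevant)  # (Precision@k, Recall@k)
    print(k, results[k])
```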

Common Pitfalls

  • Ignoring False Negatives: Missing relevant documents hurts recall
  • Too Many False Positives: Retrieving irrelevant documents hurts precision
  • Wrong K Value: Using k that's too small or too large for your use case
  • Inadequate Ground Truth: Poor quality ground truth leads to misleading metrics

Conclusion

Effective retrieval is the foundation of successful RAG systems. By using metrics like Precision@K, Recall@K, and F1@K, you can:

  • Quantify retrieval quality
  • Identify areas for improvement
  • Make data-driven decisions about system changes
  • Ensure your RAG pipeline delivers relevant context to the LLM

Understanding retrieval metrics is essential for building high-quality RAG systems that deliver accurate and relevant results.