Step 5 (retrieval). How to debug retrieval quality

This page describes how to identify the root cause of retrieval problems. Use this page when root cause analysis identifies Improve Retrieval as the root cause.

Retrieval quality is arguably the most important component of a RAG application. If the most relevant chunks are not returned for a given query, the LLM does not have access to the necessary information to generate a high-quality response. Poor retrieval can lead to irrelevant, incomplete, or hallucinated output. This step requires manual effort to analyze the underlying data. Mosaic AI Agent Framework makes troubleshooting much easier through its tight integration between the data platform (including Unity Catalog and Vector Search) and experiment tracking with MLflow (including LLM evaluation and MLflow Tracing).

Instructions

Follow these steps to address retrieval quality issues:

  1. Open the B_quality_iteration/01_root_cause_quality_issues notebook.
  2. Use the queries to load MLflow traces of the records that had retrieval quality issues.
  3. For each record, manually examine the retrieved chunks. If available, compare them to the ground-truth retrieval documents.
  4. Look for patterns or common issues among the queries with low retrieval quality. For example:
    • Relevant information is missing from the vector database entirely.
    • Insufficient number of chunks or documents returned for a retrieval query.
    • Chunks are too small and lack sufficient context.
    • Chunks are too large and contain multiple, unrelated topics.
    • The embedding model fails to capture semantic similarity for domain-specific terms.
  5. Based on the identified issue, hypothesize potential root causes and corresponding fixes. For guidance, see Common reasons for poor retrieval quality.
  6. Follow the steps in implement and evaluate changes to implement and evaluate a potential fix. This might involve modifying the data pipeline (for example, adjusting chunk size or trying a different embedding model) or modifying the RAG chain (for example, implementing hybrid search or retrieving more chunks).
  7. If retrieval quality is still not satisfactory, repeat steps 5 and 6 for the next most promising fixes until the desired performance is achieved.
  8. Re-run the root cause analysis to determine if the overall chain has any additional root causes that should be addressed.
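
When ground-truth retrieval documents are available, the manual comparison in step 3 can be complemented by a simple score that shows which records to inspect first. The following is a minimal sketch, not part of the notebook: the record fields (`request_id`, `retrieved_doc_ids`, `expected_doc_ids`) and both function names are hypothetical, and the document IDs are assumed to have already been extracted from the traces and the evaluation set.

```python
from typing import Dict, List, Sequence


def recall_at_k(retrieved_ids: Sequence[str], expected_ids: Sequence[str], k: int) -> float:
    """Fraction of expected (ground-truth) documents found in the top-k retrieved results."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(expected & top_k) / len(expected)


def flag_low_recall(records: List[Dict], k: int = 5, threshold: float = 0.5) -> List[Dict]:
    """Return records whose recall@k is below the threshold, worst first."""
    flagged = []
    for record in records:
        score = recall_at_k(record["retrieved_doc_ids"], record["expected_doc_ids"], k)
        if score < threshold:
            flagged.append({"request_id": record["request_id"], "recall_at_k": score})
    return sorted(flagged, key=lambda r: r["recall_at_k"])


# Hypothetical records; in practice these fields would come from your
# traces and evaluation set.
records = [
    {"request_id": "q1", "retrieved_doc_ids": ["a", "b", "c"], "expected_doc_ids": ["a", "c"]},
    {"request_id": "q2", "retrieved_doc_ids": ["x", "y", "z"], "expected_doc_ids": ["a", "y", "w", "v"]},
    {"request_id": "q3", "retrieved_doc_ids": ["m", "n"], "expected_doc_ids": ["p"]},
]

low = flag_low_recall(records, k=3, threshold=0.5)
for item in low:
    print(item["request_id"], item["recall_at_k"])
```

A score like this only narrows down where to look. Low recall at a small `k` that improves at a larger `k` may suggest retrieving more chunks, while recall that stays low at any `k` may point to missing documents or an embedding mismatch.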

Common reasons for poor retrieval quality

The following table lists debugging steps and potential fixes for common retrieval issues. Fixes are categorized by component:

  • Data pipeline
  • Chain config
  • Chain code

The component defines which steps you should follow in the implement and evaluate changes step.

| Retrieval issue | Debugging steps | Potential fix |
|---|---|---|
| Chunks are too small | Examine chunks for incomplete or cut-off information. | Data pipeline: Increase chunk size or overlap. |
| | | Data pipeline: Try a different chunking strategy. |
| Chunks are too large | Check if retrieved chunks contain multiple, unrelated topics. | Data pipeline: Decrease chunk size. |
| | | Data pipeline: Improve chunking strategy to avoid mixture of unrelated topics (for example, semantic chunking). |
| Chunks don’t have enough information about the text from which they were taken | Assess if the lack of context for each chunk is causing confusion or ambiguity in the retrieved results. | Data pipeline: Try adding metadata and titles to each chunk (for example, section titles). |
| | | Chain config: Retrieve more chunks, and use an LLM with larger context size. |
| Embedding model doesn’t accurately understand the domain or key phrases in user queries | Check if semantically similar chunks are being retrieved for the same query. | Data pipeline: Try different embedding models. |
| | | Chain config: Try hybrid search. |
| | | Chain code: Over-fetch retrieval results, and re-rank. Only feed top re-ranked results into the LLM context. |
| | | Data pipeline: Fine-tune embedding model on domain-specific data. |
| Relevant information missing from the vector database | Check if any relevant documents or sections are missing from the vector database. | Data pipeline: Add more relevant documents to the vector database. |
| | | Data pipeline: Improve document parsing and metadata extraction. |
| Retrieval queries are poorly formulated | If user queries are being directly used for semantic search, analyze these queries and check for ambiguity or lack of specificity. This can happen easily in multi-turn conversations where the raw user query references previous parts of the conversation, making it unsuitable to use directly as a retrieval query. | Chain code: Add query expansion or transformation approaches (for example, given a user query, transform the query prior to semantic search). |
| | Check if query terms match terminology used in the search corpus. | Chain code: Add query understanding to identify intent and entities (for example, use an LLM to extract properties to use in metadata filtering). |
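
To illustrate the "increase chunk size or overlap" fix for chunks that are too small, the sketch below splits text into fixed-size chunks that share a configurable number of words with their neighbors. It is an assumption-laden example rather than the data pipeline's actual chunker: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words, whereas a production pipeline would typically count tokens of the embedding model.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words, so a sentence cut at a chunk
    boundary still appears with some of its surrounding context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
without_overlap = chunk_words(sample, chunk_size=4, overlap=0)
with_overlap = chunk_words(sample, chunk_size=4, overlap=2)
print(without_overlap)
print(with_overlap)
```

Increasing the overlap produces more chunks for the same text, which raises indexing cost and can return near-duplicate chunks at query time, so any change should be checked against the evaluation set rather than assumed to help.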

Next step

If you also identified issues with generation quality, continue with Step 5 (generation). How to debug generation quality.

If you think that you have resolved all of the identified issues, continue with Step 6. Iteratively implement & evaluate quality fixes.