Why does changing temperature change the documents retrieved by RAG, and why do those documents vary when making the same request multiple times?

Bao, Jeremy (Cognizant) 100 Reputation points
2024-04-10T17:46:09.8866667+00:00

In RAG, you simply pass the user's input to Azure AI Search, which fetches relevant document chunks, right? I have tried sending the same request multiple times, with an identical system prompt and message history, but different document chunks were retrieved each time. Changing the temperature also seems to affect which chunks are retrieved. Why would the document chunks vary when no changes are made, or when only the temperature is changed and all other inputs are kept constant?

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

Accepted answer
  1. Robert Lee 85 Reputation points Microsoft Employee
    2024-04-12T17:10:37.0866667+00:00

    Hi Jeremy, great question. To ensure we're talking about the same RAG, does your LLM model generate queries to issue to the search index, or does another information retrieval component take the user query and send it verbatim to the search index (before the LLM sees it) and then send both the query and response to the LLM to summarize?

    i.e., is the LLM model tasked with generating or rewriting the user query to be a better query? Based on where you state "identical system prompt and message history", it suggests to me this is the scenario you are using, where the LLM can create search queries based on the message history and user inputs.

    To improve debugging, I recommend you try your queries after decoupling your RAG application from the search queries you issue to the search index. That way, you reduce the number of variables to help determine if the search index is consistent.

    If you're using the first scenario, where the LLM is prompted with something like "you have access to a search index; you can only issue search queries to the search index and use the information to answer the user's question and provide references", then the LLM will likely generate slightly different search queries on consecutive runs, even at a constant temperature, because LLM completion has some innate variability. These queries can retrieve different results from the index because they are not identical. If you use approaches that better capture the semantic meaning of the queries, such as vector search or hybrid search, you may see reduced variability in the search responses.

    Increasing the temperature of the LLM decreases the influence of higher-probability tokens when the LLM predicts the next token, allowing it to make more varied choices and generate more "creative" responses. This introduces more variability into the search queries the LLM generates, which in turn makes the search results more varied.
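To see why temperature changes the generated queries, here is a minimal, self-contained sketch (not Azure-specific, with made-up logits) of how temperature reshapes a token probability distribution during sampling. Higher temperature flattens the distribution, so lower-probability tokens are chosen more often and the generated search queries vary more between runs.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.5]

low = apply_temperature(logits, 0.2)   # sharply peaked: near-deterministic
high = apply_temperature(logits, 2.0)  # flatter: more varied sampling

# The top token dominates at low temperature, while at high temperature
# the tail tokens get a meaningful share of the probability mass.
print(low[0] > high[0], high[3] > low[3])
```

At temperature 0.2 the top token takes essentially all of the probability mass, so the model regenerates nearly the same query every time; at 2.0 the tail tokens become plausible choices, so the query wording drifts between runs.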

    If it's the second scenario, then decouple the responses provided by the initial search from the LLM component. Issue the same query repeatedly against the search index and observe if the results vary significantly. If they do, there may be some tuning you need to perform on the search index. Some examples:
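The repeat-the-query test above can be scripted. This is a small harness assuming a hypothetical `run_search` callable that takes a query string and returns a ranked list of document IDs (e.g. a thin wrapper around your Azure AI Search client); the function name and the stand-in below are illustrative, not part of any SDK. It reissues the same query and reports whether the ranked result lists agree across runs, isolating index consistency from the LLM entirely.

```python
from collections import Counter

def check_consistency(run_search, query, runs=5):
    """Issue the same query repeatedly; count distinct ranked result lists."""
    results = [tuple(run_search(query)) for _ in range(runs)]
    counts = Counter(results)
    return len(counts) == 1, counts

# Deterministic stand-in for the real search call, for demonstration only.
def fake_search(query):
    return ["doc-3", "doc-1", "doc-7"]

stable, counts = check_consistency(fake_search, "quarterly revenue", runs=5)
print(stable)  # True here; a real index may show several distinct orderings
```

If `stable` comes back `False` against your real index, inspect `counts` to see how many distinct orderings appeared, which points at the index-side causes below rather than the LLM.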

    • Due to the sharded nature of search indexes in multi-replica search clusters, consistency can be influenced by different replicas containing different distributions of documents in given shards. This affects the statistics of the search operation, since scoring functions like BM25 depend on statistics such as term frequency, document frequency, document length, and term-frequency saturation, and these statistics depend on the exact set of documents in a given shard. Setting the sessionId parameter on the request tells the search engine to try to route the request to the same replicas that served the initial request (but it's a best-effort attempt, not a guarantee).
    • Consistency can also be influenced by active indexing operations, since there is a slight lag between indexing operations and their application across all nodes within the search cluster. This is why distributed indexes are often "eventually consistent".
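For the sessionId mechanism, here is a hedged sketch of the request body you would POST to the search index's docs/search endpoint; the query text and session value are placeholders you choose yourself, and you reuse the same sessionId across related requests so they land on the same replica where possible.

```python
import json

def build_search_request(query, session_id):
    """Body for POST {endpoint}/indexes/{index}/docs/search (placeholders)."""
    return {
        "search": query,
        "sessionId": session_id,  # best-effort routing to the same replica
        "top": 5,
    }

body = build_search_request("quarterly revenue", "user-42-session")
print(json.dumps(body))
```

Any stable string works as the session value; a per-user or per-conversation identifier is a common choice so that one user's repeated queries score against a consistent set of shard statistics.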
    2 people found this answer helpful.

0 additional answers
