Hi Jeremy, great question. To make sure we're talking about the same RAG architecture: does your LLM generate the queries issued to the search index, or does a separate information retrieval component take the user query, send it verbatim to the search index (before the LLM sees it), and then pass both the query and the results to the LLM to summarize?
In other words, is the LLM tasked with generating or rewriting the user query into a better search query? Your statement "identical system prompt and message history" suggests this is your scenario: the LLM creates search queries based on the message history and user inputs.
To simplify debugging, I recommend decoupling the search queries from the rest of your RAG application and issuing them directly against the search index. That reduces the number of variables and helps you determine whether the search index itself is returning consistent results.
If you're using the first scenario, where the LLM is prompted with something like "you have access to a search index; you can issue search queries to it and use the information to answer the user's question and provide references", then the LLM will likely generate slightly different search queries on consecutive runs, even at a fixed temperature, because LLM completion has some inherent variability. Because the queries are not identical, they can retrieve different results from the index. Approaches that better capture the semantic meaning of queries, such as vector search or hybrid search, may reduce the variability in the search responses. Increasing the temperature flattens the model's next-token probability distribution, reducing the dominance of the highest-probability tokens, so the LLM makes more varied choices and generates more "creative" responses. That introduces more variability into the generated search queries, which in turn makes the search results more varied.
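To illustrate why temperature affects variability, here is a minimal sketch of temperature-scaled softmax over toy logits (the logit values are made up for illustration, not from a real model). Higher temperature flattens the distribution, so lower-probability tokens get sampled more often:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to a probability distribution, dividing by
    temperature first. Higher temperature flattens the distribution;
    lower temperature sharpens it toward the top token."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more varied sampling

print(f"T=0.5 top-token probability: {cold[0]:.3f}")
print(f"T=2.0 top-token probability: {hot[0]:.3f}")
```

With the flatter high-temperature distribution, sampling picks non-top tokens more often, which is exactly the "more varied search queries" effect described above.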
If it's the second scenario, decouple the initial search results from the LLM component: issue the same query repeatedly against the search index and observe whether the results vary significantly. If they do, the search index itself may need some tuning. Some examples:
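One simple way to quantify that drift is to compare the top-k document IDs across repeated runs of the same query. This is a minimal sketch using hypothetical document IDs; substitute the ID lists returned by your own index:

```python
def ranked_overlap(run_a, run_b, k=10):
    """Jaccard overlap of the top-k document IDs from two search runs.
    1.0 means identical result sets; lower values indicate drift."""
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

# Hypothetical document-ID lists returned by two runs of the same query.
run_1 = ["doc3", "doc7", "doc1", "doc9", "doc4"]
run_2 = ["doc3", "doc1", "doc7", "doc2", "doc4"]

print(f"top-5 overlap: {ranked_overlap(run_1, run_2, k=5):.2f}")
```

If the overlap stays near 1.0 across many runs, the index is consistent and the variability is coming from the LLM side; if it drops, look at the index-level causes below.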
- Due to the sharded nature of search indexes, consistency in multi-replica search clusters can be influenced by different replicas holding different distributions of documents across shards. This affects the statistics of the search operation: scoring functions like BM25 depend on term frequency, document frequency, document length, and term-frequency saturation, and those statistics depend on the exact set of documents in a given shard. Setting the `sessionId` parameter, as defined here, tells the search engine to try to route the request to the same replicas that served the initial request (a best-effort attempt, not a guarantee).
- Consistency can also be influenced by active indexing operations, since there is a slight lag between an indexing operation and that operation being applied across all nodes in the search cluster. This is why distributed indexes are often "eventually consistent".
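To pin repeated test queries to the same replicas, you can attach the `sessionId` to each request. Here is a hedged sketch that only builds the request body (the field name follows the Azure AI Search REST API; the query text and session value are made up, and you would still need to POST the body to your service's search endpoint):

```python
def build_search_request(query, session_id):
    """Build a search request body whose sessionId asks the service to
    route repeated requests to the same replicas (best effort only).
    'sessionId' is the Azure AI Search REST field name; adjust if your
    search service uses a different parameter."""
    return {
        "search": query,
        "sessionId": session_id,
    }

# Reuse one session value across all runs of the consistency test.
body = build_search_request("consistent ranking test", "debug-session-001")
print(body)
```

Issuing the same query several times with a fixed `sessionId`, versus without one, helps you separate replica-routing effects from genuine index inconsistency.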