Hi Jeremy, great question. To make sure we're talking about the same RAG architecture: does your LLM generate the queries issued to the search index, or does a separate information retrieval component take the user query, send it verbatim to the search index (before the LLM sees it), and then pass both the query and the results to the LLM to summarize?
In other words, is the LLM tasked with generating or rewriting the user query into a better search query? Your statement "identical system prompt and message history" suggests this is your scenario: the LLM creates search queries based on the message history and user inputs.
To simplify debugging, I recommend decoupling the search queries from the rest of your RAG application and issuing them directly against the search index. That reduces the number of variables and helps you determine whether the search index itself is returning consistent results.
If you're using the first scenario, where the LLM is prompted with something like "you have access to a search index; you can issue search queries to it and use the information to answer the user's question and provide references", then the LLM will likely generate slightly different search queries on consecutive runs, even at a fixed temperature, because LLM completion has some inherent variability. Because the queries are not identical, they can retrieve different results from the index. Approaches that better capture the semantic meaning of queries, such as vector search or hybrid search, may reduce the variability in the search responses. Increasing the temperature flattens the model's next-token probability distribution, reducing the dominance of the highest-probability tokens, so the LLM makes more varied choices and generates more "creative" responses. That introduces more variability into the generated search queries, which in turn makes the search results more varied.
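To illustrate why temperature affects variability, here is a minimal sketch of temperature-scaled softmax over toy logits (the logit values are made up for illustration, not from a real model). Higher temperature flattens the distribution, so lower-probability tokens get sampled more often:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to a probability distribution, dividing by
    temperature first. Higher temperature flattens the distribution;
    lower temperature sharpens it toward the top token."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more varied sampling

print(f"T=0.5 top-token probability: {cold[0]:.3f}")
print(f"T=2.0 top-token probability: {hot[0]:.3f}")
```

With the flatter high-temperature distribution, sampling picks non-top tokens more often, which is exactly the "more varied search queries" effect described above.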
If it's the second scenario, decouple the initial search results from the LLM component: issue the same query repeatedly against the search index and observe whether the results vary significantly. If they do, the search index itself may need some tuning. Some examples:
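One simple way to quantify that drift is to compare the top-k document IDs across repeated runs of the same query. This is a minimal sketch using hypothetical document IDs; substitute the ID lists returned by your own index:

```python
def ranked_overlap(run_a, run_b, k=10):
    """Jaccard overlap of the top-k document IDs from two search runs.
    1.0 means identical result sets; lower values indicate drift."""
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

# Hypothetical document-ID lists returned by two runs of the same query.
run_1 = ["doc3", "doc7", "doc1", "doc9", "doc4"]
run_2 = ["doc3", "doc1", "doc7", "doc2", "doc4"]

print(f"top-5 overlap: {ranked_overlap(run_1, run_2, k=5):.2f}")
```

If the overlap stays near 1.0 across many runs, the index is consistent and the variability is coming from the LLM side; if it drops, look at the index-level causes below.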
- Due to the sharded nature of search indexes, consistency in multi-replica search clusters can be influenced by different replicas holding different distributions of documents across shards. This affects the statistics of the search operation: scoring functions like BM25 depend on term frequency, document frequency, document length, and term-frequency saturation, and those statistics depend on the exact set of documents in a given shard. Setting the `sessionId` parameter, as defined here, tells the search engine to try to route the request to the same replicas that served the initial request (a best-effort attempt, not a guarantee).
- Consistency can also be influenced by active indexing operations, since there is a slight lag between an indexing operation and that operation being applied across all nodes in the search cluster. This is why distributed indexes are often "eventually consistent".
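To pin repeated test queries to the same replicas, you can attach the `sessionId` to each request. Here is a hedged sketch that only builds the request body (the field name follows the Azure AI Search REST API; the query text and session value are made up, and you would still need to POST the body to your service's search endpoint):

```python
def build_search_request(query, session_id):
    """Build a search request body whose sessionId asks the service to
    route repeated requests to the same replicas (best effort only).
    'sessionId' is the Azure AI Search REST field name; adjust if your
    search service uses a different parameter."""
    return {
        "search": query,
        "sessionId": session_id,
    }

# Reuse one session value across all runs of the consistency test.
body = build_search_request("consistent ranking test", "debug-session-001")
print(body)
```

Issuing the same query several times with a fixed `sessionId`, versus without one, helps you separate replica-routing effects from genuine index inconsistency.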