Persistent Limitations in Azure OpenAI for Multi-Document RAG Analysis — Follow-Up on Assistant & Agent Recommendations

Vishav Singh | 20 Reputation points
2025-05-28T05:25:25.66+00:00

Hello MS support team,

**Issue Summary:** We were building a large-scale feedback analysis solution using Azure OpenAI + Cognitive Search (RAG architecture) to extract insights from ~10,000 structured survey documents.

While the MS Support team's previous responses recommended migrating from Chat Completion RAG to Azure OpenAI Assistant, and more recently to AI Agent, we continue to face critical limitations that affect accuracy, reliability, and scalability.


What We’ve Implemented Based on Your Recommendations

1. Azure OpenAI Assistant: We tested Assistant-based retrieval and agree it delivers better response quality compared to the Chat Completion RAG model.

However, several blocking issues persist:

🔴 File Size and Token Limitations: We are forced to build custom logic to validate and truncate content at the line level before file ingestion. Failing to do so causes indexing failures and breaks search accuracy.

🔴 Vector Store Capacity Limit: The current 10,000-file cap per vector store is not scalable for our needs. Since an Assistant can link to only one vector store, creating and managing multiple stores is inefficient and impractical for projects with large datasets and frequent updates.

2. AI Agent (Preview): The support team subsequently recommended the Agent feature because it can link to indexes. However, it has the following issues:

🔴 Search Scope Issue: The Agent appears to return information beyond the linked index content — for example, answering with generalized or fabricated data when it should be restricted to the linked documents.

➤ Example: When asked, “How many questions do we have in Japan that has parking space?” — the Agent responded with "thousands," even though the index contains only one document with this detail. (We solved this by giving the Agent explicit instructions.)

3. LLM Context Limitations Still Persist: Even after linking an index (Agent), the LLM still appears to use only a very small portion of the retrieved documents/content to formulate a response. This prevents meaningful summarization or insight generation across even modest datasets (e.g., 20–50 feedback records).

What We Need Support On:

  1. Agent Configuration: Is there a way to strictly confine the Agent’s answers to the linked index content (like a closed-book RAG model)? Are there flags/settings to disable its default "open-book" behavior?
  2. Assistant Scaling: Are there any upcoming changes or workarounds to:
    • Raise or bypass the 10k file limit on vector stores?
    • Link multiple vector stores to a single Assistant?
    • Improve file ingestion flexibility without extensive manual pre-processing?
  3. Cross-Chunk & Multi-Record Context Handling: Is there a recommended architecture for enabling LLMs to reason across multiple retrieved records (e.g., summarizing 10k documents at once)? Current Assistant/Agent implementations still exhibit GPT's default behavior of responding to only the top 4–5 chunks.

Regards

Vishav Deep Singh

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.
1 answer

  1. Alex Burlachenko 9,780 Reputation points
    2025-05-28T07:51:01.8633333+00:00

    hi vishav!

    thanks for throwing this question out here, super detailed and super helpful for others wrestling with the same stuff ))

    alright, let’s break it down. you’re hitting some real pain points with azure openai assistants and agents, especially around scaling and accuracy.

    file size & token limits

    yeah, the file size thing is a headache. right now, assistants choke if docs are too big or messy. you’re already doing the smart thing with custom truncation; pre-split your docs into smaller chunks before ingestion. use something like the text splitter in langchain (or a simple python script, see the sketch below) to break them down by paragraphs or sections. that way, you avoid the line-level chaos.
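
    here’s a minimal sketch of the "simple python script" route (paragraph-aligned splitting); `max_chars` is just an illustrative budget, not an azure-documented limit, so tune it against whatever file/token cap you’re actually hitting:

    ```python
    # split one big survey file into paragraph-aligned chunks before upload.
    # max_chars is a made-up budget for illustration, not an official azure limit.
    def split_by_paragraph(text: str, max_chars: int = 4000) -> list[str]:
        chunks, current = [], ""
        for para in text.split("\n\n"):
            # start a new chunk when adding this paragraph would blow the budget
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current.strip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(current.strip())
        return chunks

    # usage: write each chunk out as its own file, then upload those instead of the original
    with open("survey_0001.txt", encoding="utf-8") as f:
        for i, chunk in enumerate(split_by_paragraph(f.read())):
            with open(f"survey_0001_part{i:03d}.txt", "w", encoding="utf-8") as out:
                out.write(chunk)
    ```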

    vector store capacity

    10k files per store is tight, no lie. for now, you’ve gotta juggle multiple stores if you’re over that limit. but! you can automate the linking part with the api: spin up new stores dynamically and attach them as needed (rough sketch below). it’s clunky, but it works. microsoft’s working on scaling this (fingers crossed), but no ETA yet.
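
    rough sketch of that automation, assuming the openai python sdk (v1.x) pointed at azure openai; the endpoint, key, api version, and "one store per batch" strategy are placeholders, and newer sdk releases expose `client.vector_stores` instead of `client.beta.vector_stores`:

    ```python
    # create a fresh vector store per batch of files, then swap which store the
    # assistant's file_search tool points at (since only one can be linked at a time).
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<aoai-key>",                                        # placeholder
        api_version="2024-05-01-preview",                            # use a version your resource supports
    )

    def new_store_for_batch(name: str, paths: list[str]) -> str:
        """Create a vector store and push one batch of files into it."""
        store = client.beta.vector_stores.create(name=name)
        client.beta.vector_stores.file_batches.upload_and_poll(
            vector_store_id=store.id,
            files=[open(p, "rb") for p in paths],
        )
        return store.id

    def point_assistant_at(assistant_id: str, store_id: str) -> None:
        """Re-link the assistant's file_search tool to a different vector store."""
        client.beta.assistants.update(
            assistant_id,
            tool_resources={"file_search": {"vector_store_ids": [store_id]}},
        )
    ```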

    agent going rogue

    ugh, the agent pulling answers outta thin air is frustrating. to lock it down, you’ve gotta hammer the instructions. like, really specific. try something like: “only use info from the linked index. if it’s not there, say ‘i don’t know’.” (a fuller example is below.) also, check the strict_mode flag in the agent config; it’s in preview, but might help.
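
    for reference, a fuller version of that kind of grounding instruction; the wording is only an illustration, paste something like it into the agent’s instructions field (portal or sdk) and adapt it to your data:

    ```python
    # illustrative grounding-only instructions; this text is an assumption, not an official template.
    GROUNDED_INSTRUCTIONS = """\
    You are a survey-feedback analyst. Answer ONLY from documents returned by the
    connected Azure AI Search index for this conversation.

    Rules:
    - Do not use outside or general knowledge, and never estimate or extrapolate counts.
    - If the retrieved documents do not contain the answer, reply exactly:
      "I don't know based on the indexed documents."
    - Any number you state (e.g., how many records mention parking space) must be
      countable in the retrieved documents, and cite the document(s) it came from.
    """
    ```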

    llm context limits

    this one’s a beast. even with RAG, the llm’s attention span is… short. to squeeze more in, try tweaking the chunk_size and overlap in your index settings; smaller chunks + overlap can help it “see” more connections. for summarization across tons of docs, you might need a hybrid approach: first, use cognitive search to pull the top relevant chunks, then feed those into the llm with a prompt like “summarize these records, not your general knowledge.” there’s a rough sketch of that flow below.
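
    a rough sketch of that hybrid flow, assuming the `azure-search-documents` and `openai` python packages; the index name, the `content` field, and `top=50` are assumptions about your setup, not fixed values:

    ```python
    # pull a wider slice of records straight from the search index, then make the
    # model summarize only what was pasted into the prompt.
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    from openai import AzureOpenAI

    search = SearchClient(
        endpoint="https://<your-search>.search.windows.net",    # placeholder
        index_name="survey-feedback",                            # placeholder index name
        credential=AzureKeyCredential("<search-key>"),
    )
    aoai = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",
        api_key="<aoai-key>",
        api_version="2024-06-01",
    )

    def summarize(query: str, top: int = 50) -> str:
        # 1) retrieve more records than the assistant/agent would surface on its own
        hits = search.search(search_text=query, top=top)
        records = "\n\n".join(f"[{i}] {doc['content']}" for i, doc in enumerate(hits))

        # 2) force the model to work only from the pasted records
        resp = aoai.chat.completions.create(
            model="gpt-4o",  # your deployment name
            temperature=0,
            messages=[
                {"role": "system", "content": "Summarize ONLY the records provided. "
                                              "Do not add outside or general knowledge."},
                {"role": "user", "content": f"Question: {query}\n\nRecords:\n{records}"},
            ],
        )
        return resp.choices[0].message.content
    ```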

    upcoming fixes?

    microsoft’s been pretty hush-hush, but the agent/assistant stuff is evolving fast. check the azure updates blog; they drop surprises there.

    hang in there! you’re already ahead of the curve by mixing assistants and agents. if you nail the pre-processing and instructions, you can brute-force your way to something workable for now.

    rgds,

    Alex

    https://ctrlaltdel.blog/

