How to best structure skillsets and indexing flow for chunking, embedding, and querying uploaded files via Azure AI Search and OpenAI?

Mo Ibrahim 0 Reputation points
2025-07-02T11:47:26.2466667+00:00

I'm building a pipeline to upload academic files (e.g., PDFs), split and embed them using skillsets, and then query them via Azure AI Search + OpenAI. My goal is to generate academic questions using document content only.


Current Flow

  1. Upload a file to Blob Storage with metadata (e.g., userId, fileName).
  2. The indexer runs with a skillset that includes (a sketch of the skillset definition appears after this list):
    • SplitSkill (mode: pages, max length 4000, overlap 300)
    • AzureOpenAIEmbeddingSkill for generating vector embeddings.
  3. The index projection outputs:
    • chunk, chunk_vector, fileName, userId, blobPath
  4. I then call Azure OpenAI with azure_search configured as the data_source (a simplified sketch of this call appears after this list).
  5. The final prompt sent to OpenAI (via /chat/completions) includes:
    • A system prompt like:
      "You are a virtual assistant for academic instructors."
    • Instructions to return {Questions:[]} if the document has insufficient or unrelated content.
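
For reference, the skillset in steps 2 and 3 corresponds roughly to the sketch below (a REST PUT of the skillset definition). The service endpoint, keys, index name, embedding deployment, and source paths are placeholders rather than my exact configuration:

```python
import requests

# All of these are placeholders -- substitute your own service, keys, and names.
SEARCH_ENDPOINT = "https://<your-search-service>.search.windows.net"
SEARCH_ADMIN_KEY = "<search-admin-key>"
API_VERSION = "2024-07-01"

skillset = {
    "name": "academic-chunking-skillset",  # hypothetical skillset name
    "skills": [
        {
            # Split extracted text into overlapping ~4000-character pages
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "textSplitMode": "pages",
            "maximumPageLength": 4000,
            "pageOverlapLength": 300,
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "textItems", "targetName": "pages"}],
        },
        {
            # Embed each chunk with an Azure OpenAI embedding deployment
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "context": "/document/pages/*",
            "resourceUri": "https://<your-aoai-resource>.openai.azure.com",
            "apiKey": "<aoai-api-key>",
            "deploymentId": "text-embedding-3-small",
            "modelName": "text-embedding-3-small",
            "inputs": [{"name": "text", "source": "/document/pages/*"}],
            "outputs": [{"name": "embedding", "targetName": "chunk_vector"}],
        },
    ],
    # Project one search document per chunk into the chunk index
    "indexProjections": {
        "selectors": [
            {
                "targetIndexName": "academic-chunks",  # hypothetical index name
                "parentKeyFieldName": "parent_id",
                "sourceContext": "/document/pages/*",
                "mappings": [
                    {"name": "chunk", "source": "/document/pages/*"},
                    {"name": "chunk_vector", "source": "/document/pages/*/chunk_vector"},
                    {"name": "fileName", "source": "/document/metadata_storage_name"},
                    {"name": "userId", "source": "/document/userId"},
                    {"name": "blobPath", "source": "/document/metadata_storage_path"},
                ],
            }
        ],
        "parameters": {"projectionMode": "skipIndexingParentDocuments"},
    },
}

resp = requests.put(
    f"{SEARCH_ENDPOINT}/skillsets/{skillset['name']}",
    params={"api-version": API_VERSION},
    headers={"Content-Type": "application/json", "api-key": SEARCH_ADMIN_KEY},
    json=skillset,
)
resp.raise_for_status()
```

This assumes the target chunk index (academic-chunks here) already defines chunk, chunk_vector (with matching vector dimensions), fileName, userId, blobPath, and parent_id, and that userId arrives as blob metadata.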

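The call in step 4 is roughly the following sketch (openai Python SDK). The endpoints, keys, api_version, deployment names, and filter expression are placeholders:

```python
from openai import AzureOpenAI

# Placeholders -- substitute your own endpoints, keys, deployments, and filter.
client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<aoai-api-key>",
    api_version="2024-10-21",
)

completion = client.chat.completions.create(
    model="<chat-deployment-name>",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a virtual assistant for academic instructors. "
                "Use only the retrieved document content. If the content is "
                'insufficient or unrelated, return {"Questions": []}.'
            ),
        },
        {"role": "user", "content": "Generate exam questions from this document."},
    ],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://<your-search-service>.search.windows.net",
                    "index_name": "academic-chunks",
                    "authentication": {"type": "api_key", "key": "<search-query-key>"},
                    # Scope retrieval to the caller's uploaded file
                    "filter": "userId eq '<user-id>' and fileName eq '<file-name>'",
                    "embedding_dependency": {
                        "type": "deployment_name",
                        "deployment_name": "text-embedding-3-small",
                    },
                    "query_type": "vector",
                    "in_scope": True,
                },
            }
        ]
    },
)

print(completion.choices[0].message.content)
```
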
What I Need Help With

  • Can I reduce the size of each chunk by stripping irrelevant characters and whitespace before the SplitSkill runs, so embedding tokens aren't wasted on junk formatting?
  • Can I apply character normalization (e.g., regex-based spacing and junk removal) in a single skill, ideally before splitting? (A rough sketch of what I have in mind follows this list.)
  • Is there a more efficient or alternative approach to achieve this overall process (splitting, embedding, indexing, and querying)? I’m open to other architectural patterns or using different Azure tools.
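
To illustrate the kind of pre-split normalization I have in mind: a Custom Web API skill backed by an Azure Function could clean the text before it reaches the SplitSkill. This is only a rough sketch of the idea, not something I have running; the route name and the cleanedText output field are hypothetical.

```python
import json
import re

import azure.functions as func

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)


@app.route(route="clean_text", methods=["POST"])
def clean_text(req: func.HttpRequest) -> func.HttpResponse:
    """Custom skill contract: {"values": [{"recordId", "data": {"text"}}]} in,
    {"values": [{"recordId", "data": {"cleanedText"}, ...}]} out."""
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text") or ""
        # Drop control characters, then collapse runs of whitespace
        cleaned = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        results.append({
            "recordId": record["recordId"],
            "data": {"cleanedText": cleaned},
            "errors": [],
            "warnings": [],
        })
    return func.HttpResponse(
        json.dumps({"values": results}), mimetype="application/json"
    )
```

The skill would be registered as a #Microsoft.Skills.Custom.WebApiSkill with input text from /document/content and output cleanedText, and the SplitSkill would then read from /document/cleanedText instead of /document/content.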

1 answer

  1. Manas Mohanty 6,285 Reputation points Microsoft External Staff Moderator
    2025-07-07T06:30:40.0833333+00:00

    Hi Mo Ibrahim

    It seems you are able to pull data correctly with AI Search queries, but the results are not coming through correctly on the Azure OpenAI side.

    Could you try the following and let us know?

    1. Allow the Azure OpenAI and Storage URLs in the AI Search CORS (cross-origin resource sharing) policy.
    2. Add Azure OpenAI as a trusted resource on the AI Search side: https://learn.microsoft.com/en-us/azure/search/service-configure-firewall
    3. Lower the temperature (to 0.3 or 0.2) and increase Top_N and Top_K (a short example follows the documentation link below).
    4. Improve the system instructions with specifics about the desired output (more examples and fallback behavior).
    5. Check the official .NET SDK sample code for the "add your own data" scenario:

    https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/use-your-data?tabs=ai-search%2Ccopilot
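
    As a rough illustration of points 3 and 4, the suggested settings map onto the request like this (example values only, not tuned recommendations):

```python
# Example values only -- tune for your own data and prompts.
generation_settings = {
    "temperature": 0.2,  # lower temperature for more deterministic output
}

retrieval_settings = {
    "top_n_documents": 10,  # retrieve more chunks per query (Top_N)
    "strictness": 3,        # relevance threshold, 1 (lenient) to 5 (strict)
    "in_scope": True,       # answer only from retrieved document content
}
```

    Here temperature goes on the chat completions request itself, while top_n_documents, strictness, and in_scope go inside the azure_search data source parameters block shown in the question.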

    On the AI Search side, you can opt for a hybrid approach (semantic + vector retrieval); a rough query sketch follows below.
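
    A rough sketch of such a hybrid query (keyword plus vector, reranked by the semantic ranker) with the azure-search-documents Python SDK is below; it is useful for checking retrieval quality independently of Azure OpenAI. Every endpoint, key, deployment, field, and semantic configuration name is a placeholder, and it assumes a semantic configuration exists on the index:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Placeholders throughout -- substitute your own endpoints, keys, and names.
aoai = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<aoai-api-key>",
    api_version="2024-10-21",
)
query = "photosynthesis light reactions"
embedding = aoai.embeddings.create(
    model="text-embedding-3-small", input=query
).data[0].embedding

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="academic-chunks",
    credential=AzureKeyCredential("<query-key>"),
)

# Hybrid: keyword search + vector search, reranked by the semantic ranker.
results = search_client.search(
    search_text=query,
    vector_queries=[VectorizedQuery(
        vector=embedding, k_nearest_neighbors=10, fields="chunk_vector"
    )],
    query_type="semantic",
    semantic_configuration_name="<your-semantic-config>",
    filter="userId eq '<user-id>'",
    select=["chunk", "fileName", "blobPath"],
    top=10,
)

for doc in results:
    print(doc["fileName"], doc["chunk"][:120])
```
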

    Please share your configuration details on the Azure AI Search side (CORS policy) and the Azure OpenAI side (Top_N, Top_K, temperature, chunk size) for further assistance.

    Thank you.

