Trying to implement chunking and embeddings using Azure AI Search

DShree 396 Reputation points
2023-12-31T12:52:01.75+00:00

I am trying to create a search solution using Azure AI Search (formerly Cognitive Search), and I need to chunk the data so that the retrieved text is shorter and more relevant. I also want to implement hybrid search over the data and try out embedding creation.

I tried using the Text Split and AzureOpenAIEmbedding skills, but their output is not getting indexed. My goal is to use OCR, key phrase extraction, etc., collate the results with the Merge skill, and then do chunking and embeddings. Additionally, I would like to add incremental indexing as a feature. I followed the Azure documentation while creating the skillset, so please advise on where I am going wrong.

Azure OpenAI Service

1 answer

  1. Ramr-msft 17,826 Reputation points
    2024-01-02T02:33:40.79+00:00

    Thanks for the question. Retrieval Augmented Generation (RAG) is one of the most popular architectural patterns for building data-infused LLM applications. Azure OpenAI Service on your data automates many of the components of this architecture (ingestion, chunking, deployment), allowing customers to rapidly build use cases involving enterprise search or knowledge retrieval. If you're building your own RAG implementation (rather than using the AOAI one), you'll need to take more ownership of this process. For example, your document chunking pipeline will need to store the source URL alongside each chunk.

    Chunking Data: Chunking is important when source documents are too large for the maximum input size imposed by models. If your documents are too large, you must insert a chunking step into the indexing and query workflows. Inside a skillset, you can use the Text Split skill for this.
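
To make the chunking step concrete, here is a minimal Python sketch of fixed-size chunking with overlap. It is a simplified, character-based stand-in for what the Text Split skill does in its "pages" mode, not the skill's actual implementation. The function name `chunk_text` and the default sizes are assumptions chosen for illustration.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    chunk boundary still appears intact in at least one chunk.
    (Hypothetical helper; sizes are illustrative defaults.)
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Each chunk would then be embedded and indexed as its own unit, together with a reference back to the source document.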

    1. Embedding Creation: Azure AI Search doesn’t host vectorization models, so one of your challenges is creating embeddings for query inputs and outputs. You can use any embedding model, but Azure OpenAI embedding models are commonly used. Integrated vectorization, in preview at the time of writing, runs chunking and embedding as part of the indexing pipeline.
    2. Hybrid Search: Hybrid search is a combination of full text and vector queries that execute against a search index that contains both searchable plain text content and generated embeddings. A hybrid query combines full text search and vector search.
    3. Skill Output Not Getting Indexed: If the output of the Text Split and AzureOpenAIEmbedding skills is not getting indexed, there could be several reasons. Check the indexer's execution history for errors and warnings to get more specific information. Also ensure that the skills are correctly defined, that each skill's input points at the output of the previous skill, and that the skillset is attached to your indexer.
    4. Using OCR, Key Phrase Extraction, and Merge Skill: The OCR skill recognizes printed and handwritten text in image files. The Key Phrase Extraction skill evaluates unstructured text and returns a list of key phrases. The Merge skill can be used to collate the results.
    5. Incremental Indexing: Incremental enrichment refers to the use of cached enrichments during skillset execution so that only new and changed skills and documents incur AI processing. The cache contains the output from document cracking, plus the outputs of each skill for every document.
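
For items 3 to 5 above, the following is a hedged sketch of how the skills could be chained: OCR, then Merge, then Text Split, then key phrases and embeddings per chunk. It builds the skillset as a Python dictionary in the shape of the REST JSON and includes a small checker for one common wiring mistake, a skill input that does not match any earlier output path. The skillset name, target names, chunk sizes, placeholder values in angle brackets, and the helper functions `skill` and `unresolved_inputs` are all assumptions for illustration. The `cache` property and `pageOverlapLength` were only available in preview API versions at the time of this thread, so verify them against the API version you use.

```python
import json

# Paths that exist before any skill runs. normalized_images is only produced
# when the indexer sets imageAction to "generateNormalizedImages".
BUILT_IN_PATHS = {
    "/document/content",
    "/document/normalized_images/*",
    "/document/normalized_images/*/contentOffset",
}


def skill(odata_type, context, inputs, outputs, **params):
    """Build one skill definition (hypothetical helper, REST JSON shape)."""
    return {
        "@odata.type": odata_type,
        "context": context,
        **params,
        "inputs": [{"name": n, "source": s} for n, s in inputs.items()],
        "outputs": [{"name": n, "targetName": t} for n, t in outputs.items()],
    }


skillset = {
    "name": "ocr-chunk-embed-skillset",  # hypothetical name
    "skills": [
        skill("#Microsoft.Skills.Vision.OcrSkill", "/document/normalized_images/*",
              {"image": "/document/normalized_images/*"}, {"text": "text"}),
        skill("#Microsoft.Skills.Text.MergeSkill", "/document",
              {"text": "/document/content",
               "itemsToInsert": "/document/normalized_images/*/text",
               "offsets": "/document/normalized_images/*/contentOffset"},
              {"mergedText": "merged_text"}),
        skill("#Microsoft.Skills.Text.SplitSkill", "/document",
              {"text": "/document/merged_text"}, {"textItems": "pages"},
              textSplitMode="pages", maximumPageLength=2000, pageOverlapLength=200),
        skill("#Microsoft.Skills.Text.KeyPhraseExtractionSkill", "/document/pages/*",
              {"text": "/document/pages/*"}, {"keyPhrases": "keyPhrases"}),
        skill("#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill", "/document/pages/*",
              {"text": "/document/pages/*"}, {"embedding": "vector"},
              resourceUri="https://<your-resource>.openai.azure.com",
              deploymentId="<your-embedding-deployment>", apiKey="<your-key>"),
    ],
}

# Indexer settings relevant to OCR and incremental enrichment (fragment only).
indexer_fragment = {
    "skillsetName": skillset["name"],
    "parameters": {"configuration": {"imageAction": "generateNormalizedImages"}},
    "cache": {"storageConnectionString": "<storage-connection-string>",
              "enableReprocessing": True},
}


def unresolved_inputs(skillset_def):
    """Return (skill type, source) pairs whose source no earlier step produces."""
    available = set(BUILT_IN_PATHS)
    missing = []
    for s in skillset_def["skills"]:
        for inp in s["inputs"]:
            src = inp["source"]
            base = src[:-2] if src.endswith("/*") else src  # "/x/*" iterates "/x"
            if src not in available and base not in available:
                missing.append((s["@odata.type"], src))
        for out in s["outputs"]:
            available.add(f'{s["context"]}/{out["targetName"]}')
    return missing


payload = json.dumps(skillset, indent=2)  # body for a create-skillset request
```

Calling `unresolved_inputs(skillset)` returns an empty list when every input resolves. Note that enriched output only reaches the index when it is mapped to index fields, through output field mappings on the indexer or, for one search document per chunk, index projections in the preview API. That mapping is another thing worth checking when skills run without errors but nothing appears in the index.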
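
For the hybrid search item above, here is a small sketch of what a hybrid request and its result merging look like. The request body follows the shape of the 2023-11-01 REST API (`search` plus `vectorQueries`), and the field name `contentVector` is an assumption about your index. The Azure AI Search documentation describes merging the text and vector rankings with Reciprocal Rank Fusion (RRF). The function below is a plain illustration of that formula with the commonly used constant 60, not the service's actual code.

```python
from collections import defaultdict


def hybrid_query_body(text, query_vector, vector_field="contentVector", top=5):
    """Build a hybrid query body: full-text search plus one vector query.

    vector_field is a hypothetical index field holding the chunk embeddings.
    """
    return {
        "search": text,
        "vectorQueries": [{
            "kind": "vector",
            "vector": list(query_vector),
            "fields": vector_field,
            "k": top,
        }],
        "top": top,
    }


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


text_hits = ["a", "b", "c"]    # ranking from the full-text query
vector_hits = ["b", "d", "a"]  # ranking from the vector query
merged = reciprocal_rank_fusion([text_hits, vector_hits])
```

Documents that rank well in both lists ("b" and "a" here) rise to the top, which is the practical benefit of a hybrid query over either query type alone.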
