How can I create an Azure Cognitive Search indexer to retrieve data from SharePoint, split it into chunks, and generate embeddings using skills?

Charlie 30 Reputation points
2023-10-20T14:09:58.95+00:00

Hello,

I have a SharePoint data source, an Azure Cognitive Search (ACS) resource, and an index already set up. The index has fields that I want to search via ACS (name, content, name_vector, content_vector) and metadata fields (id, path, content_type, last_modifed_date, size).

I'm trying to create an indexer to pull data from the SharePoint data source, make necessary transformations, and push that data to the index. The indexer must split the SharePoint documents into chunks and generate embeddings. This data will then fill the index: chunk text -> content; chunk embeddings -> content_vector; name text -> name; name embeddings -> name_vector; etc. Here, 'name' is the name of the SharePoint document, which means there will be the same name (and other metadata fields) for multiple chunks/entries in the index. There is not a one-to-one relationship between SharePoint documents and entries of the index.
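To make the one-to-many shape concrete, here is a minimal Python sketch of the fan-out described above. It is only an illustration: the character-based splitting, the `max_len` and `overlap` defaults, and the `id` format are assumptions, and the vector fields are left out because they would come from a separate embedding step.

```python
def split_into_chunks(text, max_len=2000, overlap=200):
    """Split text into character windows of at most max_len, overlapping by `overlap`."""
    if max_len <= overlap:
        raise ValueError("max_len must be larger than overlap")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        # Stop once this window already reaches the end of the text.
        if start + max_len >= len(text):
            break
    return chunks


def fan_out(doc):
    """Turn one SharePoint document into one index entry per chunk."""
    entries = []
    for i, chunk in enumerate(split_into_chunks(doc["content"])):
        entries.append({
            "id": f"{doc['id']}_{i}",  # hypothetical per-chunk key
            "name": doc["name"],        # repeated for every chunk
            "path": doc["path"],        # repeated for every chunk
            "content": chunk,
            # content_vector and name_vector would be filled by an embedding step
        })
    return entries
```

Every entry produced from the same document carries the same name and path, and only the id and content differ.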

I saw this post which suggests that I would need an Azure Function to 'chunk and embed' the SharePoint data. That function would then become the data source for the indexer. This might be a viable solution, but I'm curious: will the indexer still be able to detect changes on the SharePoint site if SharePoint is no longer the direct data source?

The other, and simpler, solution would be to add skills to the indexer. Microsoft has an AzureOpenAIEmbeddingSkill class and a SplitSkill class. How can I create the following indexer:

  • Input: SharePoint data source documents and metadata
  • Run Skill: Create chunks with SplitSkill
  • Run Skill: Generate embeddings from chunks with AzureOpenAIEmbeddingSkill
  • Output: Stores embeddings and metadata for each chunk in the index

What are the steps to set up this indexer? Do I need to create an Azure Function for this?

Thank you.

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. Andrej Melicher 90 Reputation points
    2024-08-17T14:59:52.3333333+00:00

    I found this very useful article about using the index projection approach when creating a skillset that chunks documents and generates embeddings for each chunk within a single indexer.

    The example uses Azure Blob Storage, but I was able to get the same results with a SharePoint document library as the data source.

