How can I create an Azure Cognitive Search indexer to retrieve data from SharePoint, split it into chunks, and generate embeddings using skills?

Charlie 30 Reputation points


I have a SharePoint data source, an Azure Cognitive Search (ACS) resource, and an index already set up. The index has fields that I want to search via ACS (name, content, name_vector, content_vector) and metadata fields (id, path, content_type, last_modifed_date, size).

I'm trying to create an indexer to pull data from the SharePoint data source, make necessary transformations, and push that data to the index. The indexer must split the SharePoint documents into chunks and generate embeddings. This data will then fill the index: chunk text -> content; chunk embeddings -> content_vector; name text -> name; name embeddings -> name_vector; etc. Here, 'name' is the name of the SharePoint document, which means there will be the same name (and other metadata fields) for multiple chunks/entries in the index. There is not a one-to-one relationship between SharePoint documents and entries of the index.
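To make the one-to-many relationship concrete, here is a minimal Python sketch of the shape of the data I expect in the index. It is not how SplitSkill works internally; `chunk_document`, the character-window splitting, and the `<id>_<n>` key scheme are illustrative assumptions only, and the vector fields are left out.

```python
from typing import Dict, List


def chunk_document(doc: Dict[str, str], max_len: int = 20, overlap: int = 5) -> List[Dict[str, str]]:
    """Split doc["content"] into overlapping character windows.

    Returns one index entry per window. Every entry repeats the parent
    document's name and path, so several entries share the same metadata.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")

    text = doc["content"]
    step = max_len - overlap
    entries: List[Dict[str, str]] = []

    for i, start in enumerate(range(0, len(text), step)):
        entries.append({
            "id": f"{doc['id']}_{i}",       # hypothetical per-chunk key
            "name": doc["name"],            # same for every chunk of this document
            "path": doc["path"],            # same for every chunk of this document
            "content": text[start:start + max_len],
        })
        # Stop once the current window already reaches the end of the text.
        if start + max_len >= len(text):
            break

    return entries


if __name__ == "__main__":
    sample = {
        "id": "doc1",
        "name": "Quarterly report.docx",
        "path": "/sites/finance/Quarterly report.docx",
        "content": "x" * 50,
    }
    for entry in chunk_document(sample):
        print(entry["id"], entry["name"], len(entry["content"]))
```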

I saw this post which suggests that I would need an Azure Function to 'chunk and embed' the SharePoint data. That function then becomes the data source for the indexer. This might be a viable solution, but I'm curious - will the indexer still be able to detect changes on the SharePoint site if that is no longer the direct data source?

The other, and simpler, solution would be to add skills to the indexer. Microsoft has an AzureOpenAIEmbeddingSkill class and a SplitSkill class. How can I create the following indexer:

  • Input: SharePoint data source documents and metadata
  • Run Skill: Create chunks with SplitSkill
  • Run Skill: Generate embeddings from chunks with AzureOpenAIEmbeddingSkill
  • Output: Stores embeddings and metadata for each chunk in the index
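For reference, the pipeline above might be sketched as a skillset definition like the one below, built as a plain Python dictionary that could be sent to the skillset REST endpoint. This is a sketch under several assumptions, not a verified configuration: the `parent_id` field is hypothetical (it is not in my index today), the `metadata_spo_item_*` paths are what I understand the SharePoint indexer to expose, the page length and overlap values are arbitrary, authentication settings for the embedding skill are omitted, and index projections may require a preview or recent API version. If this approach is valid, both skills are built in and no Azure Function would be involved.

```python
import json
from typing import Any, Dict


def build_skillset(name: str, index_name: str, aoai_endpoint: str, aoai_deployment: str) -> Dict[str, Any]:
    """Assemble a skillset payload: split into chunks, embed, project one entry per chunk."""
    # Skill 1: split the document text into chunks ("pages").
    split_skill = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "context": "/document",
        "textSplitMode": "pages",
        "maximumPageLength": 2000,   # assumed value
        "pageOverlapLength": 200,    # assumed value
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "textItems", "targetName": "pages"}],
    }

    # Skill 2: one embedding per chunk (runs once for each item in /document/pages).
    content_embedding = {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        "context": "/document/pages/*",
        "resourceUri": aoai_endpoint,
        "deploymentId": aoai_deployment,
        "inputs": [{"name": "text", "source": "/document/pages/*"}],
        "outputs": [{"name": "embedding", "targetName": "content_vector"}],
    }

    # Skill 3: one embedding of the document name (runs once per document).
    name_embedding = {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        "context": "/document",
        "resourceUri": aoai_endpoint,
        "deploymentId": aoai_deployment,
        # Assumed SharePoint metadata path.
        "inputs": [{"name": "text", "source": "/document/metadata_spo_item_name"}],
        "outputs": [{"name": "embedding", "targetName": "name_vector"}],
    }

    # Index projections: emit one index entry per chunk, repeating parent metadata.
    index_projections = {
        "selectors": [{
            "targetIndexName": index_name,
            "parentKeyFieldName": "parent_id",  # hypothetical field
            "sourceContext": "/document/pages/*",
            "mappings": [
                {"name": "content", "source": "/document/pages/*"},
                {"name": "content_vector", "source": "/document/pages/*/content_vector"},
                {"name": "name", "source": "/document/metadata_spo_item_name"},
                {"name": "name_vector", "source": "/document/name_vector"},
                {"name": "path", "source": "/document/metadata_spo_item_path"},
            ],
        }],
        "parameters": {"projectionMode": "skipIndexingParentDocuments"},
    }

    return {
        "name": name,
        "skills": [split_skill, content_embedding, name_embedding],
        "indexProjections": index_projections,
    }


if __name__ == "__main__":
    # All argument values here are placeholders.
    payload = build_skillset(
        "sharepoint-skillset",
        "sharepoint-index",
        "https://example.openai.azure.com",
        "my-embedding-deployment",
    )
    print(json.dumps(payload, indent=2))
```

The indexer itself would then reference the SharePoint data source, the target index, and this skillset by name, so SharePoint stays the direct data source.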

What are the steps to set up this indexer? Do I need to create an Azure Function for this?

Thank you.

Azure AI Search