How to split text into size-limited records when using the SharePoint indexer in Azure Cognitive Search

Mathias Opland 140 Reputation points
2023-07-19T07:34:34.4633333+00:00

Hi,

I'm creating an index in Azure Cognitive Search that uses SharePoint as its data source. However, each SharePoint site ends up in a single record in the index, which makes the text content of some records very large. I want to set a character or word limit on a record's content, so that if a SharePoint site exceeds this limit, a new record is created.

I've tried the Text Split skill (https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit). However, as I understand it, it doesn't create multiple records; instead, it produces a list of chunks within the existing record. When I tried to create a custom skill that generated new record IDs, I got an error regarding the record ID, which leads me to believe that a skillset cannot add records to an index. Does anybody have any thoughts, or can anyone point me in the right direction?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

Accepted answer
  1. Grmacjon-MSFT 17,456 Reputation points
    2023-07-19T20:58:25.5933333+00:00

    Hi @Mathias Opland, thanks for the question. Can you share the full error message you're getting?

    And yes, you are right: a skillset cannot add records to an index. Skillsets are designed to enrich the content of existing records, not to create new ones.

    To split the content of a SharePoint site into multiple records based on a character or word limit, you can use an Azure Function to preprocess the data before it is indexed. Here's an example of how you can do this:

    Create an Azure Function that retrieves the data from SharePoint and splits it into multiple records based on a character or word limit. You can use the SharePoint REST API to retrieve the data, and a string manipulation library like Apache Commons Lang to split the text.
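    As a rough illustration of the splitting step, here is a minimal Python sketch that uses only the standard library instead of a string library. The function name `split_text` and the `max_chars` parameter are hypothetical, and the sketch assumes a character limit applied at word boundaries; retrieving the text from SharePoint is not shown.

```python
def split_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Chunks break on whitespace where possible. A single word longer than
    max_chars is cut into pieces of max_chars characters.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for word in text.split():
        # Hard-split any word that cannot fit in a chunk on its own.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]

        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = word

    if current:
        chunks.append(current)
    return chunks
```

    Note that runs of whitespace are collapsed to single spaces, so this is only suitable when the exact original spacing does not matter.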

    Configure the Azure Function to output the data in the format expected by the Azure Cognitive Search index. This may involve mapping the fields from the SharePoint data to the fields in the search index, and splitting the data into multiple records as needed.
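    One possible shape for the output records is sketched below in Python. The field names `id`, `content`, `source_url`, and `chunk_index` are assumptions and must match whatever your index schema actually defines. Each chunk gets its own key, built from the URL-safe Base64 encoding of the source URL plus the chunk number, so that the key only contains characters that are valid in a document key. The `@search.action` property is only needed if the records are sent to the index through the REST push API.

```python
import base64


def make_documents(source_url: str, chunks: list[str]) -> list[dict]:
    """Turn the chunks of one SharePoint item into separate index records.

    All field names here are hypothetical; adapt them to your index schema.
    """
    # URL-safe Base64 yields only letters, digits, '-', '_' and '='.
    base_key = base64.urlsafe_b64encode(source_url.encode("utf-8")).decode("ascii")
    return [
        {
            "@search.action": "mergeOrUpload",
            "id": f"{base_key}_{index}",  # unique key per chunk
            "content": chunk,
            "source_url": source_url,
            "chunk_index": index,
        }
        for index, chunk in enumerate(chunks)
    ]
```

    Keeping `source_url` and `chunk_index` on every record makes it possible to group the chunks of one site back together at query time.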

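    Load the split records into the search index. Note that an indexer cannot use an Azure Function URL as a data source; indexers only work with the supported data source types. Instead, either have the Azure Function push the records to the index directly (through the REST API or one of the SDKs), or have it write the records to a supported data source such as Azure Blob Storage and configure an indexer on top of that.

    Run the function (and the indexer, if you use one) to populate the search index with the data from SharePoint, split into multiple records as needed.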
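    If the function pushes the split records to the index itself rather than going through an indexer, the request could be prepared roughly as follows. This is a sketch, not a complete client: `build_index_request` is a hypothetical helper, the service name, index name, and API key are placeholders you supply, and the default `api_version` is an assumption that you should check against the version your service supports. The request is only built here, not sent.

```python
import json
import urllib.request


def build_index_request(
    service_name: str,
    index_name: str,
    api_key: str,
    documents: list[dict],
    api_version: str = "2020-06-30",  # assumption: verify for your service
) -> urllib.request.Request:
    """Build (but do not send) a POST request that uploads documents."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/index?api-version={api_version}"
    )
    # The push API expects the documents wrapped in a "value" array.
    body = json.dumps({"value": documents}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

    Sending it would be a matter of passing the returned object to `urllib.request.urlopen` and checking the per-document status in the response.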

    By using an Azure Function to preprocess the data, you can split the content of a SharePoint site into multiple records based on a character or word limit, and index the data in Azure Cognitive Search in the desired format.

    -Grace

