About data sources created with Azure OpenAI Service "Add Your Data"

T. TABATA 70 Reputation points
2023-10-06T02:41:41.88+00:00

When you point "Add Your Data" at data in Blob Storage, it automatically creates an index whose contents are divided into chunks, and it creates an indexer at the same time.

Given that, when I upload a new file that needs chunking to Blob Storage, will the indexer chunk it automatically and reflect it in the index?


1 answer

Sort by: Most helpful
  1. Grmacjon-MSFT 19,151 Reputation points Moderator
    2023-10-10T05:11:36.41+00:00

    Hi @Anonymous, thank you for the question.

    Yes, when you use the “Add Your Data” feature in Azure OpenAI Service, it does create an index and an indexer. However, the process of chunking the data is not automatically handled by the indexer.

    Chunking is important because the models used to generate embedding vectors impose maximum limits on the size of the text fragments they accept as input. If your source documents exceed the model's maximum input size, you will need to insert a chunking step into your workflow.
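    As a rough illustration, you can estimate whether a document will exceed the model's input size before indexing. The 8191-token limit below is the documented input limit of the text-embedding-ada-002 model, and the 4-characters-per-token ratio is only a common rule of thumb; adjust both for the model you actually deploy.

    ```python
    def estimate_tokens(text: str) -> int:
        """Very rough token estimate: ~4 characters per token for English text.
        Real pipelines should use the model's actual tokenizer (e.g. tiktoken)."""
        return len(text) // 4

    def needs_chunking(text: str, max_input_tokens: int = 8191) -> bool:
        """Return True if the text likely exceeds the embedding model's input limit.
        8191 is the limit for text-embedding-ada-002; change it for other models."""
        return estimate_tokens(text) > max_input_tokens
    ```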

    Here’s a simple process to add your data using Azure OpenAI Studio:

    1. Navigate to Azure OpenAI Studio and sign in with credentials that have access to your Azure OpenAI resource.
    2. During or after the sign-in workflow, select the appropriate directory, Azure subscription, and Azure OpenAI resource.
    3. Select the Chat playground tile.
    4. On the Assistant setup tile, select Add your data (preview) > + Add a data source.
    5. In the pane that appears, select Upload files under Select data source.
    6. Azure OpenAI needs both a storage resource and a search resource to access and index your data.

    For documents and datasets with long text, it's recommended to use a data preparation script. If you have large documents, you must insert a chunking step into your indexing and query workflows to break up the large text. Libraries that provide chunking include LangChain and Semantic Kernel.
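    To make the idea concrete, here is a minimal sketch of the kind of fixed-size, overlapping split these libraries perform. Real splitters (for example LangChain's RecursiveCharacterTextSplitter) are smarter: they prefer paragraph and sentence boundaries and can count tokens rather than characters.

    ```python
    def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
        """Split text into fixed-size chunks that overlap, so content spanning
        a chunk boundary appears in both neighboring chunks."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            # Advance by chunk_size minus overlap so consecutive chunks share text.
            start += chunk_size - overlap
        return chunks
    ```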

    Even though this process does not automatically chunk your data into smaller pieces, you can attach a custom skill to a skillset to bring chunking into the indexing pipeline:

    https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vector/EmbeddingGenerator/README.md

    https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb
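    For orientation, a custom Web API skill is an HTTP endpoint that receives and returns JSON in the enrichment skill contract: a `values` array of records, each with a `recordId` and a `data` payload. The handler logic might look like the sketch below; the field names `text` and `chunks`, and the naive fixed-size split, are illustrative assumptions, since your skillset's input and output field mappings determine the real names.

    ```python
    def run_chunking_skill(request_body: dict, chunk_size: int = 1000) -> dict:
        """Process a custom Web API skill request: split each record's text into
        fixed-size chunks and return them in the response shape the search
        service expects ({"values": [...]})."""
        results = []
        for record in request_body.get("values", []):
            text = record["data"].get("text", "")
            chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
            results.append({
                "recordId": record["recordId"],  # must echo the incoming recordId
                "data": {"chunks": chunks},
                "errors": [],
                "warnings": [],
            })
        return {"values": results}
    ```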

    Hope that helps.

    Best,

    Grace

