Document not deleted when Blob is removed using Indexers Automatically generated from Azure OpenAI Data Studio

Abby Greentree 171 Reputation points
2024-06-10T15:21:44.2066667+00:00

I am testing an Azure OpenAI On Your Data solution. I have set up an Azure OpenAI resource and walked through the 'Add your data' workflow with Azure Blob Storage as the backing data source. As discussed here, this generates two indexers in the associated Azure Search resource: one indexer to chunk the data, and one indexer to index the chunks.

Periodically we upload, change, or delete blobs in the source container and re-run the indexers. We have observed that when a blob is removed, the associated documents no longer appear in the first index (populated by the indexer that chunks the documents) but still appear in the final index (populated by the indexer that indexes the chunked documents), leaving them as orphaned documents.

Is anyone else seeing this issue with orphaned documents when implementing Azure OpenAI On Your Data with Azure Blob Storage using the Azure OpenAI Data Studio? I suspect that either (a) the first indexer is still chunking removed documents, or (b) the second indexer is failing to detect or remove deleted documents from where the chunking indexer stores them.
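To make the symptom concrete, here is a minimal sketch of how orphaned chunk documents could be identified once the chunk documents and the list of live blobs have been exported. The field names `id` and `url` and the helper `find_orphaned_chunks` are assumptions for illustration; the actual field that links a chunk to its source blob depends on the index schema the workflow generated.

```python
from typing import Iterable


def find_orphaned_chunks(
    chunk_docs: Iterable[dict],
    live_blob_urls: Iterable[str],
    parent_field: str = "url",  # assumed name of the field pointing at the source blob
) -> list[str]:
    """Return the ids of chunk documents whose source blob no longer exists."""
    live = set(live_blob_urls)
    return [doc["id"] for doc in chunk_docs if doc.get(parent_field) not in live]


# Hypothetical data: two chunks from a live blob, one from a deleted blob.
chunks = [
    {"id": "c1", "url": "https://example.invalid/docs/a.pdf"},
    {"id": "c2", "url": "https://example.invalid/docs/a.pdf"},
    {"id": "c3", "url": "https://example.invalid/docs/b.pdf"},
]
live_blobs = ["https://example.invalid/docs/a.pdf"]
orphans = find_orphaned_chunks(chunks, live_blobs)
print(orphans)
```

A chunk with no parent field at all is also reported, since it cannot be traced back to any live blob.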

I have confirmed that our blob storage containers and the search data sources meet the requirements outlined here that should allow indexers to detect and remove deleted documents.


Accepted answer
  1. SnehaAgrawal-MSFT 19,676 Reputation points
    2024-06-20T04:58:07.9433333+00:00

    @Abby Greentree Thanks for the reply! I reached out to the product team, and they confirmed this is expected: for containers with a large number of documents, deletion tracking might not take effect immediately and may only take effect a few refreshes later.

    A few questions regarding your setup: how many documents do you have in your container? Also, you mentioned that you followed the steps to manually enable deletion tracking on your indexers. Does that mean it wasn't enabled initially?

    I have also reached out to you via private message for details.

    1 person found this answer helpful.

1 additional answer

  1. SnehaAgrawal-MSFT 19,676 Reputation points
    2024-06-14T07:59:15.7933333+00:00

    @Abby Greentree Thanks for asking the question!

    When your blob contains multiple documents, if you delete the blob directly, the indexer won't know about it and won't delete anything from the index.

    To make the indexer delete documents, use a soft-delete deletion detection policy, for example:

    {
      "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
      "softDeleteColumnName": "IsDeleted",
      "softDeleteMarkerValue": "true"
    }
    

    When you want to delete a document, add "IsDeleted": true to the JSON object. After the indexer detects and processes these soft deletes, you can then do a hard delete to remove the blob entirely.
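As a rough sketch of the steps above, the snippet below builds a blob data source definition that carries the soft-delete policy and marks a JSON document as deleted. The data source name, container name, connection string placeholder, and the helper functions are all hypothetical; only the policy fields come from the example above. Whether the marker is read from a document field or from blob metadata depends on how the indexer is configured, so treat this as an illustration rather than a definitive setup.

```python
import json


def soft_delete_policy(column: str = "IsDeleted", marker: str = "true") -> dict:
    """Build the soft-delete deletion detection policy shown above."""
    return {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": column,
        "softDeleteMarkerValue": marker,
    }


def blob_data_source(name: str, container: str, connection_string: str) -> dict:
    """Assemble a blob data source definition that includes the policy."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
        "dataDeletionDetectionPolicy": soft_delete_policy(),
    }


def mark_deleted(document: dict, column: str = "IsDeleted") -> dict:
    """Return a copy of a JSON document flagged for soft deletion."""
    return {**document, column: True}


# Placeholder values; substitute your own names and secrets.
definition = blob_data_source("my-datasource", "my-container", "<connection-string>")
flagged = mark_deleted({"id": "doc-1", "content": "example"})
print(json.dumps(definition, indent=2))
print(json.dumps(flagged))
```

After the indexer has run and picked up the flagged documents, the blob itself can be hard deleted.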

    https://learn.microsoft.com/en-us/rest/api/searchservice/reset-indexer

    Let us know if the issue remains.
