Azure AI Search - updating documents in blob storage

Pavlos Lampadaris 0 Reputation points
2024-07-30T09:44:56.4033333+00:00

I use Azure AI Search and run queries against documents stored in Azure Blob Storage. I have found that after I modify a document and run the indexer successfully, a query against that file returns both the original and the changed document. I didn't expect this, because there is only one file in storage: the modified one.

Interestingly, search.score is higher for the original file than for the updated one, which means that my application will eventually receive a citation containing the original file and not the updated one.

All tests were done in the Azure portal; there is no code involved.

This is the process I followed:

  • upload a document to Azure blob storage
  • run the indexer successfully
  • modify the document locally (a subtitle change) and upload it to blob storage again
  • run the indexer successfully
  • run a query on the index against the file

Checking the query results:

  • the results contain both the original and modified file contents
  • the result with the original file contents has higher search.score than the result containing the modified file contents.
  • the result containing the modified file contents has null values in fields filepath, title, chunk_id and last_updated.
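
One way to inspect such a result set outside the portal is to separate the hits that carry source metadata from those that do not. This is only a minimal sketch: it assumes the query results have been exported as a list of dicts using the field names listed above (filepath, title, chunk_id, last_updated) plus `@search.score`. The helper name and the sample values are hypothetical and are not taken from the actual index.

```python
# Fields that were null on the result containing the modified contents.
METADATA_FIELDS = ("filepath", "title", "chunk_id", "last_updated")


def split_by_metadata(results):
    """Split search hits into (complete, incomplete) by metadata presence.

    A hit is 'complete' when every metadata field is present and not None.
    """
    complete, incomplete = [], []
    for hit in results:
        if all(hit.get(name) is not None for name in METADATA_FIELDS):
            complete.append(hit)
        else:
            incomplete.append(hit)
    return complete, incomplete


# Hypothetical sample mirroring the behavior described in the question.
sample_results = [
    {"@search.score": 2.1, "content": "Old subtitle", "filepath": "doc.pdf",
     "title": "Doc", "chunk_id": "0", "last_updated": "2024-07-29"},
    {"@search.score": 1.4, "content": "New subtitle", "filepath": None,
     "title": None, "chunk_id": None, "last_updated": None},
]

complete, incomplete = split_by_metadata(sample_results)
print("hits with metadata:", [h["content"] for h in complete])
print("hits without metadata:", [h["content"] for h in incomplete])
```

Hits that land in the second list are likely to have been written to the index by a different path than the ones in the first list, which is worth checking before looking at scoring.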

Note that:

  • in the blob storage account, Data Management / Data protection / Enable soft delete for blobs is turned on
  • in the data source, Track deletions is set to Native blob soft delete
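
For reference, the "Native blob soft delete" portal option corresponds to a deletion detection policy on the data source definition. The sketch below builds such a definition as a Python dict, in the shape I would expect the REST API to accept. The data source name, container name, and connection string placeholder are hypothetical, and the property names should be checked against the current REST reference before use.

```python
import json


def build_blob_datasource(name, container, connection_string):
    """Build a blob data source definition with native soft-delete detection.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
        # Assumed to match the portal setting "Track deletions / Native blob soft delete".
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
        },
    }


# Hypothetical names; nothing is sent to any service here.
payload = build_blob_datasource("my-blob-datasource", "my-container", "<connection-string>")
print(json.dumps(payload, indent=2))
```

Comparing this shape with the JSON definition shown for the data source in the portal is a quick way to confirm that the deletion policy was actually saved.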

Has anyone experienced this behavior?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.
Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

2 answers

  1. Konstantinos Passadis 19,251 Reputation points MVP
    2024-07-30T20:25:51.6733333+00:00

    Hello @Pavlos Lampadaris

    Thank you for the update!

    The first thing I have to point out is this:

    Once the index is created and populated, it exists independently of your blob container, but you can rerun indexing operations to refresh your index based on changed documents. Timestamp information on individual blobs is used for change detection. You can opt for either scheduled execution or on-demand indexing as the refresh mechanism.

    AND:

    After an initial search index is created, you might want subsequent indexer jobs to only pick up new and changed documents. For indexed content that originates from Azure Storage, change detection occurs automatically because indexers keep track of the last update using the built-in timestamps on objects and files in Azure Storage.

    Although change detection is a given, deletion detection isn't. An indexer doesn't track object deletion in data sources. To avoid having orphan search documents, you can implement a "soft delete" strategy that results in deleting search documents first, with physical deletion in Azure Storage following as a second step.
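
The difference between change detection and deletion detection described above can be illustrated with a toy model. This is not how the service is implemented: `ToyIndexer`, its timestamp comparison, and the blob names are all hypothetical, and the sketch only shows why a timestamp-based refresh picks up changes while leaving removed items in the index.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndexer:
    """Toy model: re-index blobs whose timestamp is newer than the last run."""
    high_water_mark: float = 0.0
    index: dict = field(default_factory=dict)  # blob name -> indexed content

    def run(self, blobs):
        """blobs maps name -> (last_modified, content). Returns processed names."""
        processed = []
        for name, (last_modified, content) in blobs.items():
            if last_modified > self.high_water_mark:
                self.index[name] = content  # same key, so the entry is overwritten
                processed.append(name)
        if blobs:
            newest = max(ts for ts, _ in blobs.values())
            self.high_water_mark = max(self.high_water_mark, newest)
        return processed


indexer = ToyIndexer()
blobs = {"a.txt": (1.0, "v1"), "b.txt": (1.0, "b")}
first_run = indexer.run(blobs)      # both blobs are new

blobs["a.txt"] = (2.0, "v2")        # modified blob gets a newer timestamp
second_run = indexer.run(blobs)     # only the changed blob is picked up

del blobs["b.txt"]                  # physical delete in storage
third_run = indexer.run(blobs)      # nothing newer, so nothing is processed

print(first_run, second_run, third_run)
print(sorted(indexer.index))        # "b.txt" is still in the index (orphan)
```

In this model an update replaces the old entry only because the key stays the same. If the old and new content end up under different keys, the old entries behave like the orphan in the last step.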

    In your case, since the change is made outside the index source and the previous document has already been broken apart, the index has kept that information and is adding the new title as new information rather than as an update. Try making the title field filterable, facetable, and searchable, and let's see what happens. I will try to reproduce the issue in the meantime.

    --

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards


  2. Konstantinos Passadis 19,251 Reputation points MVP
    2024-08-04T00:06:16.99+00:00

    Hello @Pavlos Lampadaris

    Did you try another approach, as we discussed?

    Kindly let us know

    If any answer was helpful, kindly mark it as Accepted and upvote it.

    It can help others overcome similar issues as well.

    Thank you!

