Change and delete detection using indexers for Azure Storage in Azure Cognitive Search

After an initial search index is created, you might want subsequent indexer jobs to only pick up new and changed documents. For indexed content that originates from Azure Storage, change detection occurs automatically because indexers keep track of the last update using the built-in timestamps on objects and files in Azure Storage.
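
As a rough illustration of timestamp-based change detection, the following Python sketch keeps a high-water mark and selects only objects modified after it. This is a conceptual model under simplifying assumptions, not the indexer's actual implementation; the `StoredObject` type, the `pick_changed` function, and the sample file names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical stand-in for a blob or file and its built-in timestamp."""
    name: str
    last_modified: datetime


def pick_changed(objects, high_water_mark):
    """Return the objects modified after the mark, plus the advanced mark."""
    changed = [o for o in objects
               if high_water_mark is None or o.last_modified > high_water_mark]
    new_mark = max((o.last_modified for o in changed), default=high_water_mark)
    return changed, new_mark


def utc(day):
    """Helper for building sample timestamps in January 2021 (UTC)."""
    return datetime(2021, 1, day, tzinfo=timezone.utc)


# First run: there is no mark yet, so every object is picked up.
objects = [StoredObject("a.pdf", utc(1)), StoredObject("b.pdf", utc(2))]
first_batch, mark = pick_changed(objects, None)

# Second run: b.pdf was modified and c.pdf is new; a.pdf is skipped.
objects = [StoredObject("a.pdf", utc(1)), StoredObject("b.pdf", utc(3)),
           StoredObject("c.pdf", utc(4))]
second_batch, mark = pick_changed(objects, mark)
print([o.name for o in second_batch])
```

An object that was deleted between runs never appears in the listing, so a model like this has nothing to act on, which is why deletion needs a separate strategy.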

Although change detection is a given, deletion detection is not. An indexer doesn't track object deletion in data sources. To avoid orphaned search documents, you can implement a "soft delete" strategy that deletes the search documents first and physically deletes the content in Azure Storage as a second step.

There are two ways to implement a soft delete strategy:

  • Native blob soft delete (preview)

  • Soft delete using custom metadata

Prerequisites

  • Use an Azure Storage indexer for Blob Storage, Table Storage, File Storage, or Data Lake Storage Gen2.

  • Use consistent document keys and file structure. Changing document keys or directory names and paths (applies to ADLS Gen2) breaks the internal tracking information used by indexers to know which content was indexed, and when it was last indexed.

Note

ADLS Gen2 allows directories to be renamed. When a directory is renamed, the timestamps for the blobs in that directory are not updated, so the indexer will not re-index those blobs. If you need the blobs to be re-indexed after a directory rename because they now have new URLs, update the LastModified timestamp for every blob in the directory so that the indexer re-indexes them during a future run. The virtual directories in Azure Blob Storage cannot be changed, so they do not have this issue.
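
One possible way to refresh LastModified after a directory rename is to rewrite each blob's existing metadata, because a metadata write updates the blob's last-modified time. The sketch below assumes the `azure-storage-blob` Python package and a `ContainerClient` created elsewhere. The function name `touch_blobs` and the `hdi_isfolder` check for directory entries are assumptions made for illustration, so verify the behavior against your own storage account before relying on it.

```python
def touch_blobs(container_client, directory_prefix):
    """Rewrite each blob's metadata unchanged so its LastModified advances.

    `container_client` is assumed to be an azure.storage.blob.ContainerClient.
    Returns the names of the blobs that were touched.
    """
    touched = []
    # include=["metadata"] asks the listing to return each blob's metadata.
    for blob in container_client.list_blobs(
            name_starts_with=directory_prefix, include=["metadata"]):
        metadata = dict(blob.metadata or {})
        # Assumption: ADLS Gen2 directory entries carry hdi_isfolder=true.
        if metadata.get("hdi_isfolder") == "true":
            continue
        # Writing the same metadata back is the "touch" operation.
        container_client.get_blob_client(blob.name).set_blob_metadata(metadata)
        touched.append(blob.name)
    return touched
```

A call might look like `touch_blobs(container_client, "renamed-directory/")`, where the prefix is the directory's new path.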

Native blob soft delete (preview)

For this deletion detection approach, Cognitive Search depends on the native blob soft delete feature in Azure Blob Storage to determine whether blobs have transitioned to a soft deleted state. When blobs are detected in this state, a search indexer uses this information to remove the corresponding document from the index.

Important

Support for native blob soft delete is in preview under Supplemental Terms of Use. The REST API version 2020-06-30-Preview provides this feature. There is currently no .NET SDK support.

Requirements for native soft delete

  • Enable soft delete for blobs.
  • Blobs must be in an Azure Blob Storage container. The Cognitive Search native blob soft delete policy is not supported for blobs in ADLS Gen2.
  • Document keys for the documents in your index must be mapped to either a blob property or blob metadata.
  • To configure support for soft delete, you must use either the preview REST API (api-version=2020-06-30-Preview) or the indexer data source configuration for your Cognitive Search service in the Azure portal.

How to configure deletion detection using native soft delete

  1. In Blob storage, when enabling soft delete, set the retention policy to a value that's much higher than your indexer interval schedule. This way, if there's an issue running the indexer or if you have a large number of documents to index, there's plenty of time for the indexer to eventually process the soft deleted blobs. An Azure Cognitive Search indexer deletes a document from the index only if it processes the blob while the blob is in a soft deleted state.

  2. In Cognitive Search, set a native blob soft deletion detection policy on the data source. You can do this either from the Azure portal or by using the preview REST API (api-version=2020-06-30-Preview).

     To set the policy in the Azure portal:

     1. Sign in to the Azure portal.

     2. On the Cognitive Search service Overview page, go to New Data Source, a visual editor for specifying a data source definition.

        [Screenshot of portal data source.]

     3. On the New Data Source form, fill out the required fields, select the Track deletions checkbox, and choose Native blob soft delete. Then select Save to enable the feature on data source creation.

        [Screenshot of portal data source native soft delete.]
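
For the REST alternative mentioned in step 2, a data source definition sets `dataDeletionDetectionPolicy` to the native blob soft delete policy type. The following Python sketch only builds the request and does not send it. The service name, data source name, container name, connection string, and admin key are placeholders, and the helper `build_datasource_request` is hypothetical.

```python
import json
import urllib.request

API_VERSION = "2020-06-30-Preview"


def build_datasource_request(service, name, container, connection_string, admin_key):
    """Build (but do not send) a PUT request that creates the data source."""
    body = {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
        # Policy that turns on native blob soft delete detection.
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
        },
    }
    url = (f"https://{service}.search.windows.net/datasources/{name}"
           f"?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# All argument values below are placeholders, not real resources.
request = build_datasource_request(
    service="my-search-service",
    name="blob-datasource",
    container="my-container",
    connection_string="<storage-connection-string>",
    admin_key="<admin-key>",
)
print(request.get_method(), request.full_url)
```

With real values substituted, the request could be sent with `urllib.request.urlopen(request)`.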