Hi @Quentin Vondel,
Thank you for contacting us regarding the issue with your Azure Cognitive Search indexer reprocessing previously indexed blob files when new documents are uploaded. This is expected behavior in Azure Cognitive Search when using Blob Storage change detection.
Your indexer reprocesses previously indexed PDFs because Azure uses each blob's LastModified timestamp to detect changes. When files are uploaded in quick succession, their timestamps can collide and the high-water mark rewinds, so a number of recently indexed files are processed again.
Why Does the Issue Start With the 3rd Document?
This behavior is consistent with Microsoft’s description that:
- The indexer evaluates blobs in lexicographic order.
- Closely timed uploads may share similar timestamps.
- When the high-water mark lands between blob timestamps, blobs before or after the boundary get re-indexed.
The issue may lie on the Azure AI Search side, in how the HighWaterMarkChangeDetectionPolicy handles blob timestamps during rapid uploads, which causes older files to be reprocessed.
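To make the high-water mark idea concrete, here is a deliberately simplified model in Python. It is not the service's actual implementation: the function `select_changed`, the blob names, and the timestamps are all hypothetical, and the only rule assumed is the one described in this answer (a blob is picked up when its LastModified is newer than the stored mark).

```python
from datetime import datetime, timezone


def select_changed(blobs, high_water_mark):
    """Return (names to process, new high-water mark) for one simulated run.

    Simplified model: a blob is processed when its LastModified value is
    strictly newer than the stored high-water mark.
    """
    changed = sorted(
        (item for item in blobs.items() if item[1] > high_water_mark),
        key=lambda item: (item[1], item[0]),  # by timestamp, then by name
    )
    names = [name for name, _ in changed]
    new_mark = max((stamp for _, stamp in changed), default=high_water_mark)
    return names, new_mark


def ts(second):
    """Hypothetical LastModified value with one-second resolution."""
    return datetime(2024, 1, 1, 12, 0, second, tzinfo=timezone.utc)


# b.pdf and c.pdf were uploaded so close together that they share a timestamp.
blobs = {"a.pdf": ts(0), "b.pdf": ts(1), "c.pdf": ts(1), "d.pdf": ts(5)}

start = datetime.min.replace(tzinfo=timezone.utc)
first_run, mark = select_changed(blobs, start)   # everything is new
second_run, mark = select_changed(blobs, mark)   # nothing changed
rewound_run, _ = select_changed(blobs, ts(0))    # mark moved back to an older value

print(first_run, second_run, rewound_run)
```

In this model a run with a stable mark processes nothing, while a mark that falls back to an earlier timestamp brings already indexed blobs into the next run, which matches the symptom you are seeing.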
Please follow the recommended steps below:
- Explicitly define change detection using metadata_storage_last_modified
- Upload files with more time spacing
- Use scheduled runs instead of manual runs
- Recreate the indexer if the initial configuration lacked change/delete detection policies.
- Use Azure Monitor + skillset logs for precise tracking.
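As a rough sketch of the first and third steps, the Python below builds the JSON bodies you could send when creating the data source and the indexer. The resource names, the container name, the connection string placeholder, the target index name, and the 15-minute interval are assumptions for illustration only; please adapt them to your environment and confirm the property names against the REST reference for your API version.

```python
import json


def build_blob_data_source(name, container, connection_string):
    """Data source body with an explicit high-water mark change detection policy."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": "metadata_storage_last_modified",
        },
    }


def build_scheduled_indexer(name, data_source_name, index_name, interval="PT15M"):
    """Indexer body that runs on a schedule (ISO 8601 duration) instead of manually."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "schedule": {"interval": interval},
    }


# Hypothetical names, used only to show the resulting shape.
data_source = build_blob_data_source("pdf-blobs", "documents", "<storage-connection-string>")
indexer = build_scheduled_indexer("pdf-indexer", "pdf-blobs", "pdf-index")

print(json.dumps(data_source, indent=2))
print(json.dumps(indexer, indent=2))
```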
Azure AI Search provides built-in logging for processed files, times, and triggers, but it's not exhaustive by default. Here's how to set it up properly:
Execution History (Portal):
Go to Search service > Indexers > your indexer > Execution details. View run time, duration, documents processed, successes/failures, and base64-encoded document keys (decode to get file names). Change detection is inferred when a blob’s LastModified is newer than the indexer’s high-water mark.
Enable Diagnostics: In Search service > Monitoring > Diagnostic settings, add a setting. Send logs to Log Analytics, Storage, or Event Hub and enable ExecutionAndOperations and AllLogs.
Query Logs (Log Analytics): Use KQL to see indexer runs, processed documents, and timestamps. DocumentKey is base64-encoded and can be decoded to blob paths.
AzureDiagnostics
| where ResourceType == "SEARCHINDEXER"
| where OperationName startswith "IndexerExecution"
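For decoding the document keys, a small Python helper along these lines may help. It assumes the keys were produced by the indexer's base64Encode mapping function with its default URL token encoding, where the trailing character is a digit giving the number of padding characters that were removed. If your field mapping uses different settings, the decoding will differ. The function names and the blob URL are made up for the example.

```python
import base64


def encode_url_token(text: str) -> str:
    """Encode text as URL-safe base64 with the padding replaced by a count digit."""
    raw = base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")
    stripped = raw.rstrip("=")
    return stripped + str(len(raw) - len(stripped))


def decode_url_token(key: str) -> str:
    """Decode a key whose last character is the number of stripped '=' characters."""
    if not key:
        return ""
    pad_count = int(key[-1])  # raises ValueError if the key is not in this format
    body = key[:-1]
    return base64.urlsafe_b64decode(body + "=" * pad_count).decode("utf-8")


# Hypothetical blob URL; the round trip shows the decoder reverses the encoder.
blob_url = "https://myaccount.blob.core.windows.net/documents/report.pdf"
key = encode_url_token(blob_url)
decoded = decode_url_token(key)
print(key)
print(decoded)
```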
REST API: Call GET /indexers/{indexer-name}/status to see last run details and the high-water mark.
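If you prefer to script the status check, the sketch below shows one possible way to call the endpoint with the Python standard library and pull out the last run details. The API version, the service and indexer names, and the sample response values are assumptions for illustration; the field names follow the documented status response, but please verify them against the API version you use.

```python
import json
import urllib.request

API_VERSION = "2024-07-01"  # assumption: replace with the version your service uses


def status_url(service: str, indexer: str) -> str:
    """Build the indexer status URL for a search service."""
    return (
        f"https://{service}.search.windows.net"
        f"/indexers/{indexer}/status?api-version={API_VERSION}"
    )


def fetch_status(service: str, indexer: str, api_key: str) -> dict:
    """Call the status endpoint with an admin API key and return the parsed JSON."""
    request = urllib.request.Request(
        status_url(service, indexer), headers={"api-key": api_key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_last_run(status: dict) -> dict:
    """Extract the last run outcome and the tracking state (high-water mark)."""
    last = status.get("lastResult") or {}
    return {
        "status": last.get("status"),
        "start": last.get("startTime"),
        "end": last.get("endTime"),
        "processed": last.get("itemsProcessed"),
        "failed": last.get("itemsFailed"),
        "initial_tracking_state": last.get("initialTrackingState"),
        "final_tracking_state": last.get("finalTrackingState"),
    }


# Made-up response, only to show what the summary looks like without a network call.
sample = {
    "status": "running",
    "lastResult": {
        "status": "success",
        "startTime": "2024-01-01T12:00:00Z",
        "endTime": "2024-01-01T12:00:30Z",
        "itemsProcessed": 3,
        "itemsFailed": 0,
        "initialTrackingState": None,
        "finalTrackingState": "<opaque tracking state>",
    },
}
summary = summarize_last_run(sample)
print(summary)
```

Comparing the initial and final tracking state across consecutive runs is one way to see whether the high-water mark moved forward or fell back before a run that reprocessed older files.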
Reference:
https://learn.microsoft.com/en-us/azure/search/search-how-to-create-indexers?tabs=portal
https://learn.microsoft.com/en-us/azure/search/search-how-to-index-azure-blob-changed-deleted?tabs=portal
https://learn.microsoft.com/en-us/azure/search/search-indexer-troubleshooting
https://learn.microsoft.com/en-us/azure/search/enrichment-cache-how-to-manage
Kindly let us know if the above comment helps or you need further assistance on this issue.
Please "upvote" if the information helped you. This will help us and others in the community as well.