Thanks for your question. Just jumping in—even though this question was asked a while ago, the issue is still relevant today. Here's something that might help others too!
The issue you're facing—where deleting documents based on a date_ymd filter ends up deleting more than expected—is a common challenge when working with string-based date fields in Azure Cognitive Search. Let's walk through the reasons and the recommended solutions.
The root cause of the problem is that your current implementation uses a string field (date_ymd) to represent date values. String comparisons are lexicographical, not chronological. This means:
"2024-10-01" < "2024-2-01" # True (because "1" < "2"), even though October comes after February
If any of your dates are not zero-padded (e.g., 2024-2-01 instead of 2024-02-01), your range filter will misbehave and match documents you did not intend to delete.
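Plain Python compares strings character by character as well, so it can illustrate the same effect outside of the search service. This is only a minimal sketch, and the sample values are made up for illustration:

```python
from datetime import date

padded = ["2024-02-01", "2024-10-01"]
unpadded = ["2024-2-01", "2024-10-01"]

# Zero-padded strings sort in chronological order.
padded_sorted = sorted(padded)

# Unpadded strings do not: "2024-10-01" sorts before "2024-2-01"
# because the character "1" is less than the character "2".
unpadded_sorted = sorted(unpadded)

# Real date objects always compare chronologically.
dates_in_order = date(2024, 2, 1) < date(2024, 10, 1)

print(padded_sorted)
print(unpadded_sorted)
print(dates_in_order)
```

With the unpadded values, a "less than or equal to February" style comparison would also match the October entry, which is the kind of over-deletion described in the question.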
Recommended Solutions:
- Use Proper Date Type in Your Index
To accurately compare dates, your date_ymd field should be stored as Edm.DateTimeOffset, which is the standard datetime type supported by Azure Cognitive Search.
Update your index schema:
{
  "name": "date_ymd",
  "type": "Edm.DateTimeOffset",
  "filterable": true,
  "sortable": true
}
Send your date field in ISO 8601 format:
{
  "date_ymd": "2024-12-02T00:00:00Z"
}
- Updated Python filter:
query = f"date_ymd le {date_ymd}T00:00:00Z"
Reference: Use structured data types in Azure AI Search
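If your existing values are plain date strings, a small helper can normalize them to the ISO 8601 form shown above and build the filter in one place. The following is a rough sketch, not a definitive implementation: the names to_iso8601_utc and build_date_filter are hypothetical, and it assumes every date should be treated as midnight UTC.

```python
from datetime import datetime, timezone


def to_iso8601_utc(date_str: str) -> str:
    """Convert 'YYYY-MM-DD' (padded or not) to 'YYYY-MM-DDT00:00:00Z'.

    Hypothetical helper; assumes the date means midnight UTC.
    Raises ValueError for empty or malformed input.
    """
    parsed = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def build_date_filter(date_str: str, field: str = "date_ymd") -> str:
    """Build an OData filter for an Edm.DateTimeOffset field.

    DateTimeOffset literals are written without quotes in the filter.
    """
    return f"{field} le {to_iso8601_utc(date_str)}"


print(to_iso8601_utc("2024-2-01"))
print(build_date_filter("2024-12-02"))
```

The same to_iso8601_utc function could be applied to each existing document before re-uploading it to the new field, since strptime accepts both 2024-2-01 and 2024-02-01.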
- Continue with String Field (Only if Absolutely Necessary)
If you're unable to change your schema right now, ensure:
All dates are formatted as YYYY-MM-DD
No inconsistent values like 2024-2-01, "", or null
To make your deletion safer:
Add a ne '' clause to avoid blank entries
Inspect results before deletion
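One way to check the formatting rules above before running any delete is to validate the values on the client side. This is a sketch under assumptions: is_valid_ymd is a hypothetical helper name, and the sample values stand in for whatever you read back from your index.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits: rejects unpadded values.
_YMD_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_ymd(value) -> bool:
    """Return True only for zero-padded, real calendar dates in YYYY-MM-DD form."""
    if not isinstance(value, str) or not _YMD_PATTERN.fullmatch(value):
        return False
    try:
        # Catches impossible dates such as month 13.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Illustrative values only; replace with date_ymd values from your documents.
samples = ["2024-02-01", "2024-2-01", "", None, "2024-13-01"]
invalid = [s for s in samples if not is_valid_ymd(s)]
print(invalid)
```

Any value that ends up in invalid is a candidate for cleanup before you rely on a string range filter.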
Improved Code Sample
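from azure.core.exceptions import HttpResponseError

def delete_documents_from_search_client(date_ymd, chunk_size=32000):
    """
    Deletes documents from the search index where 'date_ymd' <= given date.
    Assumes date_ymd is in 'YYYY-MM-DD' format.
    Assumes search_client (a SearchClient) and chunk_data are defined elsewhere,
    and that hardware_id is the key field of the index.
    """
    filter_query = f"date_ymd le '{date_ymd}' and date_ymd ne ''"
    # Pass the expression as an OData filter. The first positional argument
    # of search() is the full-text search_text, not a filter.
    results = search_client.search(search_text="*", filter=filter_query, select=["hardware_id"])
    # A delete only needs the key field of each document.
    data_to_delete = [{"hardware_id": result["hardware_id"]} for result in results]
    print(f"Found {len(data_to_delete)} documents to delete.")
    for chunk in chunk_data(data_to_delete, chunk_size):
        try:
            # delete_documents applies the delete action itself;
            # upload_documents is meant for uploads, not deletes.
            search_client.delete_documents(documents=chunk)
            print(f"Deleted {len(chunk)} documents successfully.")
        except HttpResponseError as e:
            print(f"An error occurred during document delete: {e}")
    return None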
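The sample calls chunk_data, which is not defined in this answer. If you do not already have it, a minimal version might look like the sketch below. The signature is an assumption inferred from how the function is called above.

```python
def chunk_data(items, chunk_size):
    """Yield successive slices of at most chunk_size items.

    Hypothetical helper; the name and signature are inferred from the
    call in the deletion sample, not taken from any library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Small illustration with made-up keys.
batches = list(chunk_data(["a", "b", "c", "d", "e"], 2))
print(batches)
```

Azure AI Search limits how many documents a single indexing request can carry, so it is worth checking the current service limits before settling on a chunk_size value.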
As a next step, please consider migrating to Edm.DateTimeOffset if possible, for better control and reliability in your filtering logic. Let me know if you'd like help with schema migration or sample scripts to convert existing date strings to datetime format.
Regards,
Chakravarthi Rangarajan Bhargavi