How to delete specific documents from a search index?

Aravind Vijayaraghavan 40 Reputation points
2025-01-22T10:42:29.4966667+00:00

I tried to delete a few documents beyond a certain date using my date field, but it ended up deleting many more. I realised it's because my date field holds plain string values. How do I delete specific documents based on greater-than or less-than comparisons for certain fields, or for string fields specifically? All my fields are strings and vectors. This is my current code:

def chunk_data(data, chunk_size):
    """Helper function to chunk data into smaller pieces."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]


def delete_documents_from_search_client(date_ymd, chunk_size=32000):
    """Deletes documents from the search index with 'date_ymd' equal or below the specified date."""
    query = f"date_ymd le '{date_ymd}'"
    results = search_client.search(query)  
    
    data_to_delete = []

    for result in results:
        document_id = result["hardware_id"]
        data_to_delete.append({
            "@search.action": "delete",
            "hardware_id": document_id
        })

    for chunk in chunk_data(data_to_delete, chunk_size):
        try:
            result = search_client.upload_documents(documents=chunk)
            print(f"Deleted {len(chunk)} documents successfully.")
        except HttpResponseError as e:
            print(f"An error occurred during document delete: {e}")
            return None

delete_documents_from_search_client("2024-12-02")
Azure AI Search

An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. Chakaravarthi Rangarajan Bhargavi 1,280 Reputation points MVP
    2025-04-21T03:30:30.8933333+00:00

    Hi Aravind Vijayaraghavan,

    Thanks for your question. Although it was asked a while ago, the issue is still relevant today, so here's something that might help others too.

    The issue you're facing, where deleting documents based on a date_ymd filter removes more than expected, is a common challenge when working with string-based date fields in Azure AI Search (formerly Azure Cognitive Search). Let's walk through the reasons and the recommended solutions.

    The root cause is that your current implementation uses a string field (date_ymd) to represent date values. String comparisons are lexicographical, not chronological. This means:

    "2024-10-01" < "2024-2-01"  # True (because "1" < "2"), although October comes after February
    

    If any of your dates are not zero-padded (i.e., 2024-2-01 instead of 2024-02-01), your range filter will misbehave and include unintended documents in the result.
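A quick way to see the difference is to compare the same two dates as strings and as real date objects. This is a plain Python sketch with no Azure calls; the variable names are only for illustration.

```python
from datetime import date

# Lexicographic comparison goes character by character, so the
# non-padded "2" in "2024-2-01" outranks the "1" in "2024-10-01".
string_says_oct_before_feb = "2024-10-01" < "2024-2-01"

# Real date objects compare chronologically.
date_says_oct_before_feb = date(2024, 10, 1) < date(2024, 2, 1)

# Zero-padded strings sort the same way as the dates they encode.
padded_says_oct_before_feb = "2024-10-01" < "2024-02-01"

print(string_says_oct_before_feb)   # True
print(date_says_oct_before_feb)     # False
print(padded_says_oct_before_feb)   # False
```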

    Recommended Solutions:

    1. Use Proper Date Type in Your Index

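    To accurately compare dates, your date_ymd field should be stored as Edm.DateTimeOffset, which is the standard datetime type supported by Azure AI Search. Note that the type of an existing field cannot be changed in place, so this means adding a new field or rebuilding the index and reloading the documents.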

    Update your index schema:

    {
      "name": "date_ymd",
      "type": "Edm.DateTimeOffset",
      "filterable": true,
      "sortable": true
    }
    

    Send your date field in ISO 8601 format:

    {
      "date_ymd": "2024-12-02T00:00:00Z"
    }
    
    Updated Python filter (the date literal is not quoted, and the expression is passed as the filter argument of search()):

    query = f"date_ymd le {date_ymd}T00:00:00Z"
      

    Reference: Use structured data types in Azure AI Search
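If the stored values are currently strings, the conversion to ISO 8601 and the unquoted filter literal can be sketched in plain Python. The helper names `to_iso_utc` and `build_date_filter` are made up for this illustration, and treating each date as midnight UTC is an assumption; adjust it if your data carries a different time zone.

```python
from datetime import datetime, timezone


def to_iso_utc(date_string):
    """Convert 'YYYY-MM-DD' (or non-padded 'YYYY-M-D') to ISO 8601 at midnight UTC.

    Returns None for empty or unparseable values so they can be reviewed by hand.
    """
    if not date_string:
        return None
    try:
        parsed = datetime.strptime(date_string.strip(), "%Y-%m-%d")
    except ValueError:
        return None
    # Assumption: the stored dates carry no time zone, so midnight UTC is used.
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_date_filter(date_ymd, field="date_ymd"):
    """Build an OData filter such as: date_ymd le 2024-12-02T00:00:00Z"""
    iso_value = to_iso_utc(date_ymd)
    if iso_value is None:
        raise ValueError(f"Not a valid date: {date_ymd!r}")
    # DateTimeOffset literals are written without quotes in the filter.
    return f"{field} le {iso_value}"


print(to_iso_utc("2024-2-01"))
print(build_date_filter("2024-12-02"))
```

The string returned by `build_date_filter` would be passed as the `filter` argument of `search_client.search`, not as the search text.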

    2. Continue with String Field (Only if Absolutely Necessary)

    If you're unable to change your schema right now, ensure that:

    • All dates are formatted as zero-padded YYYY-MM-DD
    • There are no inconsistent values such as 2024-2-01, "", or null

    To make your deletion safer:

    • Add a ne '' clause to avoid blank entries
    • Inspect results before deletion
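One way to inspect results before deleting is to split the hits into well-formed and suspect dates and review the second group by hand. This is a sketch in plain Python that assumes each search result behaves like a dict; `split_by_date_format` is a hypothetical helper name, and the sample records are invented for illustration.

```python
import re

# Strict zero-padded YYYY-MM-DD. This checks the shape only, so an
# impossible date such as 2024-02-31 would still pass.
YMD_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")


def split_by_date_format(documents, field="date_ymd"):
    """Return (well_formed, suspect) lists based on the date field's format."""
    well_formed, suspect = [], []
    for doc in documents:
        value = doc.get(field)
        if isinstance(value, str) and YMD_PATTERN.fullmatch(value):
            well_formed.append(doc)
        else:
            suspect.append(doc)
    return well_formed, suspect


# Invented sample records standing in for search results.
sample = [
    {"hardware_id": "a1", "date_ymd": "2024-11-30"},
    {"hardware_id": "b2", "date_ymd": "2024-2-01"},
    {"hardware_id": "c3", "date_ymd": ""},
    {"hardware_id": "d4", "date_ymd": None},
]
well_formed, suspect = split_by_date_format(sample)
print([d["hardware_id"] for d in well_formed])  # ['a1']
print([d["hardware_id"] for d in suspect])      # ['b2', 'c3', 'd4']
```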

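    Improved Code Sample

    This version passes the expression through the filter parameter (the first positional argument of search() is the full-text query, not a filter), deletes with delete_documents instead of upload_documents, and keeps each batch within the 1,000-document limit of a single indexing request. Run the search on its own first and check the hits before deleting anything.

    from azure.core.exceptions import HttpResponseError

    # search_client and chunk_data are defined as in the question.

    def delete_documents_from_search_client(date_ymd, chunk_size=1000):
        """
        Deletes documents from the search index where 'date_ymd' <= given date.
        Assumes date_ymd is in 'YYYY-MM-DD' format.
        """
        # OData expression; it must go in filter=, because the first
        # positional argument of search() is treated as search text.
        filter_expr = f"date_ymd le '{date_ymd}' and date_ymd ne ''"
        results = search_client.search(
            search_text="*", filter=filter_expr, select=["hardware_id"]
        )

        # Only the key field is needed to delete a document.
        data_to_delete = [{"hardware_id": r["hardware_id"]} for r in results]

        print(f"Found {len(data_to_delete)} documents to delete.")

        for chunk in chunk_data(data_to_delete, chunk_size):
            try:
                outcome = search_client.delete_documents(documents=chunk)
                deleted = sum(1 for item in outcome if item.succeeded)
                print(f"Deleted {deleted} of {len(chunk)} documents.")
            except HttpResponseError as e:
                print(f"An error occurred during document delete: {e}")
                return None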
    

    As a next step, please migrate to Edm.DateTimeOffset if possible, for better control and reliability in your filtering logic. Let me know if you'd like help with schema migration or sample scripts to convert existing date strings to datetime format.

    Regards,

    Chakravarthi Rangarajan Bhargavi
