How do I delete all records where my date_ymd column in Azure Search Index is equal to a specific date?

Question

Aravind Vijay 20

Hi, I have an issue where a scheduled script collects a lot of data and stores it in an Azure Search Index that I use as a vector DB. I then use RAG to retrieve data from the index based on the prompt a user sends in a chatbot, and the AI's response is based on the top N documents retrieved.

My first question is whether there's a way to work around using only the top N documents from the Azure Search Index. I get the top 50 documents for a user's prompt and feed them to my chatbot's system prompt. Is there any way to link my Streamlit chatbot to the Azure Search Index directly, without feeding it only N documents?

Secondly, and more importantly, I need code to delete all documents with specific date_ymd values. Keep in mind that all my columns and keys are string type, not date type. Can you help me create a script to delete documents that have a certain string date?

This is my code for uploading documents:

def chunk_data(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]


def upload_documents_to_search_client(df, embeddings_dict, chunk_size=32000):
    """Uploads documents with embeddings to the search client in chunks."""
    data = [
        {
            "@search.action": "mergeOrUpload",
            "hardware_id": str(row["hardware_id"]),
            "text_feedback": str(row["text_feedback"]) if "text_feedback" in row else "",
            "uninstall_text_feedback": str(row["uninstall_text_feedback"]) if "uninstall_text_feedback" in row else "",
            "os": str(row["os"]) if "os" in row else "",
            "date_ymd": str(row["date_ymd"]) if "date_ymd" in row else "",
            "Feature_Category": str(row["Feature_Category"]) if "Feature_Category" in row else "",
            "Sentiment": str(row["Sentiment"]) if "Sentiment" in row else "",
            "country": str(row["country"]) if "country" in row else "",
            "aiid": str(map_aiid_to_label(row["aiid"])) if "aiid" in row else "",
            "version_app": str(row["version_app"]) if "version_app" in row else "",
            "os_version": str(row["version"]) if "version" in row else "",
            "architecture": str(row["architecture"]) if "architecture" in row else "",
            "score": str(row["score"]) if "score" in row else "",
            "region": str(row["region"]) if "region" in row else "",
            "city": str(row["city"]) if "city" in row else "",
            "vector_text_feedback": next(
                (item["embeddings"].get("vector_text_feedback", []) for item in embeddings_dict if item["hardware_id"] == str(row["hardware_id"])),
                []
            ),
            "vector_uninstall_feedback": next(
                (item["embeddings"].get("vector_uninstall_feedback", []) for item in embeddings_dict if item["hardware_id"] == str(row["hardware_id"])),
                []
            )
        }
        for _, row in df.iterrows()
    ]
    for chunk in chunk_data(data, chunk_size):
        try:
            result = search_client.upload_documents(documents=chunk)
            print(f"Uploaded {len(chunk)} documents successfully.")
        except Exception as e:
            print(f"An error occurred during document upload: {e}")
            return None

Bhargavi Naragani 5,270 Reputation points Microsoft External Staff Moderator

2025-04-04T12:28:36.8633333+00:00

Hi @Aravind Vijay,
We haven't heard back from you on the last response and are just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, respond with more details and we will try to help.
Bhargavi Naragani 5,270 Reputation points Microsoft External Staff Moderator

2025-04-07T02:22:57.5233333+00:00

Hi @Aravind Vijay,
We haven't heard back from you on the last response and are just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, respond with more details and we will try to help.

Accepted answer

Answer 1

Bhargavi Naragani 5,270 Microsoft External Staff Moderator

Hi @Aravind Vijay,

Currently, Azure Cognitive Search retrieves a specified number of top documents based on relevance. Directly integrating your Streamlit chatbot with the Azure Search Index to access more than the top N documents isn't natively supported. However, you can implement pagination to retrieve additional documents beyond the initial set. This involves making successive search queries with appropriate skip and top parameters to navigate through the result set. By aggregating these results, you can provide your chatbot with a broader context. https://learn.microsoft.com/en-us/azure/search/search-pagination-page-layout

To remove documents where the date_ymd field matches a specific date, you'll need to perform a two-step process. Since Azure Cognitive Search requires the document's key field (e.g., hardware_id) for deletion, you must first query the index to obtain the keys of documents matching your date_ymd criteria. Once you have the list of keys, you can issue delete operations for those specific documents.

Here's how you can implement this in Python using the Azure Search SDK:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Initialize the SearchClient
service_endpoint = "https://<your-service-name>.search.windows.net"
index_name = "<your-index-name>"
api_key = "<your-api-key>"
search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(api_key))


def delete_documents_by_date(date_ymd):
    # Step 1: Retrieve the keys of documents with the specified date_ymd.
    # The date_ymd field must be marked filterable in the index for this filter to work.
    filter_expression = f"date_ymd eq '{date_ymd}'"
    results = search_client.search(search_text="*", filter=filter_expression, select=["hardware_id"])

    # Step 2: Collect the keys of the documents to be deleted
    documents_to_delete = [{"hardware_id": doc["hardware_id"]} for doc in results]
    if not documents_to_delete:
        print("No documents found with the specified date.")
        return

    # Step 3: Delete the documents in batches
    batch_size = 1000  # Adjust batch size as needed (1,000 is the per-batch maximum)
    for i in range(0, len(documents_to_delete), batch_size):
        batch = documents_to_delete[i:i + batch_size]
        # delete_documents applies the delete action itself, so each item only needs the key.
        # upload_documents would send these items as uploads rather than deletions.
        search_client.delete_documents(documents=batch)
        print(f"Deleted batch of {len(batch)} documents.")

Since your date_ymd field is of string type, make sure the date in your query has exactly the same format as the values stored in your index (for example, 'YYYY-MM-DD'). Azure AI Search limits an indexing batch to 1,000 documents, so execute deletions in batches no larger than that. Deletions are processed asynchronously, so there may be a slight delay before the updates are reflected in the index.
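The format point above can be enforced before any query is sent. The following is a minimal sketch, assuming the index stores dates as zero-padded YYYY-MM-DD strings and that date_ymd is marked filterable; the helper names odata_string_literal and build_date_filter are hypothetical and not part of the Azure SDK.

```python
from datetime import datetime


def odata_string_literal(value: str) -> str:
    """Quote a value as an OData string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_date_filter(date_ymd: str, field: str = "date_ymd") -> str:
    """Return a filter matching documents whose string field equals the given date."""
    # Parse and re-format so that "2025-4-1" becomes "2025-04-01";
    # an invalid date raises ValueError before any query is sent.
    # Assumption: the index stores dates as zero-padded YYYY-MM-DD strings.
    normalized = datetime.strptime(date_ymd, "%Y-%m-%d").date().isoformat()
    return f"{field} eq {odata_string_literal(normalized)}"


print(build_date_filter("2025-4-1"))
```

The returned string can be passed as the filter argument of search_client.search. If your index stores dates in another format, change the parsing and formatting pattern to match it.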

Refer to the Azure AI Search documentation on adding, updating, or deleting documents for better understanding.

Hope the information above helps you resolve the issue. If you have any further concerns or queries, please feel free to reach out to us.

Aravind Vijay 20 Reputation points

2025-05-02T07:33:59.18+00:00

Thank you so much for your response, I'll have to try out your solution and see if I'm able to resolve the issue
Bhargavi Naragani 5,270 Reputation points Microsoft External Staff Moderator

2025-05-05T04:44:33.2566667+00:00

Hi Aravind Vijay, Any update?

Aravind Vijay 20

Yes, I'm not sure skip and top can be used in my case, as I understand they are for sets of pages, but my search index is entirely JSON data. I can share how I've used RAG; could you help with adding the pagination?

query = f"{prompt} {filter_values}".strip()
vector_query = VectorizedQuery(
    vector=generate_embeddings(query),
    k_nearest_neighbors=100,
    fields="vector_text_feedback,vector_uninstall_feedback",
    exhaustive=True,
)
results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    top=150,
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="name",
)
documents = [result for result in results]
filtered_data = [
    {k: v for k, v in doc.items() if not k.startswith("@search")}
    for doc in documents
]
filtered_data_json = json.dumps(filtered_data, indent=2)

Bhargavi Naragani 5,270 Reputation points Microsoft External Staff Moderator

2025-05-06T05:22:18.63+00:00
Hi Aravind Vijay,

Azure AI Search employs server-side paging to manage the volume of documents returned in a single query. The top parameter specifies the number of results to return, while the skip parameter indicates how many results to bypass. This mechanism allows you to paginate through large result sets efficiently.

In your current implementation, you're retrieving the top 150 documents. To access additional documents beyond this initial set, you can utilize the skip parameter in subsequent queries.

To paginate through your vector search results, you can modify your code to include the skip parameter. Here's how you can do it:

def fetch_paginated_results(query, vector_query, page_size=150, max_results=1000):
    all_results = []
    for skip in range(0, max_results, page_size):
        results = search_client.search(
            search_text=query,
            vector_queries=[vector_query],
            top=page_size,
            skip=skip,
            query_type=QueryType.SEMANTIC,
            semantic_configuration_name="name",
        )
        batch = [result for result in results]
        if not batch:
            break
        all_results.extend(batch)
    return all_results

In this function:

page_size determines how many results to retrieve per page.

max_results sets an upper limit on the total number of results to fetch.

The loop increments the skip value by page_size to paginate through the results.

Azure AI Search returns at most 1,000 results per request, because that is the maximum value for top, so keep page_size at or below 1,000. The skip parameter is capped separately at 100,000. Paginating through large datasets may impact performance, so monitor and optimize as needed.
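How the skip and top values advance from page to page can be checked without calling the service. The helper below is a hypothetical sketch, not part of the SDK; it assumes a per-request cap of 1,000 on top and only computes the pairs you would pass to search_client.search.

```python
from typing import Iterator, Tuple

MAX_TOP = 1000  # assumed per-request cap on top


def page_windows(max_results: int, page_size: int = 150) -> Iterator[Tuple[int, int]]:
    """Yield (skip, top) pairs that cover the first max_results ranked matches."""
    if not 1 <= page_size <= MAX_TOP:
        raise ValueError(f"page_size must be between 1 and {MAX_TOP}")
    for skip in range(0, max_results, page_size):
        # Trim the last page so the total never exceeds max_results.
        yield skip, min(page_size, max_results - skip)


for skip, top in page_windows(400):
    print(f"skip={skip} top={top}")
```

For max_results=400 and the default page size, this yields three windows: (0, 150), (150, 150) and (300, 100). Trimming the last window avoids fetching more documents than were asked for.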

Kindly refer to the documentation below for better understanding:
Shape search results in Azure AI Search
Vector search overview

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.
Aravind Vijay 20 Reputation points

2025-05-14T11:36:51.79+00:00

I have one last doubt. Right now I have more than 200k records in my search index. So when I run a query, is it still impossible for it to skip through and find the top 150 of all of them, given that you said the max limit is 1,000?
Bhargavi Naragani 5,270 Reputation points Microsoft External Staff Moderator

2025-05-14T14:14:49.11+00:00
Hi Aravind Vijay,

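Even if your index contains over 200,000 documents, a single request returns at most 1,000 of them, because top is capped at 1,000. The skip parameter can move that window further down the ranked list (the service caps skip at 100,000), but each request still returns only one page. These limits are in place to ensure optimal performance and resource utilization.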

While the 1,000-document limit per query is a constraint, there are strategies to access a broader set of documents:

Azure AI Search provides continuation tokens for paginating through results beyond the initial 1,000 documents. By using these tokens, you can iteratively retrieve subsequent pages of results. This approach is particularly useful when processing large datasets in batches.

By applying filters and sorting criteria to your queries, you can segment your data into smaller, more manageable subsets. This allows you to perform multiple queries, each retrieving a different segment of your data, effectively circumventing the 1,000-document limit per query.

For scenarios requiring extensive data retrieval and analysis, integrating Azure Data Explorer with Azure AI Search can provide more advanced querying capabilities and handle larger datasets efficiently.

https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents
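The filter-based segmentation described above could look like the following sketch, which builds one filter per day so that each query only has to page through that day's documents. It assumes date_ymd is stored as a zero-padded YYYY-MM-DD string and is marked filterable; the function name daily_filters is hypothetical.

```python
from datetime import date, timedelta
from typing import List


def daily_filters(start: date, end: date, field: str = "date_ymd") -> List[str]:
    """Build one OData equality filter per calendar day in [start, end]."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    days = (end - start).days + 1
    return [
        # isoformat() gives zero-padded YYYY-MM-DD, the assumed storage format.
        f"{field} eq '{(start + timedelta(days=offset)).isoformat()}'"
        for offset in range(days)
    ]


for expression in daily_filters(date(2025, 4, 1), date(2025, 4, 3)):
    print(expression)
```

Each expression can be passed as the filter argument of a separate search_client.search call, and the per-day results combined afterwards.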

If this answers your query, do click Accept Answer for was this answer helpful to close this thread. And, if you have any further query do let us know.
Aravind Vijay 20 Reputation points

2025-06-03T13:06:38.7733333+00:00
So how exactly would I change the code to access the broader set of documents, since top and skip don't actually go directly beyond the first 1,000 documents? Also, right now when I'm querying the data, it does bring back results from more than 1,000 documents ago (as in months ago), so I'm not sure what you meant by the 1,000-document limitation. So even with the solution you posted above:

def fetch_paginated_results(query, vector_query, page_size=150, max_results=1000):
    all_results = []
    for skip in range(0, max_results, page_size):
        results = search_client.search(
            search_text=query,
            vector_queries=[vector_query],
            top=page_size,
            skip=skip,
            query_type=QueryType.SEMANTIC,
            semantic_configuration_name="name",
        )
        batch = [result for result in results]
        if not batch:
            break
        all_results.extend(batch)
    return all_results

Does this still not actually go through the 200k documents using skip?

Once again thank you for your continued help and replies
Bhargavi Naragani 5,270 Reputation points Microsoft External Staff Moderator

2025-06-03T14:51:03.0933333+00:00
Azure AI Search employs server-side paging to manage large result sets efficiently. The top parameter specifies the number of results to return, while the skip parameter indicates how many results to bypass. However, a single request returns at most 1,000 documents, because that is the maximum value for top, and skip cannot exceed 100,000. This means that even if your index contains over 200,000 documents, each request returns one page of at most 1,000 ranked matches, and top and skip together cannot page past roughly the first 100,000. These limits are in place to ensure optimal performance and resource utilization.

You mentioned that your queries return documents from months ago, which seems to contradict the 1,000-document limitation. This is possible because Azure AI Search ranks documents based on relevance to the query, not their position in the index. Therefore, even older documents can appear in the top results if they are deemed highly relevant.

If you need to access more than 1,000 documents, as mentioned in my previous response consider the following approaches:

Azure AI Search provides continuation tokens for paginating through results beyond the initial 1,000 documents. By using these tokens, you can iteratively retrieve subsequent pages of results.

By applying filters and sorting criteria to your queries, you can segment your data into smaller, more manageable subsets. This allows you to perform multiple queries, each retrieving a different segment of your data, effectively circumventing the 1,000-document limit per query.

For scenarios requiring extensive data retrieval and analysis, integrating Azure Data Explorer with Azure AI Search can provide more advanced querying capabilities and handle larger datasets efficiently.

Here's an example of how you might page through results in your Python code. Note that search() in the Python SDK does not take a continuation_token argument; the object it returns follows the service's continuation information for you as you iterate:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Initialize the SearchClient
search_client = SearchClient(
    endpoint="https://<your-service-name>.search.windows.net",
    index_name="<your-index-name>",
    credential=AzureKeyCredential("<your-api-key>"),
)

# Define your search parameters; top is left unset so the service pages the results
search_text = "*"
results = []
response = search_client.search(search_text=search_text)

for page in response.by_page():
    # Each page is an iterator over the documents in one service response
    results.extend(page)

# 'results' now contains all retrieved documents

In this code, by_page() yields one page per service response and requests the next page until the service reports that no more results are available, so the loop is not limited to a single page of results.

Shape search results in Azure AI Search
Search Documents - Azure AI Search REST API

If this answers your query, do click Accept Answer and Yes for was this answer helpful to close up this issue. And, if you have any further query do let us know.
