How to avoid unnecessarily re-embedding documents in Azure AI Search?

Noah Pursell 0 Reputation points
2025-01-10T04:06:45.4266667+00:00

Hello all!

I am using Azure AI Search to store some vectorized documents. In my use case, I will receive a new set of documents periodically. I want to add these to my Azure AI Search index. However, there is a high probability that some of these documents are already in the index. I am wondering if it is possible to only add the documents that are not already in the index (primarily to save time).

I do not see any built-in function to do this (I am mainly using Python/langchain). I also do not see any easy way to get a list of all document IDs from an index (this would allow me to do the filtering locally, and only push documents whose ID is not in the retrieved IDs).

Does anyone have any suggestions? It would be much appreciated!

Azure AI Search

1 answer

  1. Leo Visser 326 Reputation points MVP
    2025-01-14T06:13:50.2933333+00:00

    You can retrieve the index values with this REST call (there may also be an SDK call for it; I personally use the REST API):
    https://learn.microsoft.com/en-us/rest/api/searchservice/indexes/get?view=rest-searchservice-2024-07-01&tabs=HTTP

    You can use this to check which documents are already present.
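
A minimal Python sketch of the local filtering the question describes, assuming the `azure-search-documents` SDK and a key field named `id` (both are assumptions; adjust them to your index). As far as I know, the Indexes - Get operation linked above returns the index definition rather than its documents, so this sketch lists the keys through an ordinary search query instead. `fetch_existing_ids` and `filter_new_documents` are hypothetical helper names, not part of any library.

```python
from __future__ import annotations

from typing import Any, Iterable


def fetch_existing_ids(search_client: Any, key_field: str = "id") -> set[str]:
    """Return the keys of the documents currently in the index.

    `search_client` is expected to behave like
    azure.search.documents.SearchClient: search() yields dict-like results.
    """
    # "*" matches every document; select limits the payload to the key field.
    results = search_client.search(search_text="*", select=[key_field])
    return {doc[key_field] for doc in results}


def filter_new_documents(
    docs: Iterable[dict], existing_ids: set[str], key_field: str = "id"
) -> list[dict]:
    """Keep only the documents whose key is not already in the index."""
    return [doc for doc in docs if doc[key_field] not in existing_ids]


# Example wiring (endpoint, index name and API key are placeholders):
#
#   from azure.core.credentials import AzureKeyCredential
#   from azure.search.documents import SearchClient
#
#   client = SearchClient("https://<service>.search.windows.net",
#                         "<index-name>", AzureKeyCredential("<api-key>"))
#   new_docs = filter_new_documents(incoming_docs, fetch_existing_ids(client))
#   # Embed and upload only new_docs.
```

This only saves work if document IDs are stable across batches, for example derived from the source path or a hash of the content. For very large indexes, paging limits on search queries may prevent a single query from returning every key; looking up each incoming ID individually with the client's `get_document` method may be a workable alternative in that case.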
