Mismatch in Similarity Search Results, OpenAI vs. Azure OpenAI Embeddings

Question

firi berhane 0

We have deployed the "text-embedding-ada-002" embedding model on our Azure resources and are using them in our codebase to create an AzureOpenAIEmbedding instance for similarity search and document retrieval. The similarity search is conducted on a vector store where the embeddings were generated using an OpenAI embedding instance, which is not from Azure.

Although the implementation runs without errors, the retrieved documents are not similar to the search query. This issue did not occur when we were using OpenAI embeddings directly. We used the same model, "text-embedding-ada-002" with the same dimension value of 1536, but from different model providers—OpenAI vs. Azure OpenAI.

Why is this happening?

1 answer

Answer 1

Hi @firi berhane ,

Welcome to Microsoft Q&A, and thank you for posting your question here.

The issue you're experiencing could be due to several factors:

Differences in Implementation
Vector Store Configuration
Normalization and Preprocessing
Network and Latency Issues

To address this issue, you can try the following steps:

Verify Consistency: Ensure that the same preprocessing steps are applied to the text before generating embeddings in both implementations.
1. Verify the exact model version used in your Azure OpenAI deployment. Check the model parameter in your code and ensure it matches the deployment name in Azure (e.g., deployment_name="your-azure-deployment-name").
2. Ensure consistency: Normalize embeddings from both OpenAI and Azure OpenAI if your vector store uses cosine similarity.
```
from sklearn.preprocessing import normalize

# `embeddings` is a 2-D array-like of shape (n_texts, 1536)
embeddings = normalize(embeddings, norm="l2")  # scale each row to unit L2 norm
```
Check Vector Store Configuration: The vector store (e.g., FAISS, Azure Cognitive Search, Pinecone) might be configured with a distance metric (e.g., cosine, L2) that assumes normalized embeddings. If Azure OpenAI embeddings are unnormalized, similarity scores will be inconsistent.
1. Confirm the distance metric used by your vector store (e.g., cosine for normalized embeddings, L2 for unnormalized). Reindex the vector store with normalized embeddings if required.
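To illustrate how the distance metric and normalization interact, here is a minimal sketch using NumPy with made-up 3-dimensional vectors (not real embeddings). It checks whether vectors are unit length and shows that, for unit-length vectors, squared L2 distance equals 2 minus twice the cosine similarity, so both metrics rank neighbors the same way. The helper names `row_norms`, `is_unit_length`, and `l2_normalize` are illustrative only and do not come from any vector store API.

```python
import numpy as np

def row_norms(vectors):
    """Return the L2 norm of each row of a 2-D array."""
    return np.linalg.norm(np.asarray(vectors, dtype=float), axis=1)

def is_unit_length(vectors, tol=1e-3):
    """True if every row has an L2 norm within tol of 1."""
    return bool(np.all(np.abs(row_norms(vectors) - 1.0) < tol))

def l2_normalize(vectors):
    """Scale each row to unit L2 norm (rows must be non-zero)."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / row_norms(vectors)[:, None]

# Made-up vectors standing in for real 1536-dimensional embeddings.
raw = np.array([[3.0, 4.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 2.0]])
unit = l2_normalize(raw)

query, docs = unit[0], unit[1:]
cosine = docs @ query                        # cosine similarity for unit vectors
sq_l2 = np.sum((docs - query) ** 2, axis=1)  # squared Euclidean distance

print(is_unit_length(raw), is_unit_length(unit))
print(np.allclose(sq_l2, 2.0 - 2.0 * cosine))
```

If `is_unit_length` reports that the stored vectors and the query vectors are both already unit length, normalization alone is unlikely to explain the mismatch, and the other checks in this answer are worth prioritizing.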
Test with Sample Data: Test both services with identical input text and compare embeddings directly. Use a shared preprocessing pipeline (e.g., stripping whitespace, lowercasing) for consistency.
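A direct comparison could look something like the sketch below. It assumes you have already obtained one embedding for the same input string from each service; the placeholder lists `vec_openai` and `vec_azure` and the helper `compare_embeddings` are hypothetical names used only for illustration.

```python
import numpy as np

def compare_embeddings(vec_a, vec_b):
    """Compare two embeddings of the same text from different providers."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"dimension mismatch: {a.shape} vs {b.shape}")
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {
        "norm_a": float(np.linalg.norm(a)),
        "norm_b": float(np.linalg.norm(b)),
        "cosine": cosine,
        "max_abs_diff": float(np.max(np.abs(a - b))),
    }

# Placeholder vectors; in practice these would be the embeddings each
# service returns for the same input string.
vec_openai = [0.6, 0.8, 0.0]
vec_azure = [0.6, 0.8, 0.0]
report = compare_embeddings(vec_openai, vec_azure)
print(report)
```

A cosine value close to 1 would suggest the two services return compatible vectors for that text, so the cause may lie in the vector store or retrieval code. A noticeably lower value would suggest the stored vectors and the query vectors are not interchangeable, in which case re-indexing the store with the same provider used for queries is the safer option.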

The most likely cause of the mismatch in similarity search results is embedding normalization. Please test the steps mentioned above to resolve the issue.

If the reply was helpful, please don't forget to upvote and/or accept the answer, as this can be beneficial to other community members.

Thank you

Vikram Singh 2,585 Reputation points Microsoft Employee Moderator

2025-03-10T05:18:21.8833333+00:00

Hi @firi berhane ,

Greetings.

Just following up to check if my suggestion helped. Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.

Thank you

firi berhane 0 Reputation points

2025-03-10T10:21:50.24+00:00

I'm currently trying out the suggestions and will reach out if I run into any issues. Thanks!

Vikram Singh 2,585 Reputation points Microsoft Employee Moderator

2025-03-13T04:02:47.39+00:00

Hi @firi berhane ,

Greetings!

We haven’t heard from you since the last response and were just checking back to see if you got a chance to try the above suggestions.

Thank you.
