Mismatch in Similarity Search Results, OpenAI vs. Azure OpenAI Embeddings

firi berhane 0 Reputation points
2025-03-05T07:36:14.75+00:00

We deployed the "text-embedding-ada-002" embedding model on our Azure resources and use it in our codebase to create an AzureOpenAIEmbedding instance for similarity search and document retrieval. The similarity search runs against a vector store whose embeddings were generated with an OpenAI (non-Azure) embedding instance.

Although the implementation runs without errors, the retrieved documents are not similar to the search query. This issue did not occur when we were using OpenAI embeddings directly. We used the same model, "text-embedding-ada-002", with the same dimension value of 1536, but from different providers: OpenAI vs. Azure OpenAI.

Why is this happening?

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. Vikram Singh 2,585 Reputation points Microsoft Employee Moderator
    2025-03-06T07:54:59.66+00:00

    Hi @firi berhane ,

    Welcome to Microsoft Q&A, and thank you for posting your question here.

    The issue you're experiencing could be due to several factors:

    1. Differences in Implementation
    2. Vector Store Configuration
    3. Normalization and Preprocessing
    4. Network and Latency Issues (these affect response time and are unlikely to change the embedding values themselves)

    To address this issue, you can try the following steps:

    1. Verify Consistency: Ensure that the same preprocessing steps are applied to the text before generating embeddings in both implementations.
      1. Verify the exact model version used in your Azure OpenAI deployment. Check the model parameter in your code and ensure it matches the deployment name in Azure (e.g., deployment_name="your-azure-deployment-name").
      2. Ensure consistency: Normalize embeddings from both OpenAI and Azure OpenAI if your vector store uses cosine similarity.
              from sklearn.preprocessing import normalize
              # `embeddings` is assumed to be a 2D array-like of shape (n_texts, 1536)
              embeddings = normalize(embeddings, norm="l2")  # Scale each row to unit L2 length
        
    2. Check Vector Store Configuration: The vector store (e.g., FAISS, Pinecone, or Azure AI Search, formerly Azure Cognitive Search) might be configured with a distance metric (e.g., inner product or L2) whose scores depend on vector length. If the query embeddings and the stored embeddings are not normalized the same way, similarity scores will be inconsistent.
      1. Confirm the distance metric used by your vector store (e.g., cosine, inner product, or L2) and whether it expects unit-length vectors. Reindex the vector store with normalized embeddings if required.
    3. Test with Sample Data: Test both services with identical input text and compare embeddings directly. Use a shared preprocessing pipeline (e.g., stripping whitespace, lowercasing) for consistency.
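
The direct comparison in step 3 could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes you have already fetched one embedding for the same input text from each provider as plain lists of floats. The helper names `cosine_similarity` and `compare_embeddings` and the 0.99 threshold are hypothetical choices, and the toy 3-dimensional vectors stand in for real 1536-dimensional embeddings.

```python
import math
from typing import Dict, Sequence, Union


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def compare_embeddings(
    openai_vec: Sequence[float],
    azure_vec: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff, tune for your data
) -> Dict[str, Union[float, bool]]:
    """Summarize how closely two embeddings of the same text agree."""
    similarity = cosine_similarity(openai_vec, azure_vec)
    max_abs_diff = max(abs(x - y) for x, y in zip(openai_vec, azure_vec))
    return {
        "cosine_similarity": similarity,
        "max_abs_diff": max_abs_diff,
        "same_space": similarity >= threshold,
    }


# Toy vectors: the first pair is identical, the second pair is orthogonal.
matching = compare_embeddings([0.6, 0.8, 0.0], [0.6, 0.8, 0.0])
mismatched = compare_embeddings([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(matching)
print(mismatched)
```

If the same text yields a low cosine similarity across the two providers, the vectors probably do not live in the same embedding space, and the stored documents would need to be re-embedded with the provider used at query time.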

    The most likely cause of the mismatch in similarity search results is embedding normalization. Please test the steps mentioned above to resolve the issue.
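
To see whether normalization is actually in play, it may help to check vector lengths and to observe how an inner-product ranking shifts once vectors are scaled to unit length. The sketch below uses only the Python standard library. The helper names, the 1e-3 tolerance, and the toy 2-dimensional documents are all illustrative assumptions, not part of any vector store's API.

```python
import math
from typing import Dict, List, Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 length."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def is_unit_length(vec: Sequence[float], tol: float = 1e-3) -> bool:
    """True if the vector's length is within `tol` of 1.0 (assumed tolerance)."""
    return abs(l2_norm(vec) - 1.0) <= tol


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy data: one short vector nearly aligned with the query, one long vector
# pointing in a different direction.
query = [1.0, 0.0]
docs: Dict[str, List[float]] = {
    "aligned_short": [0.9, 0.1],
    "off_axis_long": [3.0, 4.0],
}

# Rank by raw inner product, then by inner product of unit-length vectors.
raw_ranking = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
unit_ranking = sorted(
    docs, key=lambda k: dot(query, l2_normalize(docs[k])), reverse=True
)

print("unit length already?", {k: is_unit_length(v) for k, v in docs.items()})
print("raw inner-product ranking:", raw_ranking)
print("normalized ranking:", unit_ranking)
```

In this toy case the long vector wins under the raw inner product while the aligned vector wins after normalization. If `is_unit_length` returns True for embeddings from both providers, normalization is probably not the cause and the other steps are worth checking first.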

    If the reply was helpful, please don't forget to upvote and/or accept the answer; this can help other community members.

    Thank you

