Hi @firi berhane,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
The issue you're experiencing could be due to several factors:
- Differences in Implementation
- Vector Store Configuration
- Normalization and Preprocessing
- Network and Latency Issues
To address this issue, you can try the following steps:
- Verify Consistency: Ensure that the same preprocessing steps are applied to the text before generating embeddings in both implementations.
- Verify the exact model version used in your Azure OpenAI deployment. Check the `model` parameter in your code and ensure it matches the deployment name in Azure (e.g., `deployment_name="your-azure-deployment-name"`). A minimal sketch is shown below.
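If you're on the openai Python SDK v1.x, a minimal sketch looks like this (the endpoint, API version, key, and deployment name are placeholders, not values from your setup):

```python
from openai import AzureOpenAI

# Placeholder credentials -- substitute your own resource values.
client = AzureOpenAI(
    api_key="YOUR_API_KEY",
    api_version="2024-02-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# With Azure OpenAI, `model` must be the deployment name you created in the
# Azure portal, not necessarily the underlying base model name.
response = client.embeddings.create(
    model="your-azure-deployment-name",
    input="sample text",
)
embedding = response.data[0].embedding
```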
- Ensure consistency: Normalize embeddings from both OpenAI and Azure OpenAI if your vector store uses cosine similarity. For example:
```python
from sklearn.preprocessing import normalize

# Apply L2 normalization; `embeddings` should be a 2D array of
# shape (n_vectors, n_dimensions).
embeddings = normalize(embeddings, norm="l2")
```
- Check Vector Store Configuration: The vector store (e.g., FAISS, Azure Cognitive Search, Pinecone) might be configured with a distance metric (e.g., cosine, L2) that assumes normalized embeddings. If Azure OpenAI embeddings are unnormalized, similarity scores will be inconsistent.
- Confirm the distance metric used by your vector store (e.g., `cosine` for normalized embeddings, `L2` for unnormalized). Reindex the vector store with normalized embeddings if required; a FAISS sketch follows.
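If FAISS is your store, one way to get cosine-style behavior is an inner-product index over L2-normalized vectors. This is only an illustrative sketch; the dimension and the random vectors are placeholders:

```python
import faiss
import numpy as np

d = 1536  # e.g., the output dimension of text-embedding-ada-002
vectors = np.random.rand(100, d).astype("float32")  # placeholder embeddings

# Cosine similarity equals inner product once vectors are L2-normalized.
faiss.normalize_L2(vectors)   # normalizes in place
index = faiss.IndexFlatIP(d)  # inner-product (dot-product) index
index.add(vectors)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)     # the query must be normalized the same way
scores, ids = index.search(query, 5)
```

If you previously indexed unnormalized embeddings into an index that assumes normalized ones (or vice versa), scores from the two services won't be comparable until you reindex.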
- Test with Sample Data: Test both services with identical input text and compare the embeddings directly, using a shared preprocessing pipeline (e.g., stripping whitespace, lowercasing) for consistency. A sketch of such a comparison is shown below.
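A minimal sketch of a direct comparison (the client calls are commented out because the client setup and deployment names depend on your environment):

```python
import numpy as np

def preprocess(text: str) -> str:
    # Shared preprocessing applied before *both* services
    return text.strip().lower()

def cosine(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

text = preprocess("  Sample Input Text  ")

# Hypothetical calls -- fill in your own clients and deployment names:
# openai_vec = openai_client.embeddings.create(
#     model="text-embedding-ada-002", input=text).data[0].embedding
# azure_vec = azure_client.embeddings.create(
#     model="your-azure-deployment-name", input=text).data[0].embedding
# print(cosine(openai_vec, azure_vec))  # values near 1.0 suggest matching model versions
```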
The most likely cause of the mismatch in similarity search results is embedding normalization. Please try the steps above to resolve the issue.
If the reply was helpful, please don't forget to upvote and/or Accept the answer; this can be beneficial to other community members.
Thank you