How does image search work?

Akashkumar Barot (HCL TECHNOLOGIES CORPORATE SER) 40 Reputation points Microsoft External Staff
2025-05-21T21:40:13.0766667+00:00

Hello,

I'm currently using the Azure Vision API to generate vector embeddings from images, which are then compared against pre-vectorized images stored in a database for similarity search.

For some images, the search performs well and returns accurate, relevant results. However, for others, the results are either inaccurate or not closely related.

I’d like to better understand how the AI interprets images during the similarity search process. Specifically:

  • What algorithm or model is used to generate the embeddings? Will it give priority to text or to the background image?
  • How does the system handle images that are faded, have background noise, or contain text overlays?

How does the algorithm prioritize visual elements when matching images from the database?

Any documentation or guidance on how the image content is processed and matched would be greatly appreciated.

Thank you!

Azure AI Custom Vision
An Azure artificial intelligence service and end-to-end platform for applying computer vision to specific domains.

Accepted answer
  1. Vinodh247 34,741 Reputation points MVP Volunteer Moderator
    2025-05-22T04:16:13.9633333+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Here is a breakdown of how the Azure Vision API and similar systems perform image similarity search using vector embeddings, and how they interpret different image elements:


    1. What algorithm or model is used to generate the embeddings?

    Azure's Vision API (under Azure Cognitive Services, now Azure AI Vision) uses deep convolutional neural networks (CNNs), often based on ResNet, Inception, or Vision Transformer architectures, under the hood. When it comes to generating embeddings:

    For image features, Microsoft does not disclose exact architecture versions, but models are trained on large-scale datasets like ImageNet, OpenImages, or internal datasets.

    The Image Analysis API (v4.0) or multimodal embeddings in Azure AI Search use CLIP-like models (Contrastive Language–Image Pretraining), which align image and text features in the same embedding space.

    These embeddings are fixed-length vectors (typically 512 or 1,024 dimensions) that capture global semantic information: objects, text, and, to an extent, style.
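
    As an illustration, here is a minimal sketch of requesting an image embedding from the Image Analysis 4.0 retrieval:vectorizeImage endpoint with Python and requests. The endpoint, key, api-version, and model-version values are placeholders/assumptions; substitute the ones from your own Azure AI Vision resource.

    ```python
    import requests

    # Placeholder values - substitute your own Azure AI Vision resource details.
    ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
    KEY = "<your-key>"
    PARAMS = "api-version=2024-02-01&model-version=2023-04-15"  # assumed versions

    def vectorize_image(image_bytes: bytes) -> list:
        """Request a multimodal embedding for an image."""
        resp = requests.post(
            f"{ENDPOINT}/computervision/retrieval:vectorizeImage?{PARAMS}",
            headers={
                "Ocp-Apim-Subscription-Key": KEY,
                "Content-Type": "application/octet-stream",
            },
            data=image_bytes,
        )
        resp.raise_for_status()
        # The response body carries the fixed-length float vector under "vector".
        return resp.json()["vector"]

    with open("product_photo.jpg", "rb") as f:
        embedding = vectorize_image(f.read())
    print(len(embedding))  # e.g. 1024 floats
    ```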


    2. Does the model prioritize text or background image?

    This depends on which pipeline is used:

    • If OCR is turned on or used in parallel, text is explicitly extracted and may affect embedding similarity if the embedding model incorporates both image and text modalities.
    • In CLIP-style embeddings, text in the image can dominate the semantic similarity. For example, a banner with bold text "SALE" may be matched with other SALE banners even if the background differs (see the sketch below).
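
    To see why on-image text can dominate, you can compare the embedding of the word "SALE" from the retrieval:vectorizeText endpoint against a stored image embedding; a banner whose overlay reads "SALE" will typically score high regardless of its background. A minimal sketch, again with placeholder endpoint and key values:

    ```python
    import numpy as np
    import requests

    # Placeholder Azure AI Vision resource values (assumptions).
    ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
    KEY = "<your-key>"
    PARAMS = "api-version=2024-02-01&model-version=2023-04-15"

    def vectorize_text(text: str) -> np.ndarray:
        """Embed a text string into the same space as the image embeddings."""
        resp = requests.post(
            f"{ENDPOINT}/computervision/retrieval:vectorizeText?{PARAMS}",
            headers={"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"},
            json={"text": text},
        )
        resp.raise_for_status()
        return np.array(resp.json()["vector"])

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # An image embedding you stored earlier (illustrative file name).
    banner_vector = np.load("sale_banner_embedding.npy")
    print(cosine(vectorize_text("SALE"), banner_vector))  # often high when the text dominates
    print(cosine(vectorize_text("red gradient background"), banner_vector))
    ```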

    By default:

    • If you are using Azure AI Vision for "embedding only", the visual contents (objects, colors, patterns) are prioritized.
    • If you use semantic captioning + embedding, then textual elements (visible or implied) may get higher priority.


    3. How does the system handle faded images, background noise, or text overlays?

    This is where robustness varies:

    • Faded images or background noise: Modern deep vision models are somewhat robust to lighting changes and occlusions due to their pretraining diversity. However, extreme noise or blur reduces accuracy.
    • Text overlays:
      • If clearly legible, models like CLIP or Azure's semantic models will encode the text as part of the meaning, which may skew the similarity.
      • If partially visible or stylized, the model may misinterpret or deprioritize that information.

    Inconsistent results often arise from:

    • Too much reliance on textual cues from the image.
    • Overfitting to style rather than substance (e.g., matching background colors or borders).


    4. How does the algorithm prioritize visual elements during matching?

    Internally, embeddings are compared using cosine similarity or Euclidean distance.
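
    For example, a minimal sketch of ranking pre-vectorized database images against a query embedding with cosine similarity in NumPy (the array shapes and file names are illustrative):

    ```python
    import numpy as np

    # db_vectors: N stored image embeddings (one per row); query_vector: the new image's embedding.
    db_vectors = np.load("image_embeddings.npy")    # shape (N, 1024), illustrative
    query_vector = np.load("query_embedding.npy")   # shape (1024,)

    # Cosine similarity = dot product of L2-normalized vectors.
    db_norm = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    q_norm = query_vector / np.linalg.norm(query_vector)
    scores = db_norm @ q_norm                       # shape (N,); higher = more similar

    top_k = np.argsort(scores)[::-1][:10]           # indices of the 10 closest stored images
    print(list(zip(top_k.tolist(), scores[top_k].tolist())))
    ```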

    Prioritization is emergent, not explicitly defined, but there are typical tendencies:

    • Objects and their spatial relationships are given weight.
    • Color palettes and textures may affect style-based matches.
    • If text is detected, it may outweigh fine-grained visual content, depending on model training.

    Azure does not offer fine-grained control to "turn off" or "down-weight" text, but you can pre-process images (see the sketch after this list):

    • Strip text overlays.
    • Blur/normalize backgrounds to focus embeddings on core content.
    • Use custom embedding models if Azure's default does not meet quality requirements.
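
    As an example of stripping a text overlay before embedding, here is a minimal OpenCV sketch that inpaints over text regions. The bounding boxes are assumed to come from an OCR step (for instance Azure AI Vision Read), and the coordinates shown are placeholders:

    ```python
    import cv2
    import numpy as np

    image = cv2.imread("banner.jpg")

    # Assumed OCR output: (x, y, width, height) boxes around detected text (placeholder values).
    text_boxes = [(40, 20, 300, 80)]

    # Build a mask covering the text regions and inpaint over them.
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for x, y, w, h in text_boxes:
        cv2.rectangle(mask, (x, y), (x + w, y + h), 255, -1)

    cleaned = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)
    cv2.imwrite("banner_no_text.jpg", cleaned)  # embed this cleaned image instead of the original
    ```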


    5. Recommendations for better similarity search
    • Preprocess images:
      • Use object detection to crop to the main subject before embedding.
      • Use text removal techniques (like inpainting) if text is noise.
    • Use a dual embedding strategy (a sketch follows this list):
      • Extract text separately using OCR and embed it with a language model.
      • Combine that with visual embeddings only if relevant.
    • Use Azure AI Search with multimodal search, where you can weight textual vs. visual similarity.
    • Consider fine-tuning an ONNX-based Vision Transformer (like ViT-B/32) or ResNet-50 model using your own dataset if possible.
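
    A hedged sketch of such a dual-embedding combination: the visual and textual vectors are L2-normalized and blended with an adjustable weight. It assumes both vectors live in the same embedding space (as with the multimodal vectorizeImage/vectorizeText pair); embed_image, embed_text, and ocr_text are hypothetical helpers standing in for whatever embedding and OCR calls you use.

    ```python
    import numpy as np

    def combine_embeddings(image_vec: np.ndarray, text_vec: np.ndarray,
                           text_weight: float = 0.3) -> np.ndarray:
        """Blend visual and textual embeddings; text_weight sets how much OCR text influences matching."""
        v = image_vec / np.linalg.norm(image_vec)
        t = text_vec / np.linalg.norm(text_vec)
        combined = (1.0 - text_weight) * v + text_weight * t
        return combined / np.linalg.norm(combined)

    # Hypothetical usage (helpers not shown):
    # query = combine_embeddings(embed_image(img_bytes),
    #                            embed_text(ocr_text(img_bytes)),
    #                            text_weight=0.2)
    ```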

    Hope this helps!

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

