How to reduce the influence of image text on the semantic vectors when using AI Vision Image Retrieval Endpoint

Hiob Gebisso 101

Hello,

We are currently testing the Vision Image Retrieval API on book covers and noticed that the model is heavily influenced by text (author, titles, subtitles) on an image. Is there a more straightforward way to reduce the influence of text on the model output than preprocessing the image with OCR to get rid of the text before calling the retrieval API?

Best,

Hiob

YutongTie-MSFT 47,416 Reputation points

2023-12-04T21:13:40.62+00:00
@Hiob Gebisso

Hello Hiob,

When working with AI Vision Image Retrieval systems that use semantic vectors, the influence of text present in images can indeed skew the results if you are primarily interested in visual features other than text. While preprocessing images to remove text is one approach, here are a few other strategies you might consider to reduce the influence of text:

Model Fine-tuning: If you have the ability to fine-tune the model, you could train it on a dataset where the text in images is not relevant to the retrieval task. By doing so, the model should learn to focus on other visual features.

Feature Masking: You can apply a mask to the regions of the image where text is present before feeding the image to the model. This way, the model will ignore these regions and focus on the rest of the image. This requires text detection but not necessarily removal.

Layer Manipulation: If you have access to the model's internals, you could manipulate the layers that are responsible for text recognition to reduce their influence or weight in the final vector representation.

Semantic Segmentation: Use a semantic segmentation model to identify and segment out the text portions of the image. Once segmented, you can fill in these regions with a neutral color or pattern before inputting the image into the retrieval system.

Adversarial Training: Introduce adversarial examples during training where the text is altered or misleading, encouraging the model to rely less on text and more on other features for the retrieval task.

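Dimensionality Reduction: Post-process the semantic vectors with a dimensionality reduction technique such as PCA, which might reduce the influence of text-related features. Note that t-SNE is mainly a visualization technique and does not give you a transform you can apply to new query vectors, so it is a poor fit for retrieval. Any such reduction might also hurt retrieval performance if not done carefully.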

Alternate Models: Use or develop models that are inherently less sensitive to text in images. For example, some models are designed to focus on textures, shapes, or other visual aspects rather than text.

Ensemble Methods: Combine outputs from multiple models where at least one model is insensitive to text. For example, one model could be trained on images with text removed, while another could be trained on the original images.

Use Metadata: If you're working with book covers and you have metadata like the title or author name, you can use this information to filter or adjust the results after the initial retrieval.
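
As a rough illustration of the Feature Masking and Semantic Segmentation strategies above, here is a minimal sketch that fills detected text regions with a neutral value before the image is sent to the retrieval endpoint. Everything in it is an assumption for illustration: the function name mask_text_regions is hypothetical, the image is a toy grayscale grid held in nested lists (a real pipeline would use an imaging library), and the bounding box is hard-coded where a real one would come from a text detector or OCR step that is not shown.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def mask_text_regions(
    pixels: List[List[int]], boxes: List[Box], fill: int = 128
) -> List[List[int]]:
    """Return a copy of a grayscale image with each box filled with a neutral value."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    masked = [row[:] for row in pixels]  # copy so the input stays untouched
    for left, top, right, bottom in boxes:
        # Clamp the box to the image bounds so oversized boxes are harmless.
        x0, y0 = max(0, left), max(0, top)
        x1, y1 = min(width, right), min(height, bottom)
        for y in range(y0, y1):
            for x in range(x0, x1):
                masked[y][x] = fill
    return masked


# Toy 4x6 "cover": 0 stands for artwork, 255 for title text.
cover = [
    [0, 0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0, 0],
    [0, 255, 255, 255, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
title_box = (1, 1, 4, 3)  # hypothetical output of a text detector
masked_cover = mask_text_regions(cover, [title_box])
```

The masked copy, not the original, would then be vectorized. Whether a flat fill, a blur, or inpainting works best for a given embedding model is something to test on your own data.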

Remember, some of these approaches might require significant effort and adjustments to your current setup. It's important to balance the need to reduce text influence with the overall performance and practicality of your image retrieval system. Depending on your specific application and resources, some strategies might be more applicable than others. It's also worth considering whether a hybrid approach that combines several methods might yield the best results.
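
One possible shape for such a hybrid, in the spirit of the Ensemble Methods item, is to blend similarity scores from two sets of vectors. The sketch below assumes you already have two embeddings per image, for example one from the original cover and one from a text-masked copy; that pairing is a substitute for the two separately trained models mentioned above, not something the endpoint provides. The names blended_ranking and masked_weight are hypothetical, and the weight of 0.7 is an arbitrary starting point.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def blended_ranking(
    query_orig: Sequence[float],
    query_masked: Sequence[float],
    index_orig: Dict[str, Sequence[float]],
    index_masked: Dict[str, Sequence[float]],
    masked_weight: float = 0.7,
) -> List[str]:
    """Rank item ids by a weighted mix of two cosine similarities."""
    scores: Dict[str, float] = {}
    for item_id, vec_orig in index_orig.items():
        sim_orig = cosine_similarity(query_orig, vec_orig)
        sim_masked = cosine_similarity(query_masked, index_masked[item_id])
        # A higher masked_weight makes the text-free vectors count for more.
        scores[item_id] = masked_weight * sim_masked + (1 - masked_weight) * sim_orig
    return sorted(scores, key=lambda item_id: scores[item_id], reverse=True)


# Toy 2-d vectors: item "a" matches the query on the original image,
# item "b" matches once the text has been masked out.
index_orig = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
index_masked = {"a": [0.0, 1.0], "b": [1.0, 0.0]}
ranking = blended_ranking([1.0, 0.0], [1.0, 0.0], index_orig, index_masked)
```

Setting masked_weight to 0.0 reproduces the ranking from the original vectors alone, which gives a baseline for judging whether the blend actually helps.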

I hope this helps.

Regards,

Yutong
Hiob Gebisso 101 Reputation points

2023-12-05T15:20:16.6133333+00:00

Thanks for the ChatGPT-generated response, but it is too generic and partially inaccurate. I was rather hoping that the AI vision model endpoint itself has a parameter that can be set to determine the influence of text on the vector results. How would I even go about fine-tuning the Florence Foundation model so that it ignores text in images?
YutongTie-MSFT 47,416 Reputation points

2023-12-05T16:11:28.0333333+00:00

@Hiob Gebisso Thanks for your response. I have forwarded your feedback to the team to see whether there is a good solution for processing only the image content rather than the text. If you can share more about your scenario, that will be helpful. I will get back to you once I get a response.

Regards,

Yutong
