GPT Vision With Grounding Not Working

Question

Dan Hastings 20

I am using the following request body:

{
    "enhancements": {
        "ocr": {
            "enabled": true
        },
        "grounding": {
            "enabled": true
        }
    },
    "dataSources": [
        {
            "type": "AzureComputerVision",
            "parameters": {
                "endpoint": "XXXX",
                "key": "XXXX"
            }
        }
    ],
    "messages": [
        {
            "role": "system",
            "content": "Using the image file provided, you will need to analyze the image based on the prompt and return an array of x and y coordinates for the top left and bottom right of each item detected. "
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Give me back all of the small icons for each item on the list. they should contain a small representation of the item with a solid blue shape of each gun"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "http://yomotherboard.com/grounding-test.jpg"
                    }
                }
            ]
        }
    ],
    "max_tokens": 4000,
    "stream": false
}

The API responds with the following. It seems to have picked up each item in the list, since it returns the correct names, but it doesn't return the correct coordinates for the icons. The icons start around 400px across.

{
    "id": "xxxx",
    "object": "chat.completion",
    "created": 1710342480,
    "model": "gpt-4",
    "prompt_filter_results": [
        {
            "prompt_index": 0,
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The array of x and y coordinates for the top left and bottom right of each small icon (with a solid blue shape) for the items in the list are as follows:\n\n1. Machine Gun: [37, 175], [91, 229]\n2. Anti-Materiel Rifle: [37, 251], [91, 305]\n3. Stalwart: [37, 327], [91, 381]\n4. Expendable Anti-Tank: [37, 403], [91, 457]\n5. Recoilless Rifle: [37, 479], [91, 533]\n6. Flamethrower: [37, 555], [91, 609]\n7. Autocannon: [37, 631], [91, 685]\n8. Railgun: [37, 707], [91, 761]\n9. Spear: [37, 783], [91, 837]"
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            },
            "enhancements": {
                "grounding": {
                    "lines": [
                        {
                            "text": "The array of x and y coordinates for the top left and bottom right of each small icon (with a solid blue shape) for the items in the list are as follows:\n\n1. Machine Gun: [37, 175], [91, 229]\n2. Anti-Materiel Rifle: [37, 251], [91, 305]\n3. Stalwart: [37, 327], [91, 381]\n4. Expendable Anti-Tank: [37, 403], [91, 457]\n5. Recoilless Rifle: [37, 479], [91, 533]\n6. Flamethrower: [37, 555], [91, 609]\n7. Autocannon: [37, 631], [91, 685]\n8. Railgun: [37, 707], [91, 761]\n9. Spear: [37, 783], [91, 837]",
                            "spans": []
                        }
                    ],
                    "status": "Success"
                }
            }
        }
    ],
    "usage": {
        "prompt_tokens": 1513,
        "completion_tokens": 200,
        "total_tokens": 1713
    }
}
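In the response above, the coordinates appear only inside the assistant's text, while the grounding enhancement reports "spans": [] for that same text. A minimal Python sketch of how both could be checked follows. The helper names parse_boxes and grounding_spans are hypothetical, and the regular expression assumes the reply keeps the "N. Name: [x1, y1], [x2, y2]" layout shown above, which the model is not guaranteed to repeat.

```python
import re

# Matches reply lines such as "1. Machine Gun: [37, 175], [91, 229]".
# Assumption: the model keeps this exact layout in its text answer.
BOX_LINE = re.compile(
    r"^\s*\d+\.\s*(?P<name>.+?):\s*"
    r"\[(?P<x1>\d+),\s*(?P<y1>\d+)\],\s*\[(?P<x2>\d+),\s*(?P<y2>\d+)\]\s*$",
    re.MULTILINE,
)


def parse_boxes(content):
    """Return (name, (x1, y1, x2, y2)) tuples found in the reply text."""
    boxes = []
    for match in BOX_LINE.finditer(content):
        coords = tuple(int(match.group(k)) for k in ("x1", "y1", "x2", "y2"))
        boxes.append((match.group("name"), coords))
    return boxes


def grounding_spans(choice):
    """Collect every span the grounding enhancement reports for one choice."""
    grounding = choice.get("enhancements", {}).get("grounding", {})
    return [
        span
        for line in grounding.get("lines", [])
        for span in line.get("spans", [])
    ]


# Two lines copied from the reply text in the response above.
sample_content = (
    "1. Machine Gun: [37, 175], [91, 229]\n"
    "2. Anti-Materiel Rifle: [37, 251], [91, 305]"
)
# Same shape as the "enhancements" object in the response above.
sample_choice = {
    "enhancements": {
        "grounding": {
            "lines": [{"text": sample_content, "spans": []}],
            "status": "Success",
        }
    }
}

boxes = parse_boxes(sample_content)
spans = grounding_spans(sample_choice)
```

An empty spans list alongside parsed boxes would suggest the numbers come from the language model's text output rather than from the grounding step.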

navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-03-14T03:56:44.2566667+00:00

@Dan Hastings Welcome to Microsoft Q&A Forum. Thank you for posting your query here!

Could you please test with a different, better-quality image and check whether you still get incorrect coordinates? Awaiting your reply.
Dan Hastings 20 Reputation points

2024-03-14T15:49:32.06+00:00

@navba-MSFT I have tried other images of a similar UI. These images are 3840x2160px. If this is considered too low a resolution to work with this product, what is the minimum resolution supported?
Dan Hastings 20 Reputation points

2024-03-14T15:51:31.3666667+00:00

@navba-MSFT The current image is 3840x2160. What is the minimum resolution an image needs to be? I have tried other similar images and it can't seem to identify the smaller icons in the list.
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-03-15T05:36:18.0966667+00:00
@Dan Hastings Thanks for getting back.

The Azure OpenAI GPT-4 Turbo with Vision model doesn’t have a specific minimum resolution requirement for images. However, the model’s ability to accurately identify and locate objects in an image can be influenced by the image’s resolution.

Low resolution accuracy: When images are analyzed using the "low resolution" setting, it allows for faster responses and uses fewer input tokens for certain use cases. *However, this could impact the accuracy of object and text recognition within the image.*

When the low setting is enabled, the model processes a lower resolution 512x512 version of the image. This setting results in quicker responses and reduced token consumption for scenarios where fine detail isn’t crucial. If the icons in your image are very small and close together, it might be challenging for the model to accurately identify and locate them at this lower resolution.

If you’re dealing with images where fine detail is important, you might want to consider using the high setting instead of low. The high setting would allow the model to process a higher resolution version of the image, which could potentially improve the accuracy of icon identification and location.

More info here.

Remember, the low and high settings are chosen with the detail field on the image_url object in your request body. You can adjust this setting based on the specific requirements of your image analysis task. Awaiting your reply.
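In the chat completions image format, the detail level is normally a field of the image_url object rather than of enhancements. A minimal Python sketch of building that message part follows, assuming the field is named detail and accepts low, high, or auto. The helper name image_part is hypothetical.

```python
ALLOWED_DETAIL = ("low", "high", "auto")


def image_part(url, detail="high"):
    """Build an image_url content part with an explicit detail level."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError("detail must be one of: " + ", ".join(ALLOWED_DETAIL))
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


# The image URL is the one used in the request body of this thread.
part = image_part("http://yomotherboard.com/grounding-test.jpg", detail="high")
```

The resulting dictionary would take the place of the image_url entry in the user message's content array.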
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-03-18T04:46:50.6233333+00:00

@Dan Hastings Just following up to check if my suggestion helped. Please let me know if you have any further queries. I would be happy to help.
Dan Hastings 20 Reputation points

2024-03-19T12:20:49.6333333+00:00

@navba-MSFT Still not working, even when the quality is set to high. Based on the documentation, though, it sounds like it resizes the image. My image is 3840 x 2160, and it sounds like on high quality this gets converted to 2048x2048. So when the response says the thing I am looking for is at 100px/600px top left and 150px/650px bottom right, are those coordinates relative to the resized image rather than the full-size image that I then apply them to?
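Whether the returned coordinates refer to the resized image is never confirmed in this thread. If they did, mapping them back would be a plain rescale, sketched below in Python. The sketch assumes the resize fits the image inside a 2048x2048 square while preserving aspect ratio, which would turn 3840x2160 into 2048x1152 rather than 2048x2048. The function names fit_within and to_original are hypothetical.

```python
def fit_within(width, height, max_side=2048):
    """Size after shrinking to fit a max_side square, keeping aspect ratio.

    Assumption: images already inside the square are left unchanged.
    """
    scale = min(max_side / width, max_side / height, 1.0)
    return round(width * scale), round(height * scale)


def to_original(box, resized_size, original_size):
    """Map (x1, y1, x2, y2) from the resized image back to the original."""
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    scale_x = original_w / resized_w
    scale_y = original_h / resized_h
    x1, y1, x2, y2 = box
    return (
        round(x1 * scale_x),
        round(y1 * scale_y),
        round(x2 * scale_x),
        round(y2 * scale_y),
    )


original = (3840, 2160)
resized = fit_within(*original)
# First box from the response in this thread ("Machine Gun").
mapped = to_original((37, 175, 91, 229), resized, original)
```

Under these assumptions the first reported x of 37 maps to about 69 in the original image, still well short of the roughly 400px where the icons start, so a rescale alone would not account for the mismatch.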
navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-03-19T14:25:37.0166667+00:00

@Dan Hastings Thanks for your reply. I am checking this at my end. I will get back to you once I have more details.


Answer 1

navba-MSFT 27,550 Microsoft Employee Moderator

@Dan Hastings Apologies for the late reply. I appreciate your patience on this.

I had a discussion internally with the Product Owners. Please find the update below:

This is expected behavior. The reasons are:

(i) The model was not trained on many game images or icons of this kind.

(ii) Post-processing filters out all very small or very large boxes, regardless of image content.
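To illustrate what a size-based post-processing filter of the kind described in (ii) might look like, here is a purely hypothetical Python sketch. The thresholds min_frac and max_frac and the function name filter_by_relative_area are invented for illustration. The service's actual rule and limits are not stated in this thread.

```python
def filter_by_relative_area(boxes, image_size, min_frac=0.001, max_frac=0.9):
    """Keep boxes whose area is a moderate fraction of the image area.

    Hypothetical thresholds: boxes under min_frac or over max_frac of the
    image area are dropped, whatever they contain.
    """
    width, height = image_size
    image_area = width * height
    kept = []
    for x1, y1, x2, y2 in boxes:
        frac = ((x2 - x1) * (y2 - y1)) / image_area
        if min_frac <= frac <= max_frac:
            kept.append((x1, y1, x2, y2))
    return kept


candidates = [
    (37, 175, 91, 229),    # 54x54 icon-sized box from the response above
    (0, 0, 1920, 1080),    # a quarter of the image
    (0, 0, 3840, 2160),    # the whole image
]
kept = filter_by_relative_area(candidates, (3840, 2160))
```

With these made-up limits, a 54x54 box on a 3840x2160 image covers about 0.035% of the frame and would be discarded, which would be consistent with the empty spans list in the response.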

Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.
