GPT Vision With Grounding Not Working

Question

I am using the following request body

{
    "enhancements": {
            "ocr": {
              "enabled": true
            },
            "grounding": {
              "enabled": true
            }
    },
    "dataSources": [
    {
        "type": "AzureComputerVision",
        "parameters": {
            "endpoint": "XXXX",
            "key": "XXXX"
        }
    }],
    "messages": [
        {
            "role": "system",
            "content": "Using the image file provided, you will need to analyze the image based on the prompt and return an array of x and y coordinates for the top left and bottom right of each item detected. "
        },
        {
            "role": "user",
            "content": [
	            {
	                "type": "text",
	                "text": "Give me back all of the small icons for each item on the list. they should contain a small representation of the item with a solid blue shape of each gun"
	            },
	            {
	                "type": "image_url",
	                "image_url": {
                        "url":"http://yomotherboard.com/grounding-test.jpg" 
                    }
                }
           ] 
        }
    ],
    "max_tokens": 4000, 
    "stream": false 
}

The API responses with the following. It seems to have correctly picked up each item in the list as it has detected the correct name but it doesnt seem to be able to grab the correct coordinates for the icons. The icons start around 400px across.

{
    "id": "xxxx",
    "object": "chat.completion",
    "created": 1710342480,
    "model": "gpt-4",
    "prompt_filter_results": [
        {
            "prompt_index": 0,
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The array of x and y coordinates for the top left and bottom right of each small icon (with a solid blue shape) for the items in the list are as follows:

1. Machine Gun: [37, 175], [91, 229]
2. Anti-Materiel Rifle: [37, 251], [91, 305]
3. Stalwart: [37, 327], [91, 381]
4. Expendable Anti-Tank: [37, 403], [91, 457]
5. Recoilless Rifle: [37, 479], [91, 533]
6. Flamethrower: [37, 555], [91, 609]
7. Autocannon: [37, 631], [91, 685]
8. Railgun: [37, 707], [91, 761]
9. Spear: [37, 783], [91, 837]"
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            },
            "enhancements": {
                "grounding": {
                    "lines": [
                        {
                            "text": "The array of x and y coordinates for the top left and bottom right of each small icon (with a solid blue shape) for the items in the list are as follows:

1. Machine Gun: [37, 175], [91, 229]
2. Anti-Materiel Rifle: [37, 251], [91, 305]
3. Stalwart: [37, 327], [91, 381]
4. Expendable Anti-Tank: [37, 403], [91, 457]
5. Recoilless Rifle: [37, 479], [91, 533]
6. Flamethrower: [37, 555], [91, 609]
7. Autocannon: [37, 631], [91, 685]
8. Railgun: [37, 707], [91, 761]
9. Spear: [37, 783], [91, 837]",
                            "spans": []
                        }
                    ],
                    "status": "Success"
                }
            }
        }
    ],
    "usage": {
        "prompt_tokens": 1513,
        "completion_tokens": 200,
        "total_tokens": 1713
    }
}

Accepted Answer

@Dan Hastings Apologies for the late reply. I appreciate your patience on this.

I had a discussion internally with the Product Owners. Please find the update below:

This issue is an expected behavior. Reasons are

(i) model not trained on many such game images or icons.

(ii) post-processing filters out all very small or large boxes, regardless of image content.

**

Please do not forget to "Accept the answer” and “up-vote” wherever the information provided helps you, this can be beneficial to other community members.

Share via

GPT Vision With Grounding Not Working

1 additional answer

Your answer