GPT Vision With Grounding Not Working

Dan Hastings 20 Reputation points
2024-03-13T15:18:14.4833333+00:00

I am using the following request body

{
    "enhancements": {
            "ocr": {
              "enabled": true
            },
            "grounding": {
              "enabled": true
            }
    },
    "dataSources": [
    {
        "type": "AzureComputerVision",
        "parameters": {
            "endpoint": "XXXX",
            "key": "XXXX"
        }
    }],
    "messages": [
        {
            "role": "system",
            "content": "Using the image file provided, you will need to analyze the image based on the prompt and return an array of x and y coordinates for the top left and bottom right of each item detected. "
        },
        {
            "role": "user",
            "content": [
	            {
	                "type": "text",
	                "text": "Give me back all of the small icons for each item on the list. they should contain a small representation of the item with a solid blue shape of each gun"
	            },
	            {
	                "type": "image_url",
	                "image_url": {
                        "url":"http://yomotherboard.com/grounding-test.jpg" 
                    }
                }
           ] 
        }
    ],
    "max_tokens": 4000, 
    "stream": false 
}


The API responses with the following. It seems to have correctly picked up each item in the list as it has detected the correct name but it doesnt seem to be able to grab the correct coordinates for the icons. The icons start around 400px across.

{
    "id": "xxxx",
    "object": "chat.completion",
    "created": 1710342480,
    "model": "gpt-4",
    "prompt_filter_results": [
        {
            "prompt_index": 0,
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The array of x and y coordinates for the top left and bottom right of each small icon (with a solid blue shape) for the items in the list are as follows:\n\n1. Machine Gun: [37, 175], [91, 229]\n2. Anti-Materiel Rifle: [37, 251], [91, 305]\n3. Stalwart: [37, 327], [91, 381]\n4. Expendable Anti-Tank: [37, 403], [91, 457]\n5. Recoilless Rifle: [37, 479], [91, 533]\n6. Flamethrower: [37, 555], [91, 609]\n7. Autocannon: [37, 631], [91, 685]\n8. Railgun: [37, 707], [91, 761]\n9. Spear: [37, 783], [91, 837]"
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            },
            "enhancements": {
                "grounding": {
                    "lines": [
                        {
                            "text": "The array of x and y coordinates for the top left and bottom right of each small icon (with a solid blue shape) for the items in the list are as follows:\n\n1. Machine Gun: [37, 175], [91, 229]\n2. Anti-Materiel Rifle: [37, 251], [91, 305]\n3. Stalwart: [37, 327], [91, 381]\n4. Expendable Anti-Tank: [37, 403], [91, 457]\n5. Recoilless Rifle: [37, 479], [91, 533]\n6. Flamethrower: [37, 555], [91, 609]\n7. Autocannon: [37, 631], [91, 685]\n8. Railgun: [37, 707], [91, 761]\n9. Spear: [37, 783], [91, 837]",
                            "spans": []
                        }
                    ],
                    "status": "Success"
                }
            }
        }
    ],
    "usage": {
        "prompt_tokens": 1513,
        "completion_tokens": 200,
        "total_tokens": 1713
    }
}
Azure Computer Vision
Azure Computer Vision
An Azure artificial intelligence service that analyzes content in images and video.
311 questions
Azure
Azure
A cloud computing platform and infrastructure for building, deploying and managing applications and services through a worldwide network of Microsoft-managed datacenters.
944 questions
{count} votes

Accepted answer
  1. navba-MSFT 17,125 Reputation points Microsoft Employee
    2024-03-24T04:26:17.17+00:00

    @Dan Hastings Apologies for the late reply. I appreciate your patience on this.

    I had a discussion internally with the Product Owners. Please find the update below:

    This issue is an expected behavior. Reasons are

    (i) model not trained on many such game images or icons.

    (ii) post-processing filters out all very small or large boxes, regardless of image content.

    **

    Please do not forget to "Accept the answer” and “up-vote” wherever the information provided helps you, this can be beneficial to other community members.

    1 person found this answer helpful.
    0 comments No comments

1 additional answer

Sort by: Most helpful
  1. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.


    Comments have been turned off. Learn more