In the new playground, you can see the token breakdown here:
If you send an image and then enable [show JSON], you'll see that the image is encoded in base64 and sent to the server side.
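What the playground does there can be reproduced in a few lines. A minimal sketch (the helper name and the placeholder bytes are my own):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL, the same shape
    the playground puts into the JSON payload."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# In a real call you would read the bytes from a file:
# to_data_url(open("photo.png", "rb").read())
```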
Based on info from the OpenAI forum, the `detail` field inside `image_url` controls whether the image is processed in high, low, or auto mode. The discussion below suggests that `detail` defaults to `auto` when unspecified; my understanding is that in auto mode the vision model decides based on the actual resolution of the image. The most reliable way to find the actual tokens consumed by a request is to make a curl call to the endpoint with a JSON payload containing the base64 image. You can get a curl sample using the View code option, then choose curl. Make sure to add `-i` to the curl command so that it prints the response headers too.
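For reference, the payload that curl sample sends looks roughly like this (a sketch: the model name and the base64 placeholder are assumptions, not from the View code output):

```python
import json

# Shape of a chat completions request carrying a base64 image.
payload = {
    "model": "gpt-4o",  # assumed model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        # placeholder; paste the real base64 string here
                        "url": "data:image/png;base64,<BASE64_BYTES>",
                        "detail": "low",
                    },
                },
            ],
        }
    ],
}

print(json.dumps(payload, indent=2))
```

The consumed token counts come back in the response body under `usage.prompt_tokens` / `usage.total_tokens`; `-i` additionally shows headers such as the rate-limit ones.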
"image_url": { "url": "https://your-image-url.com", "detail": "low" }
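If you just want an estimate before making the call, this is a sketch of the image-token formula OpenAI published for the GPT-4 vision models: low detail is a flat 85 tokens; high detail scales the image to fit 2048x2048, then scales the shortest side down to 768, and charges 170 tokens per 512x512 tile plus an 85-token base. Newer models may use different constants, so treat the numbers as an approximation.

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "auto") -> int:
    """Approximate image token cost per the published GPT-4 vision formula."""
    if detail == "low":
        return 85  # low detail is a fixed cost regardless of size
    # Treat "auto" like "high" here for a worst-case estimate.
    # Step 1: fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale so the shortest side is at most 768.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: 170 tokens per 512 x 512 tile, plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(estimate_image_tokens(1024, 1024, detail="high"))  # → 765
```

For example, a 1024x1024 image in high detail comes out to 4 tiles, i.e. 170 * 4 + 85 = 765 tokens, which you can then compare against the `usage` numbers from the actual request.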
https://community.openai.com/t/gpt-4-vision-preview-fidelity-detail-parameter/477563