Assistants API (base model: GPT4o) unable to parse uploaded image as attachment and answer questions related to info. in image.

Question

Assistants API (base model: GPT4o) unable to parse uploaded image as attachment and answer questions related to info. in image.

GenixPRO 121

When using Assistants Playground on Azure portal: We create an Assistant using Assistants Playground. Then upload a PNG image attachment of a table with some records/content. In our prompt (see sample below) we ask questions related to this info. We get a reply from the Assistant created in Playground.

*Note: We see that Assistant is calling lib. and defining path for image etc. See below.

Load and extract text from all image files to analyze {...}

from PIL import Image

import pytesseract

Define paths for the uploaded images

image_paths = [

'/mnt/data/assistant-9s4pfyeg8W5RSwkHjdYqA4',

'/mnt/data/assistant-Uhoc9WZoxz7Tw8sRKeWxC4',

'/mnt/data/assistant-HE9LPS

When using Assistants API (w/ Assistant Thread ID) from our mobile app: Using Assistant Thread ID (created above), we tried to upload the same PNG image attachment and pass the same prompt. However, we keep getting a standard reply "It seems there was an issue with extracting the text from the image. Let's try again"

In this case, we've simply uploaded image to the thread [and not to vector store].

Sample Prompt:

Attached herein are PNG, JPG or JPEG image files. Use code interpreter to sequentially extract information from each file; read, understand, and interpret all information to make relevant inference. Then answer the following questions using the information contained in these files and any other contextual information shared earlier. <followed by questions>

Question: How do make the Assistant API work in this case, so that it can reply with information extracted from image?

2 answers

Your answer

Answer 1

Hi GenixPRO

I am able to replicate the issue with PNG file on azure portal UI.

Used tesseract in below sample code to do an OCR on image to convert to text before using code interpreter.

Here is my sample code. You can check the other samples here in github


import pytesseract

from openai import AzureOpenAI

import os

import time

from PIL import Image

client = AzureOpenAI(

api_key=os.getenv("AZURE_OPENAI_API_KEY","<endpointkey>"),

api_version="2024-05-01-preview",

azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT","<endpointurl>")

)

# Perform OCR to extract text from the image, please set environment variable to tesseract location and install the package prior

extracted_text = pytesseract.image_to_string(Image.open("C:/Users/Screenshots/sampleinterpreter.png"))

# Save the extracted text to a .txt file

text_file_path = "C:/Users/Screenshots/extracted_text.txt"

with open(text_file_path, "w") as text_file:

text_file.write(extracted_text)

# Upload the text file with an "assistants" purpose

file = client.files.create(

file=open(text_file_path, "rb"),

purpose='assistants'

)

# Create an assistant using the file ID

assistant = client.beta.assistants.create(

instructions="You are an AI assistant that can write and help analyze code to answer. The input passed is extracted text from an image file that needs to be analyzed.",

model="gpt-4o",

tools=[{"type": "code_interpreter"}],

tool_resources={"code_interpreter": {"file_ids": [file.id]}}

)

# Create a thread

thread = client.beta.threads.create()

# Add a user question to the thread

message = client.beta.threads.messages.create(

thread_id=thread.id,

role="user",

content=f"hi, Can you analyze the code and see if anything wrong and what the code is doing here. The extracted text from the image is as follows:\n\n{extracted_text}" # Replace this with your prompt

)

# Run the thread

run = client.beta.threads.runs.create(

thread_id=thread.id,

assistant_id=assistant.id

)

# Looping until the run completes or fails

while run.status in ['queued', 'in_progress', 'cancelling']:

time.sleep(1)

run = client.beta.threads.runs.retrieve(

thread_id=thread.id,

run_id=run.id

)

if run.status == 'completed':

messages = client.beta.threads.messages.list(

thread_id=thread.id

)

# Print the messages as a paragraph

for message in messages:

if message.role == "assistant":

content = message.content[0].text.value

print(f"The assistant's response: {content}")

elif run.status == 'requires_action':

# the assistant requires calling some functions

# and submit the tool outputs back to the run

pass

else:

print(run.status)

Output#


SyncCursorPage[Message](data=[Message(id='msg_dCopLLzpv1JdycDKQBWcbmY2', assistant_id='asst_ut47eXqPtxA9VdeThLGbFZIe', attachments=[], completed_at=None, content=[TextContentBlock(text=Text(annotations=[], value="Here is a cleaned-up and reconstructed version of the code based on the extracted input:\n\n### Key Features of the Code\n1. **Sorting Functionality (`sort_by`):**\n - `sort_by` accepts a list of callback functions (`cbs`), which determine the sorting logic.\n - Each callback can optionally specify whether sorting should be done in descending order using the `desc` attribute.\n\n2. **Custom Sorting Logic:**\n - The code iterates through the list of callbacks.\n - For each callback, it extracts values from objects `a` and `b` to compare them.\n - Depending on the `desc` attribute and whether the values are strings or numbers, it calculates the difference for sorting. If non-zero, the difference determines the order.\n\n3. **Descending Mode (`desc`):**\n - A utility function (`desc`) is provided to wrap a callback with metadata indicating descending order.\n\n4. **Example Callback Function (`sample_callback`):**\n - Demonstrates simple extraction of the `value` field from items.\n\n### How It Works\n- Use `sort_by` to define comparison logic for sorting an array, and optionally specify sorting in descending order.\n- Pass in an array of callback definitions (`cbs`), where each callback specifies how to extract comparison values.\n\n#### Example `callbacks` Output:\npython\n[{'desc': True, 'cb': <function sample_callback>}]\n\n\nThis structure indicates the callback for sorting in descending order.\n\nLet me know if you'd like help with examples or further clarification!"), type='text')], t_k75Vg4KI31emNavmVbaKxiou', attachments=[], completed_at=None, content=[TextContentBlock(text=Text(annotations=[], value="This code snippet appears to be JavaScript code dealing with a sort function (`sortBy`) and a descending comparator function (`desc`). However, there are syntactical issues and potential logic issues with the extracted text. Below, I'll highlight the extracted code issues and explain what the code is attempting to do.\n\n### Translation of the extracted code with issues highlighted:\nThe code appears garbled and incomplete in several places. Here's how it looks reformatted based on the extracted text:\njavascript\nconst sortBy = (cbs) => (a, b) => {\n for (let i = 0; i < cbs.length; i++) {\n const cb = cbs[i].desc ? cbs[i].cb : cbs[i];\n const aa = cb(a);\n const bb = cb(b);\n const diff = cbs[i].desc\n ? (typeof aa === 'string'\n ? bb.localeCompare(aa)\n : bb - aa)\n : (typeof aa === 'string'\n ? aa.localeCompare(bb)\n : aa - bb);\n if (diff !== 0) return diff;\n }\n return 0;\n};\n\nconst desc = (cb) => ({ desc: true, cb });\n\n\n### Issues and Observations in the Code\n1. **Syntax Errors**:\n - The initial `sortBy` function has improper syntax in its arrow function and parameters. It uses `>` incorrectly for the arrow function replacement.\n - The extracted conditional blocks (`diff`) have misplaced syntax: the `isString(aa)` function is used without definition or context.\n\n2. **Logical Errors**:\n - There is a confusion in comparing strings using `localeCompare`. Ensure it is dealing with strings only when using this method.\n\n3. **Missing Context**:\n - `isString` is likely meant to check if `aa` and `bb` are strings but is missing. Native JavaScript doesn't have `isString`; instead `typeof variable === 'string'` should be used.\n - The variable `cbs` is expected to be an array of objects, where each object may contain a `desc` key and a `cb` function, but no example `cbs` structure is provided.\n - There’s no explanation of what `a` and `b` represent (likely elements to be sorted).\n\n4. **Undefined `Q`**:\n - `Q` is referenced but has no definition or mention of use in this context.\n\n### What the Code is Doing\n#### `sortBy` Function:\n- `sortBy(cbs)` generates a comparator function for sorting an array of elements (`a` and `b`).\n- The `cbs` parameter is expected to be a list of callback functions or objects with `{ cb, desc }`. Each `cb` transforms elements for comparison and `desc` indicates whether sorting should be in descending order.\n- Iterates through `cbs` and compares elements (`a` and `b`) using each callback:\n - If a difference (`diff`) is found using one of the callbacks, it returns the difference.\n - Strings are compared using `.localeCompare()`.\n - Numbers or other types are compared using subtraction (`-`).\n- If all callbacks result in zero difference (i.e., equal elements), the function returns

Hope it helps

Thank you.

GenixPRO 121 Reputation points

2025-02-28T06:10:57.69+00:00

@Manas Mohanty does this mean Assistant can't accept an image directly and "read" its contents. We've to 1st convert image to text, then pass it to the assistant?
GenixPRO 121 Reputation points

2025-02-28T06:19:54.4566667+00:00

what if the image is an X-Ray or radiology image? There's no text. How will we extract information & pass to Assistant in such case?
Manas Mohanty 6,265 Reputation points Microsoft External Staff Moderator

2025-02-28T06:20:18.3466667+00:00

Hi GenixPRO

You can refer Vision Assistant "analyze_image" function from below document to analyze image without doing OCR from Tessaract..

Assistant multi-agent

Thank you
GenixPRO 121 Reputation points

2025-02-28T06:26:57.1933333+00:00

https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/gpt-with-vision

Our Assistant is created on GPT4o (vision enabled). We want chat + vision functionality (i.e. "read & interpret" and image, reply to questions in chat). Is this not natively supported in GPT4o? Do we need to convert image to Base 64 and pass to Assistant for it to interpret & reply?
Manas Mohanty 6,265 Reputation points Microsoft External Staff Moderator

2025-02-28T07:37:17.0966667+00:00

Hi GenixPRO

You can interact with image with GPT4o directly in chat playground. But Assistant, code only approach work to interact (with OCR/base64 encoding)

https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/gpt-with-vision

Thank you.
GenixPRO 121 Reputation points

2025-03-02T00:05:55.23+00:00

@Manas Mohanty it appears inefficient that one has to pass a OCR/base64 encoding image to a model with "native vision capability". We use GPT4o based Assistant, and until 2 days ago we could upload an image directly in app chat and Assistant was responding with replies to info. contained in image. Something changed and now we don't get similar functionality. However, on Azure portal "Assistants playground" we're able to upload image and Assistant continues to reply as expected.

Passing a 7-10 page document as a base64 image may not only cause us to likely exceed token limit, but also the Assistant has gone into a loop & unable to recover. There's gotta be a better way.
GenixPRO 121 Reputation points

2025-03-02T01:49:37.7233333+00:00

Passing base64 image in text/chat results in error. There's gotta be a better way. Why pass OCR/Base64 to a "Vision native model like GPT4o"? Isn't the model able to accept image directly (& convert)? Until 2 days ago our Assistant in app seemed to work (take an image, reply with info. on image content) and now it doesn't. We're able to upload image using Azure portal Assistant playground & get replies on image content. seems like we're missing something basic to get this to work in Assistants API.
JAYA SHANKAR G S 3,960 Reputation points Microsoft External Staff Moderator

2025-03-03T10:21:52.2466667+00:00

Hi GenixPRO,

Please provide the code you are using, so that we can reproduce the issue from out end.

thank you

Answer 2

Hi @GenixPRO

You are almost there, all you have to do is give a prompt in such a way it should find the image file and extract the information.

Below is the prompt i tried.


I have attached list of files for code interpreter. 

According to the question you extract the image data from the png or jpeg whatever the image files provided and provide the answer. 

The filename will be provided in question itself.

Here, i am asking to search image file given in query and extract information, you just give matching file name not the whole.

Below is the image

enter image description here

and the file name is 2025-01-21 17_32_08-jgsynap - Azure Synapse Analytics and 8 more pages - Work - Microsoft Edge.

Query asked.


Explain more about the synapse analytics image, extract any code present.

Output:

enter image description here

Content


'un all |

lot started

oVNaunAWn!

» Undo Vv =} Outline Attach to | tst tee

kv_name = "default_value_kv_name"

server_name = "default_values_server_name"

user_name = "default_values_user_name"

password_name = "default_values_password_name"

database_name = "default_values_database_name"

account_name = "default_values_account_name"

account_key = "default_values_account_key"

from notebookutils import mssparkutils

mssparkutils.notebook.exit(function_name)

M oie

NV; Move cell down

* Hide input

* Hide output

[@] Toggle parameter cell

z Merge with next cell

TH Split cell

Explanation

enter image description here

Next further to know the code usage aske below kind of query.


what is code used to give above results? please provide them

Thank you

JAYA SHANKAR G S 3,960 Reputation points Microsoft External Staff Moderator

2025-03-04T04:33:02.1933333+00:00

Hi @GenixPRO,

Did you try above solution? do let me know if any query.

Thank you.
JAYA SHANKAR G S 3,960 Reputation points Microsoft External Staff Moderator

2025-03-05T04:35:48.1433333+00:00

Hi @GenixPRO,

Just checking were you able to resolve the issue by using above given steps.
Let me know if you any query.

Thank you

Share via

Assistants API (base model: GPT4o) unable to parse uploaded image as attachment and answer questions related to info. in image.

Load and extract text from all image files to analyze {...}

Define paths for the uploaded images

2 answers

Your answer