Assistants API (base model: GPT4o) unable to parse uploaded image as attachment and answer questions related to info. in image.

GenixPRO 121 Reputation points
2025-02-26T08:50:16.7+00:00
  1. When using Assistants Playground on Azure portal: We create an Assistant using Assistants Playground. Then upload a PNG image attachment of a table with some records/content. In our prompt (see sample below) we ask questions related to this info. We get a reply from the Assistant created in Playground.

*Note: We see that Assistant is calling lib. and defining path for image etc. See below.


Load and extract text from all image files to analyze {...}

from PIL import Image

import pytesseract

Define paths for the uploaded images

image_paths = [

'/mnt/data/assistant-9s4pfyeg8W5RSwkHjdYqA4',

'/mnt/data/assistant-Uhoc9WZoxz7Tw8sRKeWxC4',

'/mnt/data/assistant-HE9LPS

  1. When using Assistants API (w/ Assistant Thread ID) from our mobile app: Using Assistant Thread ID (created above), we tried to upload the same PNG image attachment and pass the same prompt. However, we keep getting a standard reply "It seems there was an issue with extracting the text from the image. Let's try again"

In this case, we've simply uploaded image to the thread [and not to vector store].

Sample Prompt:

Attached herein are PNG, JPG or JPEG image files. Use code interpreter to sequentially extract information from each file; read, understand, and interpret all information to make relevant inference. Then answer the following questions using the information contained in these files and any other contextual information shared earlier. <followed by questions>

Question: How do make the Assistant API work in this case, so that it can reply with information extracted from image?

Azure AI services
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.
3,632 questions
0 comments No comments
{count} votes

2 answers

Sort by: Most helpful
  1. Manas Mohanty 6,115 Reputation points Microsoft External Staff Moderator
    2025-02-27T11:40:46.6933333+00:00

    Hi GenixPRO

    I am able to replicate the issue with PNG file on azure portal UI.

    Used tesseract in below sample code to do an OCR on image to convert to text before using code interpreter.

    Here is my sample code. You can check the other samples here in github

    
    import pytesseract
    
    from openai import AzureOpenAI
    
    import os
    
    import time
    
    from PIL import Image
    
    client = AzureOpenAI(
    
    api_key=os.getenv("AZURE_OPENAI_API_KEY","<endpointkey>"),
    
    api_version="2024-05-01-preview",
    
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT","<endpointurl>")
    
    )
    
    # Perform OCR to extract text from the image, please set environment variable to tesseract location and install the package prior
    
    extracted_text = pytesseract.image_to_string(Image.open("C:/Users/Screenshots/sampleinterpreter.png"))
    
    # Save the extracted text to a .txt file
    
    text_file_path = "C:/Users/Screenshots/extracted_text.txt"
    
    with open(text_file_path, "w") as text_file:
    
    text_file.write(extracted_text)
    
    # Upload the text file with an "assistants" purpose
    
    file = client.files.create(
    
    file=open(text_file_path, "rb"),
    
    purpose='assistants'
    
    )
    
    # Create an assistant using the file ID
    
    assistant = client.beta.assistants.create(
    
    instructions="You are an AI assistant that can write and help analyze code to answer. The input passed is extracted text from an image file that needs to be analyzed.",
    
    model="gpt-4o",
    
    tools=[{"type": "code_interpreter"}],
    
    tool_resources={"code_interpreter": {"file_ids": [file.id]}}
    
    )
    
    # Create a thread
    
    thread = client.beta.threads.create()
    
    # Add a user question to the thread
    
    message = client.beta.threads.messages.create(
    
    thread_id=thread.id,
    
    role="user",
    
    content=f"hi, Can you analyze the code and see if anything wrong and what the code is doing here. The extracted text from the image is as follows:\n\n{extracted_text}" # Replace this with your prompt
    
    )
    
    # Run the thread
    
    run = client.beta.threads.runs.create(
    
    thread_id=thread.id,
    
    assistant_id=assistant.id
    
    )
    
    # Looping until the run completes or fails
    
    while run.status in ['queued', 'in_progress', 'cancelling']:
    
    time.sleep(1)
    
    run = client.beta.threads.runs.retrieve(
    
    thread_id=thread.id,
    
    run_id=run.id
    
    )
    
    if run.status == 'completed':
    
    messages = client.beta.threads.messages.list(
    
    thread_id=thread.id
    
    )
    
    # Print the messages as a paragraph
    
    for message in messages:
    
    if message.role == "assistant":
    
    content = message.content[0].text.value
    
    print(f"The assistant's response: {content}")
    
    elif run.status == 'requires_action':
    
    # the assistant requires calling some functions
    
    # and submit the tool outputs back to the run
    
    pass
    
    else:
    
    print(run.status)
    
    

    Output#

    
    SyncCursorPage[Message](data=[Message(id='msg_dCopLLzpv1JdycDKQBWcbmY2', assistant_id='asst_ut47eXqPtxA9VdeThLGbFZIe', attachments=[], completed_at=None, content=[TextContentBlock(text=Text(annotations=[], value="Here is a cleaned-up and reconstructed version of the code based on the extracted input:\n\n### Key Features of the Code\n1. **Sorting Functionality (`sort_by`):**\n - `sort_by` accepts a list of callback functions (`cbs`), which determine the sorting logic.\n - Each callback can optionally specify whether sorting should be done in descending order using the `desc` attribute.\n\n2. **Custom Sorting Logic:**\n - The code iterates through the list of callbacks.\n - For each callback, it extracts values from objects `a` and `b` to compare them.\n - Depending on the `desc` attribute and whether the values are strings or numbers, it calculates the difference for sorting. If non-zero, the difference determines the order.\n\n3. **Descending Mode (`desc`):**\n - A utility function (`desc`) is provided to wrap a callback with metadata indicating descending order.\n\n4. **Example Callback Function (`sample_callback`):**\n - Demonstrates simple extraction of the `value` field from items.\n\n### How It Works\n- Use `sort_by` to define comparison logic for sorting an array, and optionally specify sorting in descending order.\n- Pass in an array of callback definitions (`cbs`), where each callback specifies how to extract comparison values.\n\n#### Example `callbacks` Output:\npython\n[{'desc': True, 'cb': <function sample_callback>}]\n\n\nThis structure indicates the callback for sorting in descending order.\n\nLet me know if you'd like help with examples or further clarification!"), type='text')], t_k75Vg4KI31emNavmVbaKxiou', attachments=[], completed_at=None, content=[TextContentBlock(text=Text(annotations=[], value="This code snippet appears to be JavaScript code dealing with a sort function (`sortBy`) and a descending comparator function (`desc`). However, there are syntactical issues and potential logic issues with the extracted text. Below, I'll highlight the extracted code issues and explain what the code is attempting to do.\n\n### Translation of the extracted code with issues highlighted:\nThe code appears garbled and incomplete in several places. Here's how it looks reformatted based on the extracted text:\njavascript\nconst sortBy = (cbs) => (a, b) => {\n for (let i = 0; i < cbs.length; i++) {\n const cb = cbs[i].desc ? cbs[i].cb : cbs[i];\n const aa = cb(a);\n const bb = cb(b);\n const diff = cbs[i].desc\n ? (typeof aa === 'string'\n ? bb.localeCompare(aa)\n : bb - aa)\n : (typeof aa === 'string'\n ? aa.localeCompare(bb)\n : aa - bb);\n if (diff !== 0) return diff;\n }\n return 0;\n};\n\nconst desc = (cb) => ({ desc: true, cb });\n\n\n### Issues and Observations in the Code\n1. **Syntax Errors**:\n - The initial `sortBy` function has improper syntax in its arrow function and parameters. It uses `>` incorrectly for the arrow function replacement.\n - The extracted conditional blocks (`diff`) have misplaced syntax: the `isString(aa)` function is used without definition or context.\n\n2. **Logical Errors**:\n - There is a confusion in comparing strings using `localeCompare`. Ensure it is dealing with strings only when using this method.\n\n3. **Missing Context**:\n - `isString` is likely meant to check if `aa` and `bb` are strings but is missing. Native JavaScript doesn't have `isString`; instead `typeof variable === 'string'` should be used.\n - The variable `cbs` is expected to be an array of objects, where each object may contain a `desc` key and a `cb` function, but no example `cbs` structure is provided.\n - There’s no explanation of what `a` and `b` represent (likely elements to be sorted).\n\n4. **Undefined `Q`**:\n - `Q` is referenced but has no definition or mention of use in this context.\n\n### What the Code is Doing\n#### `sortBy` Function:\n- `sortBy(cbs)` generates a comparator function for sorting an array of elements (`a` and `b`).\n- The `cbs` parameter is expected to be a list of callback functions or objects with `{ cb, desc }`. Each `cb` transforms elements for comparison and `desc` indicates whether sorting should be in descending order.\n- Iterates through `cbs` and compares elements (`a` and `b`) using each callback:\n - If a difference (`diff`) is found using one of the callbacks, it returns the difference.\n - Strings are compared using `.localeCompare()`.\n - Numbers or other types are compared using subtraction (`-`).\n- If all callbacks result in zero difference (i.e., equal elements), the function returns
    
    

    Hope it helps

    Thank you.


  2. JAYA SHANKAR G S 3,960 Reputation points Microsoft External Staff Moderator
    2025-03-03T12:07:52.2766667+00:00

    Hi @GenixPRO

    You are almost there, all you have to do is give a prompt in such a way it should find the image file and extract the information.

    Below is the prompt i tried.

    
    I have attached list of files for code interpreter. 
    
    According to the question you extract the image data from the png or jpeg whatever the image files provided and provide the answer. 
    
    The filename will be provided in question itself.
    
    

    Here, i am asking to search image file given in query and extract information, you just give matching file name not the whole.

    Below is the image

    enter image description here

    and the file name is 2025-01-21 17_32_08-jgsynap - Azure Synapse Analytics and 8 more pages - Work - Microsoft​ Edge.

    Query asked.

    
    Explain more about the synapse analytics image, extract any code present.
    
    

    Output:

    enter image description here

    Content

    
    'un all |
    
    lot started
    
    oVNaunAWn!
    
    » Undo Vv =} Outline Attach to | tst tee
    
    kv_name = "default_value_kv_name"
    
    server_name = "default_values_server_name"
    
    user_name = "default_values_user_name"
    
    password_name = "default_values_password_name"
    
    database_name = "default_values_database_name"
    
    account_name = "default_values_account_name"
    
    account_key = "default_values_account_key"
    
    from notebookutils import mssparkutils
    
    mssparkutils.notebook.exit(function_name)
    
    M oie
    
    NV; Move cell down
    
    * Hide input
    
    * Hide output
    
    [@] Toggle parameter cell
    
    z Merge with next cell
    
    TH Split cell
    
    

    Explanation

    enter image description here

    Next further to know the code usage aske below kind of query.

    
    what is code used to give above results? please provide them
    
    

    Thank you


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.