OpenAI o1 unable to understand documents with lines connecting data (leader lines, elbow connectors, etc.)

Tyler Suard 155 Reputation points
2025-03-14T18:03:24.61+00:00

See the attached image. Upload it to o1 and try asking it, "What are the possible options for the rightmost empty box?" This is easy for a human to do, but o1 can't do it, even with very detailed prompting.

Azure OpenAI Service

2 answers

  1. Prashanth Veeragoni 5,245 Reputation points Microsoft External Staff Moderator
    2025-03-17T13:22:12.33+00:00

    Hi Tyler Suard,

    I understand that the issue with the OpenAI o1 model in Azure OpenAI not interpreting leader lines and elbow connectors in documents like the one you uploaded is primarily due to how the model processes images and text. Here's a step-by-step approach to solving it:

    Issue Breakdown:

    1. Complex Layouts with Leader Lines

    The document has connecting lines linking data points instead of a straightforward table format.

    OCR (Optical Character Recognition) extracts text but may not capture relationships properly.

    2. GPT-Based Models Struggle with Structure

    While GPT models can process OCR text, they may not infer relationships based on leader lines.

    Without explicit annotations or structured text, it is hard for the model to determine the correct connections.

    Solutions:

    1. Preprocess the Image Before Feeding It into OpenAI's Model

    Convert Image to Text + Structured Format

    Use OCR tools like Tesseract, Azure AI Document Intelligence, or OpenAI's Vision models to extract text.

    Parse the extracted text into a structured table.

    Example Approach using Python + Tesseract

    import pytesseract
    from PIL import Image

    # Load the uploaded document image
    image_path = "/mnt/data/image.png"
    img = Image.open(image_path)

    # Extract text with Tesseract OCR
    extracted_text = pytesseract.image_to_string(img)
    print(extracted_text)  # Review the extracted text
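
    To get from the raw OCR text to the structured table mentioned above, the extracted lines still need to be parsed. The sketch below is purely illustrative: it assumes each OCR line comes out roughly as an option code followed by its description (e.g. "K1 Without connector"), so the regex is a placeholder you would adapt to the real Tesseract output for your datasheet.

    import re

    # Assumed line format: "<code> <description>", e.g. "K1 Without connector".
    # The regex is a placeholder; adjust it to the actual OCR output.
    options = {}
    for line in extracted_text.splitlines():
        match = re.match(r"^\s*([A-Z]\d*|\d+(?:\.\d+)?)\s+(.+)$", line)
        if match:
            code, description = match.groups()
            options[code] = description.strip()

    print(options)  # e.g. {"K1": "Without connector", "K2": "Connector without rectifier with LED", ...}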
    

    Enhancing OCR with Layout Parsing (LayoutLMv3 / Azure Document Intelligence)

    Use Azure AI Document Intelligence (previously Form Recognizer) to detect structured elements.

    Convert results into a JSON or tabular format.
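
    A minimal sketch of that step with the azure-ai-formrecognizer Python SDK is shown below; the endpoint, key, and file path are placeholders for your own Document Intelligence resource, and the layout model ("prebuilt-layout") returns words, lines, and tables rather than the leader-line relationships themselves.

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Placeholder endpoint and key for your Document Intelligence resource
    client = DocumentAnalysisClient(
        "https://<your-resource>.cognitiveservices.azure.com/",
        AzureKeyCredential("<your-key>"),
    )

    # Analyze the uploaded page with the layout model
    with open("/mnt/data/image.png", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()

    # Flatten any detected tables into simple row lists for downstream use
    for table in result.tables:
        rows = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
        for cell in table.cells:
            rows[cell.row_index][cell.column_index] = cell.content
        print(rows)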

    2. Reformat the Data for OpenAI o1

    Transform extracted text into structured input for GPT

    Create a JSON or key-value format to represent relationships.

    Example JSON Format for the Document

    {
      "Ordering Code": "RPE3-04",
      "Solenoid Operated Directional Control Valve": {
          "Nominal size": "04(D02)",
          "Number of valve positions": ["2 positions", "3 positions"],
          "Seals": ["NBR", "FPM (Viton)"],
          "Orifice in P-Port": ["No orifice", "Ø0.8 mm", "Ø1.2 mm", "Ø1.5 mm", "Ø2.1 mm", "Ø2.7 mm"],
          "Manual override": ["Standard", "Covered with rubber protective boot"]
      },
      "Electrical Connector": {
          "K1": "Without connector",
          "K2": "Connector without rectifier with LED",
          "K3": "Connector with rectifier",
          "K4": "Connector with rectifier with LED and quenching diode",
          "K5": "Connector with integrated rectifier and LED"
      }
    }
    
    

    Feed this structured data to OpenAI for better understanding.

    Now, o1 can process the information without struggling with leader lines.
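
    As a rough sketch of that last step, the JSON can be sent as context in a chat completion call using the openai Python SDK's Azure client. The endpoint, key, API version, and deployment name below are placeholders for your own Azure OpenAI resource.

    import json
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-key>",                                        # placeholder
        api_version="2024-12-01-preview",                            # use a version your resource supports
    )

    # Abbreviated version of the structured document shown above
    structured_doc = {
        "Ordering Code": "RPE3-04",
        "Electrical Connector": {
            "K1": "Without connector",
            "K2": "Connector without rectifier with LED",
            "K3": "Connector with rectifier",
            "K4": "Connector with rectifier with LED and quenching diode",
            "K5": "Connector with integrated rectifier and LED",
        },
        # ...remaining sections from the JSON above
    }

    response = client.chat.completions.create(
        model="<your-o1-deployment>",  # placeholder deployment name
        messages=[
            {
                "role": "user",
                "content": "Here is a structured representation of the ordering-code diagram:\n"
                           + json.dumps(structured_doc, indent=2)
                           + "\n\nWhat are the possible options for the rightmost empty box?",
            }
        ],
    )
    print(response.choices[0].message.content)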

    3. Use Vision Models with Spatial Awareness

    Try GPT-4V (Vision) or Azure Document Intelligence

    These models can interpret relationships in structured documents with leader lines.

    Use Azure Form Recognizer’s "Key-Value Pair Extraction" or "Table Extraction" to create a structured dataset.
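
    For the key-value pair route specifically, a small sketch with the prebuilt-document model looks like the following (same placeholder endpoint and key as before); how well the extracted pairs reflect the leader-line connections still depends on the layout of the page.

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        "https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
        AzureKeyCredential("<your-key>"),                        # placeholder
    )

    with open("/mnt/data/image.png", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-document", document=f)
    result = poller.result()

    # Print each detected key-value pair; pairs without a detected value are skipped
    for kv in result.key_value_pairs:
        if kv.key and kv.value:
            print(f"{kv.key.content} -> {kv.value.content}")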

    Hope this helps. Do let us know if you have any further queries.

    ------------- 

    If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful".

    Thank you. 


  2. JAYA SHANKAR G S 3,960 Reputation points Microsoft External Staff Moderator
    2025-03-18T12:04:23.85+00:00

    Hi @Tyler Suard,

    I have tried this from my end with your sample image and got the same response as you.

    I can confirm that the model reads the ordering code correctly up to the slash symbol (/), but after the slash it returns the options in reverse order. That is why you get the solenoid coil labels when you ask about the rightmost options.

    So I tried the prompt below, which extracts the information in JSON format and answers your query correctly.

    Extract and map the correct option or label corresponding to an empty box in an image where a line connects the box to a label. Ignore (/) and go forward.
    
    # Problem Statement
    
    You are tasked to map and associate an empty box in an image with corresponding options indicated by labels. Each box in the image has a line that connects it to one or more labeled options. Your goal is to correctly identify and map these connections. Focus on precisely interpreting the information visually presented and avoid ambiguities.
    
    # Steps
    
    1. **Input Description**:
       - Parse or describe the given image with empty boxes connected to labeled options.
       - Identify all empty boxes present in the image.
       - Detect connecting lines between the boxes and their corresponding labeled options.
    
    2. **Mapping Process**:
       - For each empty box, follow the line that connects it to one or more labeled options.
       - Ensure the correct association, especially in cases of overlapping or intersecting lines.
       - Capture all mappings in a structured format for ease of interpretation.
    
    3. **Output Mapping**:
       - Present mappings in an easy-to-read format such as a table, list, or structured JSON.
       - Each mapping should clearly indicate the box identifier (if any) or position and its respective option label(s).
    
    4. **Edge Cases**:
       - Consider scenarios where a line is broken, faint, or unclear.
       - Handle multiple connections (multi-label options) or ambiguous associations by describing the most likely mapping based on input patterns.
    
    # Output Format
    
    The mapping should be provided in the following format:
    
    json
    {
      "mappings": [
        {
          "box": "Box 1", // A unique identifier or relative position
          "options": ["Option A", "Option C"] // List of connected options
        },
        {
          "box": "Box 2",
          "options": ["Option B"]
        }
      ]
    }
    - Replace placeholders like `Box 1` or `Option A` with appropriate identifiers from the provided input.
    
    # Notes
    
    - Ensure mappings are accurate and exhaustive; no box should be left unaccounted for.
    - Pay attention to cases where multiple lines intersect or overlap to avoid misinterpretation.
    - If the image or data includes additional visual features (e.g., color, thickness of lines), consider leveraging this information to refine mappings.
    - If you encounter any ambiguity in identifying connections, provide reasoning for the determined association and clarify any assumptions made.
    
    # Examples
    
    ### Example Mapping:
    
    *Image Description*:  
    - You have three boxes labeled "Box 1," "Box 2," and "Box 3."  
    - Lines connect these boxes to options labeled "Option A," "Option B," and "Option C."
    
    **Generated Mapping**:
    
    json
    {
      "mappings": [
        {
          "box": "Box 1",
          "options": ["Option A"]
        },
        {
          "box": "Box 2",
          "options": ["Option B"]
        },
        {
          "box": "Box 3",
          "options": ["Option C"]
        }
      ]
    }
    *(Note: Real-world examples may include placeholders like [Box Label] or [Option Label] to illustrate mappings specific to the input image.)*
    
    Sample: RPE3-04 2 ? 01200 E1 K1 N2 D1 V
    
    ---
    
    If clarification or initial analysis is being requested for line detection, ensure that proper visual-recognition tools (e.g., OCR tools, image processing libraries) are incorporated into your workflow.
    

    Here, JSON is not the final output; it just stores the extracted info in this format so that the model is able to answer your query properly.

    Output:

    (Screenshot of the model's response showing the correct mapping.)

    You don't need to use this exact prompt; adjust it to your requirements with a more specific expected outcome.
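
    If you are calling the model through the API rather than the playground, one possible way to wire this up is sketched below: the prompt text goes into the message together with the image as a base64 data URL. The endpoint, key, API version, and deployment name are placeholders, and whether image input is accepted depends on the deployment (e.g. a vision-capable model such as GPT-4o or o1) and the API version you use, so treat this as illustrative only.

    import base64
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
        api_key="<your-key>",                                        # placeholder
        api_version="2024-12-01-preview",                            # placeholder
    )

    mapping_prompt = "..."  # the mapping prompt shown above, stored as a plain string

    # Encode the uploaded page as a base64 data URL
    with open("/mnt/data/image.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="<your-vision-capable-deployment>",  # placeholder deployment name
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": mapping_prompt
                        + "\n\nWhat are the possible options for the rightmost empty box?"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    )
    print(response.choices[0].message.content)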

    Please do let me know if you have any further queries.

    Thank you

