
Help with post-processing raw OCR data from Document Intelligence

Kimberley Attwell 0 Reputation points
2025-12-15T22:52:20.4566667+00:00

When I use Document Intelligence to OCR a PDF, the results shown in Azure are a perfect table with all the data in the correct columns. But when I download the JSON file and try to parse it, the data is very messy, with values split into different columns, etc.

Can someone please give me some tips on how to get the data as clean as what I see in Azure? Document Intelligence has already done all the hard work.

Thank you,

Kimberley Attwell

Azure Document Intelligence in Foundry Tools

2 answers

  1. Anonymous
    2025-12-25T19:31:03.0066667+00:00

    Hi Kimberley Attwell

    It sounds like you’re facing some challenges parsing the JSON output from Azure Document Intelligence after running OCR on your PDF. The results in Azure look good, so I understand how frustrating it is when the downloaded data doesn’t match.

    Here are some tips to help you get cleaner data from the JSON output:

    Examine Table Structure: Sometimes, the table structure in your document might be complex or not easily visualized in a flat JSON output. Ensure your PDF's tables are simple enough for the OCR to interpret correctly. Complex tables may lead to messy data.

    Training Custom Models: If you continually encounter issues with the data extraction for specific documents, consider training a custom extraction model. This can improve how tables and data are interpreted. You can train the model using labeled examples to ensure better accuracy.

    Post-Processing Logic: After retrieving the JSON data, you might need to implement some post-processing. This could involve:

    • Writing scripts to clean and reformat the JSON output.
    • Merging cells programmatically if they split incorrectly.

    JSON Parsing Tools: Utilize libraries in your programming language of choice that can help parse and clean JSON data effectively. For example, libraries like pandas in Python can be particularly effective for cleaning dataframes obtained from JSON.

    Check Document Compatibility: Ensure that your document is in a format supported by the OCR, free from watermarks or unusual formatting that could impair recognition.
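As an illustration of the cell-merging step mentioned above, here is a minimal sketch in plain Python. It assumes the table has already been loaded as a list of rows (each row a list of strings) and that you know which two adjacent columns were split apart. The helper name `merge_columns` and the sample rows are hypothetical, not part of any Azure SDK.

```python
def merge_columns(rows, left, sep=" "):
    """Join column `left` with the column to its right in every row.

    Empty pieces are skipped so rows that were not split keep their text as-is.
    """
    merged = []
    for row in rows:
        parts = [piece for piece in (row[left], row[left + 1]) if piece]
        merged.append(row[:left] + [sep.join(parts)] + row[left + 2:])
    return merged


# Hypothetical rows where the first column was split in one row only.
rows = [
    ["Description", "", "Amount"],
    ["Blue", "widget", "120.00"],
    ["Red widget", "", "80.50"],
]

fixed = merge_columns(rows, 0)
for row in fixed:
    print(row)
```

Whether a split like this needs repairing at all depends on how the table was read. It is worth checking first that the cells were placed by their row and column indices, since a misread of the JSON can look like a split column.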

    If these suggestions don’t resolve the issue, you might want to look into specific data extraction quirks related to your document types. Sometimes, unique formatting or layouts in PDFs can create challenges.


    I hope this helps. Let me know if you have any further questions.

    Thank you!



  2. AI answer

    2025-12-15T22:52:29.77+00:00

    To improve the cleanliness of the data extracted from Azure Document Intelligence, consider the following tips:

    1. Understand the JSON Structure: Familiarize yourself with the structure of the JSON output. The data is organized in a way that may not directly map to the visual representation you see in Azure. Look for specific keys that contain the table data, such as tables, cells, and their respective attributes like rowIndex, columnIndex, and content.
    2. Post-Processing Logic: Implement a post-processing script that reorganizes the data based on the indices provided in the JSON. You can create a two-dimensional array or a similar structure that reflects the table format by iterating through the cells and placing them in the correct row and column based on their indices.
    3. Handle Merged Cells: If your tables have merged cells (where one cell spans multiple columns or rows), ensure your parsing logic accounts for this by checking the columnSpan and rowSpan attributes in the cell data.
    4. Filter Unnecessary Data: The JSON may include additional metadata or irrelevant information. Focus on extracting only the necessary data fields that correspond to the table structure you need.
    5. Use Markdown Output: If you want to maintain the formatting seen in Azure, consider using the outputContentFormat=markdown option when making your API calls. This will give you a Markdown representation of the extracted content, which may be easier to work with.
    6. Iterate and Refine: After your initial parsing, review the output and refine your logic as needed. You may need to adjust your parsing rules based on the specific structure of the documents you are processing.
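Steps 1 to 3 can be sketched roughly as follows. This is a minimal example using only the Python standard library, and it rests on a few assumptions: that the downloaded file has the tables under `analyzeResult.tables` (the code falls back to a top-level `tables` key otherwise), that each table carries `rowCount`, `columnCount` and `cells`, and that spanned cells should repeat their text in every position they cover. The function names and the sample table are made up for illustration.

```python
import json


def table_to_grid(table):
    """Build a 2-D list by placing each cell at its rowIndex/columnIndex."""
    grid = [["" for _ in range(table["columnCount"])]
            for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        row, col = cell["rowIndex"], cell["columnIndex"]
        # Collapse line breaks inside a cell so each value stays on one line.
        text = " ".join(cell.get("content", "").split())
        # rowSpan/columnSpan are only present on merged cells; default to 1.
        for dr in range(cell.get("rowSpan", 1)):
            for dc in range(cell.get("columnSpan", 1)):
                grid[row + dr][col + dc] = text
    return grid


def load_tables(path):
    """Read a downloaded result file and return one grid per table."""
    with open(path, encoding="utf-8") as handle:
        doc = json.load(handle)
    result = doc.get("analyzeResult", doc)  # assumed wrapper key
    return [table_to_grid(t) for t in result.get("tables", [])]


# Hypothetical table in the shape described above, cells deliberately unordered.
sample = {
    "rowCount": 3,
    "columnCount": 3,
    "cells": [
        {"rowIndex": 1, "columnIndex": 2, "content": "9.99"},
        {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Qty"},
        {"rowIndex": 0, "columnIndex": 2, "content": "Unit\nprice"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Widget"},
        {"rowIndex": 1, "columnIndex": 1, "content": "2"},
        {"rowIndex": 2, "columnIndex": 0, "columnSpan": 2, "content": "Total"},
        {"rowIndex": 2, "columnIndex": 2, "content": "19.98"},
    ],
}

grid = table_to_grid(sample)
for line in grid:
    print(line)
```

The point of the sketch is that the order of `cells` in the file is never used. Reading cells in list order, or rebuilding columns from the `pages`, `lines` or `words` sections, is a common reason the parsed data looks messier than the table shown in Azure.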

    By following these steps, you should be able to clean up the data extracted from Document Intelligence and make it more usable for your needs.
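Once a table has been rebuilt as rows and columns, one way to make it usable elsewhere is to write it out as CSV. The sketch below uses Python's standard `csv` module. The `grid_to_csv` helper and the sample grid are hypothetical, and CSV is only one possible target format.

```python
import csv
import io


def grid_to_csv(grid):
    """Serialize a list of rows to CSV text, quoting values where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(grid)
    return buffer.getvalue()


# Hypothetical grid; the comma inside "Widget, large" must not split the column.
grid = [
    ["Item", "Qty", "Price"],
    ["Widget, large", "2", "9.99"],
]

csv_text = grid_to_csv(grid)
print(csv_text)
```

Using a CSV writer rather than joining values with commas by hand keeps cell text that itself contains commas or quotes in a single column.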



    AI-generated content may be incorrect. Read our transparency notes for more information.
