Load Form recognizer data into a dataframe

CzarR 316

Hi, I am reading a PDF file using Form Recognizer and storing the output in a "result" variable. Following the syntax given in the Form Recognizer documentation for Azure Databricks/PySpark, my output comes out as shown below. Instead, I need to load the output into dataframes, with each table in a separate dataframe. Please suggest the syntax. Thanks in advance.

for table_idx, table in enumerate(result.tables):  
    print(  
        "Table # {} has {} rows and {} columns".format(  
        table_idx, table.row_count, table.column_count  
        )  
    )  
          
    for cell in table.cells:  
        print(  
            "...Cell[{}][{}] has content '{}'".format(  
            cell.row_index,  
            cell.column_index,  
            cell.content.encode("utf-8"),  
            )  
        )

Output

Table # 0 has 3 rows and 7 columns  
    ...Cell[0][0] has content 'b'BIOMARKER''  
    ...Cell[0][1] has content 'b'METHOD|''  
    ...Cell[0][2] has content 'b'ANALYTE''  
    ...Cell[0][3] has content 'b'RESULT''  
    ...Cell[0][4] has content 'b'THERAPY ASSOCIATION''  
    ...Cell[0][6] has content 'b'BIOMARKER LEVELE''  
    ...Cell[1][0] has content 'b'''  
    ...Cell[1][1] has content 'b'IHC''  
    ...Cell[1][2] has content 'b'Protein''  
    ...Cell[1][3] has content 'b'Negative | 0''  
    ...Cell[1][4] has content 'b'LACK OF BENEFIT''  
    ...Cell[1][5] has content 'b'alectinib, brigatinib''  
    ...Cell[1][6] has content 'b'Level 1''  
    ...Cell[2][0] has content 'b'ALK''  
    ...Cell[2][1] has content 'b'Seq''  
    ...Cell[2][2] has content 'b'RNA-Tumor''  
    ...Cell[2][3] has content 'b'Fusion Not Detected''  
    ...Cell[2][5] has content 'b'ceritinib''  
    ...Cell[2][6] has content 'b'Level 1''  
    ...Cell[3][1] has content 'b'''  
    ...Cell[3][2] has content 'b'''  
    ...Cell[3][3] has content 'b'''  
    ...Cell[3][5] has content 'b'crizotinib''  
    ...Cell[3][6] has content 'b'Level 1''  
Table # 1 has 3 rows and 4 columns  
...Cell[0][0] has content 'b'''  
...Cell[0][1] has content 'b'''  
...Cell[0][2] has content 'b'''  
...Cell[0][3] has content 'b'''  
...Cell[1][0] has content 'b'NTRK1/2/3''  
...Cell[1][1] has content 'b'Seq''  
...Cell[1][2] has content 'b'RNA-Tumor''  
...Cell[1][3] has content 'b'Fusion Not Detected''  
...Cell[2][0] has content 'b'Tumor Mutational Burden''  
...Cell[2][1] has content 'b'Seq''  
...Cell[2][2] has content 'b'DNA-Tumor''  
...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-11-04T06:03:59.447+00:00

Hello @CzarR ,

Thanks for the question and for using the MS Q&A platform.

Could you please share the document you are referring to?

CzarR 316 Reputation points

2022-11-04T14:00:02.35+00:00

257245-pd-05ymwgtagdvxsqhvzddvzz2021-12-07zzdh-05z0qxy2gg.pdf

Page 3 has the table. Thanks for the help.

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-11-07T06:25:06.457+00:00

Hello @CzarR ,

When you say "as per the syntax given for the Azure Databricks/pyspark in the documentation for the Form recognizer", could you kindly share the documentation link here?

CzarR 316 Reputation points

2022-11-07T14:17:33.217+00:00

https://learn.microsoft.com/en-us/python/api/overview/azure/ai-formrecognizer-readme?view=azure-python

Here you go. Thank you.

anonymous_user 0 Reputation points

2023-08-12T23:21:02.13+00:00

Try this to get all your tables out of a result:

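import numpy as np
import pandas as pd

result_dict = result.to_dict()

all_tables = []
for atable in result_dict["tables"]:
    row_count = atable["row_count"]
    column_count = atable["column_count"]
    # Collect the cell contents in the order they were returned
    contents = [aval["content"] for aval in atable["cells"]]
    # reshape only works if there are exactly row_count * column_count cells
    df = pd.DataFrame(np.array(contents).reshape(row_count, column_count))
    # Promote the first row to column headers, then drop it from the data
    df.columns = df.iloc[0]
    df = df.drop(df.index[0])
    all_tables.append(df)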
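
The reshape in the snippet above assumes that every table returns exactly row_count * column_count cells in row-major order. The output in the question skips some cells (for example, there is no Cell[0][5]), and in that case reshape raises an error. Here is a minimal sketch that places each cell by its row_index and column_index instead. The helper names table_to_grid and grid_to_dataframe and the sample dictionary are hypothetical, and the key names assume the structure that result.to_dict() produces:

```python
def table_to_grid(table_dict):
    """Return a list of rows, placing each cell by its row/column index."""
    cells = table_dict["cells"]
    # Size the grid from the reported counts, growing it if a cell index exceeds them
    n_rows = max([table_dict["row_count"]] + [c["row_index"] + 1 for c in cells])
    n_cols = max([table_dict["column_count"]] + [c["column_index"] + 1 for c in cells])
    # Cells that the service did not return stay as empty strings
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["row_index"]][c["column_index"]] = c["content"]
    return grid


def grid_to_dataframe(grid):
    """Use the first row as the header and the remaining rows as data."""
    import pandas as pd  # imported here so table_to_grid works without pandas

    return pd.DataFrame(grid[1:], columns=grid[0])


# Hypothetical sample shaped like one entry of result.to_dict()["tables"];
# the cell at row 0, column 2 is deliberately missing.
sample_table = {
    "row_count": 2,
    "column_count": 3,
    "cells": [
        {"row_index": 0, "column_index": 0, "content": "BIOMARKER"},
        {"row_index": 0, "column_index": 1, "content": "METHOD"},
        {"row_index": 1, "column_index": 0, "content": "ALK"},
        {"row_index": 1, "column_index": 1, "content": "Seq"},
        {"row_index": 1, "column_index": 2, "content": "RNA-Tumor"},
    ],
}

grid = table_to_grid(sample_table)
```

With a real result, something like [grid_to_dataframe(table_to_grid(t)) for t in result.to_dict()["tables"]] should give one dataframe per table. Merged cells (row_span or column_span greater than 1) are not expanded in this sketch.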

Accepted answer


Answer 1

romungi-MSFT 48,906 Microsoft Employee Moderator

I think you should be able to use the to_dict() method of each DocumentTable in the result and load it into a dataframe.

import pandas as pd  

tables_pd = {}
for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    # Keep one dataframe per table, keyed by the table index
    tables_pd[table_idx] = pd.DataFrame.from_dict(table.to_dict())

I think the above should work.

If an answer is helpful, please accept or upvote it, as this might help other community members reading this thread.

CzarR 316 Reputation points

2022-11-08T14:45:35.807+00:00

Thank you. That seems to work.

Patrick Toulson 0 Reputation points

2023-04-13T18:08:37.87+00:00

I get the following error:

ValueError: All arrays must be of the same length

Anonymous

2023-10-06T10:14:10.6466667+00:00

It won't always work, because the dict contains row_count, column_count, and a few other elements, followed by a list of cells, where each cell in turn has several elements of its own. Hence table.to_dict() passed directly to a dataframe will not produce the desired outcome.
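
To illustrate the point about the dict structure: the tabular data sits in the "cells" list, so one option is to build a long dataframe from that list and pivot it into rows and columns. This is only a sketch under assumptions; the sample dictionary is hypothetical and mimics the keys described above, with one cell left out on purpose:

```python
import pandas as pd

# Hypothetical sample shaped like table.to_dict(): scalar counts plus
# lists of different lengths, which is why loading it directly fails.
sample_table = {
    "row_count": 2,
    "column_count": 3,
    "cells": [
        {"row_index": 0, "column_index": 0, "content": "BIOMARKER"},
        {"row_index": 0, "column_index": 1, "content": "METHOD"},
        {"row_index": 1, "column_index": 0, "content": "ALK"},
        {"row_index": 1, "column_index": 1, "content": "Seq"},
        {"row_index": 1, "column_index": 2, "content": "RNA-Tumor"},
    ],
    "bounding_regions": [],
    "spans": [],
}

# One row per cell: columns row_index, column_index, content
cells_df = pd.DataFrame(sample_table["cells"])

# Spread the cells into a grid; positions with no cell become empty strings
wide = cells_df.pivot(
    index="row_index", columns="column_index", values="content"
).fillna("")
```

Here wide has one row per row_index and one column per column_index. Promoting the first row to headers, as in the earlier snippet, would be a separate step.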
