Load Form Recognizer data into a dataframe

CzarR 296 Reputation points
2022-11-03T15:46:27.343+00:00

Hi, I am reading a PDF file with Form Recognizer and storing the output in a "result" variable. Following the syntax given in the Form Recognizer documentation for Azure Databricks/PySpark, my output comes out as shown below. Instead, I need to load the output into dataframes, with each table in a separate dataframe. Please suggest the syntax. Thanks in advance.

for table_idx, table in enumerate(result.tables):  
    print(  
        "Table # {} has {} rows and {} columns".format(  
        table_idx, table.row_count, table.column_count  
        )  
    )  
          
    for cell in table.cells:  
        print(  
            "...Cell[{}][{}] has content '{}'".format(  
            cell.row_index,  
            cell.column_index,  
            cell.content.encode("utf-8"),  
            )  
        )  

Output

Table # 0 has 3 rows and 7 columns  
    ...Cell[0][0] has content 'b'BIOMARKER''  
    ...Cell[0][1] has content 'b'METHOD|''  
    ...Cell[0][2] has content 'b'ANALYTE''  
    ...Cell[0][3] has content 'b'RESULT''  
    ...Cell[0][4] has content 'b'THERAPY ASSOCIATION''  
    ...Cell[0][6] has content 'b'BIOMARKER LEVELE''  
    ...Cell[1][0] has content 'b'''  
    ...Cell[1][1] has content 'b'IHC''  
    ...Cell[1][2] has content 'b'Protein''  
    ...Cell[1][3] has content 'b'Negative | 0''  
    ...Cell[1][4] has content 'b'LACK OF BENEFIT''  
    ...Cell[1][5] has content 'b'alectinib, brigatinib''  
    ...Cell[1][6] has content 'b'Level 1''  
    ...Cell[2][0] has content 'b'ALK''  
    ...Cell[2][1] has content 'b'Seq''  
    ...Cell[2][2] has content 'b'RNA-Tumor''  
    ...Cell[2][3] has content 'b'Fusion Not Detected''  
    ...Cell[2][5] has content 'b'ceritinib''  
    ...Cell[2][6] has content 'b'Level 1''  
    ...Cell[3][1] has content 'b'''  
    ...Cell[3][2] has content 'b'''  
    ...Cell[3][3] has content 'b'''  
    ...Cell[3][5] has content 'b'crizotinib''  
    ...Cell[3][6] has content 'b'Level 1''  
Table # 1 has 3 rows and 4 columns  
...Cell[0][0] has content 'b'''  
...Cell[0][1] has content 'b'''  
...Cell[0][2] has content 'b'''  
...Cell[0][3] has content 'b'''  
...Cell[1][0] has content 'b'NTRK1/2/3''  
...Cell[1][1] has content 'b'Seq''  
...Cell[1][2] has content 'b'RNA-Tumor''  
...Cell[1][3] has content 'b'Fusion Not Detected''  
...Cell[2][0] has content 'b'Tumor Mutational Burden''  
...Cell[2][1] has content 'b'Seq''  
...Cell[2][2] has content 'b'DNA-Tumor''  
...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''  

Accepted answer
  1. romungi-MSFT 32,531 Reputation points Microsoft Employee
    2022-11-08T08:45:38.673+00:00

    I think you should be able to use the to_dict() method of each DocumentTable in the result and load that into a dataframe.

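    import pandas as pd

    table_dfs = {}  # one DataFrame per table, keyed by table index
    for table_idx, table in enumerate(result.tables):
        print(
            "Table # {} has {} rows and {} columns".format(
                table_idx, table.row_count, table.column_count
            )
        )
        # to_dict() nests the per-cell records under the "cells" key,
        # so build the DataFrame from that list (one row per cell)
        table_dfs[table_idx] = pd.DataFrame(table.to_dict()["cells"])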
    

    I think the above should work.
