Load Form Recognizer data into a dataframe

CzarR 296 Reputation points
2022-11-03T15:46:27.343+00:00

Hi, I am reading a PDF file using Form Recognizer and storing the response in a "result" object. Following the syntax given in the Form Recognizer documentation for Azure Databricks/PySpark, my output is printed as shown below. Instead, I need to load the output into dataframes, with each table in a separate dataframe. Please suggest the syntax. Thanks in advance.

for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )

    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )

Output

Table # 0 has 3 rows and 7 columns  
...Cell[0][0] has content 'b'BIOMARKER''
...Cell[0][1] has content 'b'METHOD|''
...Cell[0][2] has content 'b'ANALYTE''
...Cell[0][3] has content 'b'RESULT''
...Cell[0][4] has content 'b'THERAPY ASSOCIATION''
...Cell[0][6] has content 'b'BIOMARKER LEVELE''
...Cell[1][0] has content 'b'''
...Cell[1][1] has content 'b'IHC''
...Cell[1][2] has content 'b'Protein''
...Cell[1][3] has content 'b'Negative | 0''
...Cell[1][4] has content 'b'LACK OF BENEFIT''
...Cell[1][5] has content 'b'alectinib, brigatinib''
...Cell[1][6] has content 'b'Level 1''
...Cell[2][0] has content 'b'ALK''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'RNA-Tumor''
...Cell[2][3] has content 'b'Fusion Not Detected''
...Cell[2][5] has content 'b'ceritinib''
...Cell[2][6] has content 'b'Level 1''
...Cell[3][1] has content 'b'''
...Cell[3][2] has content 'b'''
...Cell[3][3] has content 'b'''
...Cell[3][5] has content 'b'crizotinib''
...Cell[3][6] has content 'b'Level 1''
Table # 1 has 3 rows and 4 columns  
...Cell[0][0] has content 'b'''  
...Cell[0][1] has content 'b'''  
...Cell[0][2] has content 'b'''  
...Cell[0][3] has content 'b'''  
...Cell[1][0] has content 'b'NTRK1/2/3''  
...Cell[1][1] has content 'b'Seq''  
...Cell[1][2] has content 'b'RNA-Tumor''  
...Cell[1][3] has content 'b'Fusion Not Detected''  
...Cell[2][0] has content 'b'Tumor Mutational Burden''  
...Cell[2][1] has content 'b'Seq''  
...Cell[2][2] has content 'b'DNA-Tumor''  
...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''  

Accepted answer
  1. romungi-MSFT 46,831 Reputation points Microsoft Employee
    2022-11-08T08:45:38.673+00:00

    I think you should be able to use the to_dict() method of each DocumentTable in the result and load its output into a dataframe.

    import pandas as pd

    tables_pd = {}  # one dataframe per table, keyed by table index
    for table_idx, table in enumerate(result.tables):
        print(
            "Table # {} has {} rows and {} columns".format(
                table_idx, table.row_count, table.column_count
            )
        )
        # to_dict()["cells"] is a list of per-cell dicts, so this gives one dataframe row per cell
        tables_pd[table_idx] = pd.DataFrame(table.to_dict()["cells"])
    

    I think the above should work.
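If a dataframe shaped like the original table is wanted (one dataframe row per table row), a minimal sketch is to place each cell's content at its row_index and column_index, the same attributes printed in the question. The helper name table_to_dataframe and the SimpleNamespace stand-ins are assumptions made for illustration so the sketch runs without an Azure resource; the sample cell text is copied from Table # 1 in the question. Row and column spans are ignored here, and the grid is sized from the largest index seen as well as the reported counts, because the output in the question shows cells at row index 3 in a table reported as having 3 rows.

```python
from types import SimpleNamespace

import pandas as pd


def table_to_dataframe(table):
    """Hypothetical helper: put each cell's content at [row_index, column_index]."""
    # Use the reported counts, but grow the grid if a cell index falls outside them.
    n_rows = max([table.row_count] + [c.row_index + 1 for c in table.cells])
    n_cols = max([table.column_count] + [c.column_index + 1 for c in table.cells])
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return pd.DataFrame(grid)


def _cell(row, col, text):
    # Stand-in exposing only the attributes the helper reads.
    return SimpleNamespace(row_index=row, column_index=col, content=text)


# Stand-in for one entry of result.tables, using cell text from the question.
sample_table = SimpleNamespace(
    row_count=3,
    column_count=4,
    cells=[
        _cell(1, 0, "NTRK1/2/3"), _cell(1, 1, "Seq"),
        _cell(1, 2, "RNA-Tumor"), _cell(1, 3, "Fusion Not Detected"),
        _cell(2, 0, "Tumor Mutational Burden"), _cell(2, 1, "Seq"),
        _cell(2, 2, "DNA-Tumor"), _cell(2, 3, "High | 19 Mutations/ Mb"),
    ],
)
tables = [sample_table]  # with a real response, use: tables = result.tables

# One dataframe per table, keyed by table index.
dataframes = {idx: table_to_dataframe(t) for idx, t in enumerate(tables)}
print(dataframes[0])
```

Cells that the service does not return stay as empty strings. If the first row of a table holds the headers, it could be promoted afterwards, for example with df.columns = df.iloc[0] followed by df = df.iloc[1:].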

    If an answer is helpful, please accept it or upvote it, which might help other community members reading this thread.

