Hi, I am reading a PDF file using Form Recognizer and storing the response in a "result" variable. Following the sample syntax given in the Form Recognizer documentation for Azure Databricks/PySpark, my output is printed cell by cell, as shown below. Instead, I need to load the output into dataframes, with each table in its own dataframe. Please suggest the syntax for this. Thanks in advance.
for table_idx, table in enumerate(result.tables):
    print(
        "Table # {} has {} rows and {} columns".format(
            table_idx, table.row_count, table.column_count
        )
    )
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index,
                cell.column_index,
                cell.content.encode("utf-8"),
            )
        )
Output
Table # 0 has 3 rows and 7 columns
...Cell[0][0] has content 'b'BIOMARKER''
...Cell[0][1] has content 'b'METHOD|''
...Cell[0][2] has content 'b'ANALYTE''
...Cell[0][3] has content 'b'RESULT''
...Cell[0][4] has content 'b'THERAPY ASSOCIATION''
...Cell[0][6] has content 'b'BIOMARKER LEVELE''
...Cell[1][0] has content 'b'''
...Cell[1][1] has content 'b'IHC''
...Cell[1][2] has content 'b'Protein''
...Cell[1][3] has content 'b'Negative | 0''
...Cell[1][4] has content 'b'LACK OF BENEFIT''
...Cell[1][5] has content 'b'alectinib, brigatinib''
...Cell[1][6] has content 'b'Level 1''
...Cell[2][0] has content 'b'ALK''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'RNA-Tumor''
...Cell[2][3] has content 'b'Fusion Not Detected''
...Cell[2][5] has content 'b'ceritinib''
...Cell[2][6] has content 'b'Level 1''
...Cell[3][1] has content 'b'''
...Cell[3][2] has content 'b'''
...Cell[3][3] has content 'b'''
...Cell[3][5] has content 'b'crizotinib''
...Cell[3][6] has content 'b'Level 1''
Table # 1 has 3 rows and 4 columns
...Cell[0][0] has content 'b'''
...Cell[0][1] has content 'b'''
...Cell[0][2] has content 'b'''
...Cell[0][3] has content 'b'''
...Cell[1][0] has content 'b'NTRK1/2/3''
...Cell[1][1] has content 'b'Seq''
...Cell[1][2] has content 'b'RNA-Tumor''
...Cell[1][3] has content 'b'Fusion Not Detected''
...Cell[2][0] has content 'b'Tumor Mutational Burden''
...Cell[2][1] has content 'b'Seq''
...Cell[2][2] has content 'b'DNA-Tumor''
...Cell[2][3] has content 'b'High | 19 Mutations/ Mb''
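
For reference, here is a minimal sketch of the kind of result I am after, not a confirmed solution. It assumes pandas is available on the cluster and relies only on the attributes already used in the loop above (result.tables, table.row_count, table.column_count, table.cells, cell.row_index, cell.column_index, cell.content). The helper names table_to_grid and tables_to_dataframes, and the demo_table stand-in, are hypothetical and made up for illustration.

```python
from types import SimpleNamespace


def table_to_grid(table):
    """Arrange a table's cells into a list of rows, filling missing cells with ''."""
    cells = list(table.cells)
    # Size the grid from the cell indexes as well as the reported counts:
    # the output above shows row index 3 in a table reported as having 3 rows.
    n_rows = max([table.row_count] + [c.row_index + 1 for c in cells])
    n_cols = max([table.column_count] + [c.column_index + 1 for c in cells])
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Keep the content as str; calling .encode("utf-8") is what produces b'...'.
        grid[c.row_index][c.column_index] = c.content
    return grid


def tables_to_dataframes(result):
    """Return one pandas DataFrame per table in the analysis result."""
    import pandas as pd  # assumption: pandas is installed where this runs

    return [pd.DataFrame(table_to_grid(t)) for t in result.tables]


# Hypothetical stand-in for one Form Recognizer table, so the sketch can be
# tried without calling the service.
demo_table = SimpleNamespace(
    row_count=2,
    column_count=3,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="BIOMARKER"),
        SimpleNamespace(row_index=0, column_index=1, content="METHOD"),
        SimpleNamespace(row_index=1, column_index=0, content="ALK"),
        SimpleNamespace(row_index=1, column_index=2, content="RNA-Tumor"),
    ],
)
demo_grid = table_to_grid(demo_table)
```

With a real response this would be called as dfs = tables_to_dataframes(result), giving dfs[0] for the first table, dfs[1] for the second, and so on. The columns are numbered 0, 1, 2, ... because the header row is kept as ordinary data. If a Spark dataframe is needed in Databricks, spark.createDataFrame(dfs[0]) should accept the pandas dataframe, though I have not verified that on this data.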