Load Multiple Parquet files from ADLS account to Dataframe with Filename

WZIAFP 157 Reputation points
2022-09-15T09:56:41.027+00:00

I have a SQL table which contains a list of files stored in ADLS Gen2 storage that I need to load into a dataframe.
The location of the files changes for each file as for date partition example of 5 files below:
241377-image.png

I need to load these files to a dataframe with the filename appended then deduplicate and load the result into a sql table.

import pandas as pd  
import pyodbc  
import pyspark  
  
server = 'server.sql.azuresynapse.net'   
database = 'db'   
username = 'sqladminuser'   
password = 'pwd'   
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)  
  
#Load Files to dataframe  
df = pd.read_sql ('Select path from dbo.SourceFiles',cnxn)  
df2 = pd.DataFrame(df)  
  
#Iterate through files  
for index, row in df2.iterrows():  
    data_path = spark.read.load(row["path"], format='parquet', header=True)  
    data_path.show(10)  
    print('Converting to Pandas.')  
    pdf = data_path.toPandas()  
    pdf = pdf.assign(Filename = row["path"])  
pdf.drop_duplicates()  
print(pdf)  

Problem I have with my code is that the final pdf dataframe only returns the last file that was processed. Is it possible to use this method to append each iteration?

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,395 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. WZIAFP 157 Reputation points
    2022-09-15T14:22:30.29+00:00

    Below is the solution. using pandas.concat function

    FinalDf = pd.DataFrame()  
    for index, row in df2.iterrows():  
        data_path = spark.read.load(row["path"], format='parquet', header=True)  
        #data_path.show(10)  
        print('Converting to Pandas.')  
        pdf = data_path.toPandas()  
        pdf = pdf.assign(Filename = row["path"])  
        FinalDf = pd.concat([FinalDf,pdf])  
    
    1 person found this answer helpful.