How to read all the files from a ADLS folder and load data in dataframe uisng pyspark in Synapse?

Devender 61 Reputation points
2022-11-25T19:37:04.067+00:00

I am Looking for an solution where i want to read all the files from a folder Name as **'COUNTRIES DETAIL' containing another folder 'YEAR' which is there in a Container in ADLS GEN2 name 'DETAILS'. . Currently i have 200 files in my sub folder 'YEAR'. the format of files are 'YYYY_DETAILS_GERMANY_GERMAN_.csv' , 'YYYY_DETAILS_INDIA_GOOD_.csv', 'YYYY_DETAILS_ENGLAND_PRODUCTS_.csv'.

I am looking for a approach through which i can read all the files from the Folder and use that file only which is required for transformations. The Scenario i want to create is once i read the files from the folder, i want to pick only that file which is required for operations.

HOW WE SHOULD PICK THE FILE OR WHICH FILE TO PICK??

I am reading an file already from another folder and stored that data in dataframe df . The dataframe have a column called 'COUNTRY_NAME'. There will be common value in 'COUNTRY_NAME' column in the dataframe. Suppose we have value as 'INDIAGOOD' in COUNTRY_NAME Column so i need to pick 'YYYY_DETAILS_INDIA_GOOD_.csv' file data and load into my another Dataframe . simlilarly for other cases i need to implement this same scenario. if the value in COUNTRY_NAME column matches with the filename present in adls i want to load that file data into dataframe and do operations. There will always be common value in COUNTRY_NAME Column which have some matching Text with the file that is there in 'YEAR FOLDER'

SAMPLE DATAFRAME 1

+-----------+
COUNTRY_NAME|
+-----------+
INDIAGOOD
INDIAGOOD
INDIAGOOD
INDIAGOOD
FILENAMES IN ADLS

+---------------------------------+
FILE_NAMES |
+---------------------------------+
YYYY_DETAILS_INDIA_GOOD_.csv |
YYYY_DETAILS_ENGLAND_PRODUCTS_.csv|
YYYY_DETAILS_GERMANY_GERMAN_.csv |

As we have value 'INDIAGOOD' in my column value 'COUNTRY_NAME' i will pick file YYYY_DETAILS_INDIA_GOOD_.csv from the adls because its substring matching with filename in adls and load it into my dataframe df2

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,675 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,073 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. Evgeny Grishchenko 486 Reputation points
    2022-11-26T11:50:43.287+00:00
    1. Pyspark allows you to read many files recursively and defining filter, which is applied to file names. Check here: https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-global-filter
    2. your requirements would need some programming. I would suggest (if I get it correct):
      • transform your SAMPLE DATAFRAME 1 into array and loop thru it
      • based on value in the column country_name load the the appropriate csv file(-s) into a dataframe
      • do something about it
    1 person found this answer helpful.