- Pyspark allows you to read many files recursively and defining filter, which is applied to file names. Check here: https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-global-filter
- your requirements would need some programming. I would suggest (if I get it correct):
- transform your SAMPLE DATAFRAME 1 into array and loop thru it
- based on value in the column country_name load the the appropriate csv file(-s) into a dataframe
- do something about it
How to read all the files from a ADLS folder and load data in dataframe uisng pyspark in Synapse?
I am Looking for an solution where i want to read all the files from a folder Name as **'COUNTRIES DETAIL' containing another folder 'YEAR' which is there in a Container in ADLS GEN2 name 'DETAILS'. . Currently i have 200 files in my sub folder 'YEAR'. the format of files are 'YYYY_DETAILS_GERMANY_GERMAN_.csv' , 'YYYY_DETAILS_INDIA_GOOD_.csv', 'YYYY_DETAILS_ENGLAND_PRODUCTS_.csv'.
I am looking for a approach through which i can read all the files from the Folder and use that file only which is required for transformations. The Scenario i want to create is once i read the files from the folder, i want to pick only that file which is required for operations.
HOW WE SHOULD PICK THE FILE OR WHICH FILE TO PICK??
I am reading an file already from another folder and stored that data in dataframe df . The dataframe have a column called 'COUNTRY_NAME'. There will be common value in 'COUNTRY_NAME' column in the dataframe. Suppose we have value as 'INDIAGOOD' in COUNTRY_NAME Column so i need to pick 'YYYY_DETAILS_INDIA_GOOD_.csv' file data and load into my another Dataframe . simlilarly for other cases i need to implement this same scenario. if the value in COUNTRY_NAME column matches with the filename present in adls i want to load that file data into dataframe and do operations. There will always be common value in COUNTRY_NAME Column which have some matching Text with the file that is there in 'YEAR FOLDER'
SAMPLE DATAFRAME 1
+-----------+
COUNTRY_NAME|
+-----------+
INDIAGOOD
INDIAGOOD
INDIAGOOD
INDIAGOOD
FILENAMES IN ADLS+---------------------------------+
FILE_NAMES |
+---------------------------------+
YYYY_DETAILS_INDIA_GOOD_.csv |
YYYY_DETAILS_ENGLAND_PRODUCTS_.csv|
YYYY_DETAILS_GERMANY_GERMAN_.csv |
As we have value 'INDIAGOOD' in my column value 'COUNTRY_NAME' i will pick file YYYY_DETAILS_INDIA_GOOD_.csv from the adls because its substring matching with filename in adls and load it into my dataframe df2
1 answer
Sort by: Most helpful
-
Evgeny Grishchenko 486 Reputation points
2022-11-26T11:50:43.287+00:00