PySpark notebook: add file name and folder to a DataFrame

Debbie Edwards 521 Reputation points
2023-11-22T10:46:26.32+00:00
%%pyspark
df = spark.read.load(
    'abfss://rawdata@***************.dfs.core.windows.net/2021-2022/file.csv',
    format='csv',
    # Remove the next line if the file has no header row
    header=True,
)
display(df.limit(10))

This displays the data I want to use in my process. Is there any way I can also add the file name and folder as columns to this df so I can use them later?

Azure Synapse Analytics

1 answer

  1. ShaikMaheer-MSFT 38,326 Reputation points Microsoft Employee
    2023-11-24T06:01:22.2533333+00:00

    Hi Debbie Edwards,

    Thank you for posting your query on the Microsoft Q&A platform.

    The withColumn function adds extra columns to a DataFrame. In this case it can be used to add the file and folder details as columns. Note that withColumn returns a new DataFrame rather than changing the existing one, so the result has to be assigned back.

    Please check the sample code below.

    %%pyspark
    from pyspark.sql.functions import lit
    
    df = spark.read.load(
        'abfss://rawdata@***************.dfs.core.windows.net/2021-2022/file.csv',
        format='csv',
        # Remove the next line if the file has no header row
        header=True,
    )
    
    # withColumn returns a new DataFrame, so assign the result back to df
    df = df.withColumn('folder', lit('myFolder'))
    df = df.withColumn('file', lit('myFile'))
    
    display(df.limit(10))
    
    
    

    You can check the video below to learn more about the withColumn function.

    withColumn() in PySpark

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

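
Following up on the sample in the answer: instead of hard-coding 'myFolder' and 'myFile', the two values could be derived from the path that was passed to spark.read.load. This is only a minimal sketch, assuming a single file is read from a known path. The helper names split_folder_and_file and add_source_columns are hypothetical (they are not part of PySpark), and the storage account name "example" is a placeholder.

```python
import posixpath
from urllib.parse import urlparse


def split_folder_and_file(path):
    """Split a storage URI into (folder, file_name) using only its path part."""
    # urlparse puts 'container@account...' into netloc, so .path is '/2021-2022/file.csv'
    folder, file_name = posixpath.split(urlparse(path).path)
    return folder.strip("/"), file_name


def add_source_columns(df, path):
    """Add constant 'folder' and 'file' columns derived from the path that was read."""
    # Imported here so the helper above can be used without Spark installed
    from pyspark.sql.functions import lit

    folder, file_name = split_folder_and_file(path)
    # withColumn returns a new DataFrame each time, so the calls are chained
    return df.withColumn("folder", lit(folder)).withColumn("file", lit(file_name))


if __name__ == "__main__":
    # 'example' is a placeholder account name, not the one from the question
    sample = "abfss://rawdata@example.dfs.core.windows.net/2021-2022/file.csv"
    print(split_folder_and_file(sample))  # ('2021-2022', 'file.csv')
```

In the notebook this would be called as df = add_source_columns(df, path), with the same path string that was given to spark.read.load. If a whole folder or a wildcard is read instead of one file, pyspark.sql.functions.input_file_name() returns the full source path for each row and can be passed to withColumn in the same way.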