
PySpark notebook: add file name and folder to a DataFrame

Debbie Edwards 526 Reputation points
2023-11-22T10:46:26.32+00:00
%%pyspark
df = spark.read.load('abfss://rawdata@***************.dfs.core.windows.net/2021-2022/file.csv', format='csv'
## If header exists uncomment line below
, header=True
)
display(df.limit(10))

This displays the data I want to use in my process. Is there any way I can also add the file name and folder as columns to this df so I can use them later?

Azure Synapse Analytics

An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.


1 answer

  1. ShaikMaheer-MSFT 38,636 Reputation points Microsoft Employee Moderator
    2023-11-24T06:01:22.2533333+00:00

    Hi Debbie Edwards,

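    Thank you for posting your query on the Microsoft Q&A platform.

    We can use the withColumn function to add extra columns to a DataFrame. In this case, withColumn can add the file and folder details. Note that withColumn returns a new DataFrame rather than modifying the existing one, so the result must be assigned to a variable.

    Please check the sample code below.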

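    %%pyspark
    from pyspark.sql.functions import lit

    df = spark.read.load('abfss://rawdata@***************.dfs.core.windows.net/2021-2022/file.csv', format='csv'
    # header=True treats the first row as column names; remove it if the file has no header
    , header=True
    )

    # withColumn returns a new DataFrame, so assign the result back to df
    df = df.withColumn('folder', lit('myFolder'))
    df = df.withColumn('file', lit('myFile'))

    display(df.limit(10))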
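
The values 'myFolder' and 'myFile' in the sample are placeholders typed by hand. As a minimal sketch, assuming the path passed to spark.read.load is available as a string, the folder and file name could be derived from that path instead. The helper names `folder_and_file`, `add_source_columns`, and `add_source_columns_per_row` are hypothetical and not part of PySpark; the Spark functions they call (`lit`, `input_file_name`, `split`, `element_at`) come from `pyspark.sql.functions`.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlparse


def folder_and_file(path: str) -> Tuple[str, str]:
    """Split a storage URI into (folder, file name). Hypothetical helper."""
    # urlparse separates 'container@account' (the netloc) from the object path
    object_path = urlparse(path).path
    folder, file_name = posixpath.split(object_path)
    return folder.lstrip('/'), file_name


def add_source_columns(df, path: str):
    """Add constant 'folder' and 'file' columns derived from one known path."""
    # Imported lazily so folder_and_file can be used without Spark installed
    from pyspark.sql.functions import lit
    folder, file_name = folder_and_file(path)
    return df.withColumn('folder', lit(folder)).withColumn('file', lit(file_name))


def add_source_columns_per_row(df):
    """Add 'folder' and 'file' columns from the source path Spark reports per row."""
    from pyspark.sql.functions import element_at, input_file_name, split
    # input_file_name() yields the full path of the file each row was read from
    parts = split(input_file_name(), '/')
    return (df.withColumn('file', element_at(parts, -1))
              .withColumn('folder', element_at(parts, -2)))


# Pure-Python check of the path handling, using the path from the question
path = 'abfss://rawdata@***************.dfs.core.windows.net/2021-2022/file.csv'
folder, file_name = folder_and_file(path)
print(folder, file_name)
```

In the notebook, `df = add_source_columns(df, path)` would follow the read. `add_source_columns_per_row(df)` is the variant to consider when a wildcard or folder path loads several files, since each row then carries its own source path; as written it keeps only the immediate parent folder.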
    
    
    

    You can watch the video below to learn more about the withColumn function.

    withColumn() in PySpark

    Hope this helps. Please let me know if you have any further queries.

    Please consider clicking the Accept Answer button. Accepted answers help the community as well.

    1 person found this answer helpful.
