partition by column

arkiboys 9,706 Reputation points
2022-10-30T19:01:37.947+00:00

Hello,
This is how I am writing a dataframe to a single Parquet file, which works fine:
df.write.parquet(f"/mnt/storage/{container_name}{folder_path_mounted}", mode='overwrite')

Now I would like to partition by yearNo, monthNo, and dayNo, but I get this error:
AssertionError: col should be Column (raised on the _year line below)
Any suggestions? Thank you.

import datetime
currentDateTime = datetime.datetime.now()

yearNo = currentDateTime.year
monthNo = f"{currentDateTime.month:02}"
dayNo = f"{currentDateTime.day:02}"

df_final = df.withColumn('ingestion_date', current_timestamp())\
.withColumn('_year', yearNo) \
.withColumn('_month', monthNo) \
.withColumn('_day', dayNo) \

df.write.parquet(f"/mnt/storage/{container_name}{folder_path_mounted}/data", mode='overwrite').partitionBy('_year', '_month', '_day')

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Answer accepted by question author
  1. ShaikMaheer-MSFT 38,631 Reputation points Microsoft Employee Moderator
    2022-11-01T15:17:17.747+00:00

    Hi @arkiboys ,

    Thank you for posting your query on the Microsoft Q&A platform.

    withColumn() takes the column name as a string for its first argument and a Column object for its second. In your code, yearNo, monthNo, and dayNo are plain Python values (an int and two strings), not Column objects, which is why you see the error.

    Consider wrapping each value with the lit() function, as shown below.

    from pyspark.sql.functions import lit
    import datetime

    currentDateTime = datetime.datetime.now()

    yearNo = currentDateTime.year
    monthNo = f"{currentDateTime.month:02}"  # zero-padded, e.g. "03"
    dayNo = f"{currentDateTime.day:02}"      # zero-padded, e.g. "07"

    # Sample data; `spark` is the SparkSession that Databricks notebooks provide.
    data = [(1, 'Maheer', '3000'), (2, 'Wafa', '4000')]
    schema = ['id', 'name', 'salary']
    df = spark.createDataFrame(data, schema)

    # lit() wraps each plain Python value in a Column, as withColumn() requires.
    df1 = (
        df.withColumn("year", lit(yearNo))
          .withColumn("month", lit(monthNo))
          .withColumn("day", lit(dayNo))
    )

    df1.show()
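
The write call in the question likely needs a change as well: it writes df rather than df_final, and it calls partitionBy() on the result of parquet(), which returns None. Below is a minimal sketch of how the two steps could fit together, assuming the column names _year, _month, and _day from the question. The helper names partition_values and write_partitioned are hypothetical and only serve the illustration.

```python
import datetime


def partition_values(now: datetime.datetime) -> dict:
    """Return the partition values, zero-padding month and day as in the question."""
    return {
        "_year": now.year,
        "_month": f"{now.month:02}",
        "_day": f"{now.day:02}",
    }


def write_partitioned(df, path: str, now: datetime.datetime) -> None:
    """Add the partition columns to df and write it as partitioned Parquet."""
    # Imported here so partition_values() stays usable without Spark installed.
    from pyspark.sql.functions import current_timestamp, lit

    df_final = df.withColumn("ingestion_date", current_timestamp())
    for name, value in partition_values(now).items():
        # lit() turns the plain Python value into a Column for withColumn().
        df_final = df_final.withColumn(name, lit(value))

    # partitionBy() configures the writer, so it has to come before parquet(),
    # which performs the write and returns None.
    (
        df_final.write
        .partitionBy("_year", "_month", "_day")
        .mode("overwrite")
        .parquet(path)
    )
```

With the question's variables, the call would look like write_partitioned(df, f"/mnt/storage/{container_name}{folder_path_mounted}/data", datetime.datetime.now()). Spark then writes one subdirectory per partition value under that path, in the form _year=.../_month=.../_day=...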
    

    Hope this helps. Please let me know if you have any further queries.
