Convert dataframe column datatypes according to target table column datatypes - Databricks

Ashish Sinha 161 Reputation points
2020-11-17T11:39:40.01+00:00

Hi All,

I am currently loading CSV files, with all columns as string data types, into a dataframe df. I want to write this data into an existing table that has different data types. For example, the CSV file has a column CreatedDate as a string, while my staging table has it as Datetime.
When I try to load it, it fails with the error:

Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

This is the piece of code in which I try to load multiple CSV files into their corresponding existing tables. I truncate each table and append the CSV data.

from pyspark.sql.functions import current_timestamp

names = ["ISSUES","PROJECTS","RISKS","TASKS","ASSIGNMENT_BASELINES","PROJECT_BASELINES","TASK_BASELINES","RESOURCES","TIMESET","ASSIGNMENTS"]

# loop over the table names and load the matching csv files into the Synapse database
for name in names:
  for file in files:
    if name in file:
      print("File name:- {}".format(file))
      df = spark.read.csv(wasbs_path + "/"+file, header = 'true')
      df = df.withColumn("LOAD_DATE",current_timestamp())
      df.printSchema()  # all columns are inferred as string when no schema is supplied
      df.write \
        .format("com.databricks.spark.sqldw") \
        .option("url", dwUrl) \
        .option("forwardSparkAzureStorageCredentials", "true") \
        .option("dbTable", "dbo.zz_"+name+"_raw_delete") \
        .option("tempDir", tempDir) \
        .option("truncate","true") \
        .option("maxStrLength", "4000" ) \
        .mode("append") \
        .save()
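
Alternatively, would it be better to define the schema up front when reading the CSV instead of letting every column come in as a string? Something along these lines (just a rough sketch; the column names and types are placeholders, not my real staging schema):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# rough sketch - placeholder columns; the real files have many more
csv_schema = StructType([
    StructField("PROJECT_ID", StringType(), True),
    StructField("CreatedDate", TimestampType(), True)
])

df = spark.read.csv(wasbs_path + "/" + file, header=True, schema=csv_schema)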

Could anyone please help with how I can handle this, or suggest another way to do it? Any suggestions would be appreciated.

Thanks,
Ashish.


Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 87,331 Reputation points Microsoft Employee
    2020-11-18T09:21:22.967+00:00

    Hello @Ashish Sinha ,

    Welcome to the Microsoft Q&A platform.

    Make sure the DataFrame column data types match the column data types in the target table.

    Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. Delta Lake uses the following rules to determine whether a write from a DataFrame to a table is compatible:

    • All DataFrame columns must exist in the target table. If there is a column in the DataFrame that is not present in the target table, an exception is raised. Columns present in the target table but not in the DataFrame are set to null.
    • DataFrame column data types must match the column data types in the target table. If they don’t match, an exception is raised (see the sketch after this list).
    • DataFrame column names cannot differ only by case. This means that we cannot have columns such as ‘Foo’ and ‘foo’ defined in the same table. While Spark can be used in case-sensitive or case-insensitive (default) mode, Delta Lake is case-preserving but case-insensitive when storing the schema.
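
    As a rough sketch of how the data types can be aligned before the write, you could cast the mismatched columns to the target types. The target_types mapping below is only an example; substitute the actual column names and types from your staging table:

    from pyspark.sql.functions import col

    # example mapping of column name -> target data type (placeholders, not your real schema)
    target_types = {"CreatedDate": "timestamp", "LOAD_DATE": "timestamp"}

    for c, t in target_types.items():
        df = df.withColumn(c, col(c).cast(t))

    df.printSchema()  # verify the DataFrame types now line up with the target table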

    For more details, refer to Table batch reads and writes – Schema validation.

    Hope this helps. Do let us know if you have any further queries.


1 additional answer

  1. Ashish Sinha 161 Reputation points
    2020-11-18T10:51:13.837+00:00

    Hi @PRADEEPCHEEKATLA-MSFT

    Thanks for such a clear explanation. Really appreciate it.
    Since my existing table has a different schema than the file I am trying to load, I have used the code below to convert the columns:

    from pyspark.sql.functions import to_timestamp
    df = df.withColumn(col, to_timestamp(df[col], 'yyyy-MM-dd HH:mm:ss'))  # col holds the name of the date column
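
    For tables that have several date columns, the same conversion can be wrapped in a small loop. This is only an illustrative sketch; the column names in date_cols are placeholders, not my actual schema:

    from pyspark.sql.functions import to_timestamp

    # placeholder list - the actual date columns differ per table
    date_cols = ["CreatedDate", "ModifiedDate"]

    for c in date_cols:
        df = df.withColumn(c, to_timestamp(df[c], 'yyyy-MM-dd HH:mm:ss'))

    df.printSchema()  # confirm the columns are now timestamps before writing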
    