Convert dataframe column datatypes according to target table column datatypes - Databricks

Ashish Sinha 161 Reputation points
2020-11-17T11:39:40.01+00:00

Hi All,

I am currently loading the csv file, all with string datatypes columns to dataframe df. I want to write this data into an existing table which has different datatypes. So basically in csv file I have a column CreatedDate as string and my staging table has it as Datetime.
When i try to load it, it fails with the error:

Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

This is my piece of code wherein I try to load mutliple csv files into a corresponding existing table. I truncate the table and append the csv data.

names = ["ISSUES","PROJECTS","RISKS","TASKS","ASSIGNMENT_BASELINES","PROJECT_BASELINES","TASK_BASELINES","RESOURCES","TIMESET","ASSIGNMENTS"]

#loop to table names and process the csv files to Synpase Database
for name in names:
  for file in files:
    if name in file:
      print("File name:- {}".format(file))
      df = spark.read.csv(wasbs_path + "/"+file, header = 'true')
      df = df.withColumn("LOAD_DATE",current_timestamp())
      df.schema
      df.write \
        .format("com.databricks.spark.sqldw") \
        .option("url", dwUrl) \
        .option("forwardSparkAzureStorageCredentials", "true") \
        .option("dbTable", "dbo.zz_"+name+"_raw_delete") \
        .option("tempDir", tempDir) \
        .option("truncate","true") \
        .option("maxStrLength", "4000" ) \
        .mode("append") \
        .save()

Could anyone please help on how can i handle this? If there is another way to do this? Any suggestions would be appreciated

Thanks,
Ashish.

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,162 questions
0 comments No comments
{count} votes

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 88,716 Reputation points Microsoft Employee
    2020-11-18T09:21:22.967+00:00

    Hello @Ashish Sinha ,

    Welcome to the Microsoft Q&A platform.

    Make sure the data frame column data types must match with the column data types in the target table.

    Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that doesn’t match the table’s schema. To determine whether a write to a table is compatible, Delta Lake uses the following rules to determine whether a write from a DataFrame to a table is compatible:

    • All Dataframe Columns must exist in the target table. If there is a column in Dataframe that is not present in the target table, an exception will raise. Columns present in the target table but not in the Dataframe are set to null.
    • Dataframe column data types must match the column data types in the target table. If they don’t match, an exception is raised.
    • Dataframe column names cannot differ only by case. This means that we cannot have columns such as ‘Foo’ and ‘foo’ defined in the same table. While Spark can be used in case sensitive or insensitive (default) mode, Delta Lake is case-preserving but insensitive when storing the schema.

    For more details, refer to Table batch reads and writes – Schema validation.

    Hope this helps. Do let us know if you any further queries.

    ------------

    • Please accept an answer if correct. Original posters help the community find answers faster by identifying the correct answer. Here is how.
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification.
    1 person found this answer helpful.
    0 comments No comments

1 additional answer

Sort by: Most helpful
  1. Ashish Sinha 161 Reputation points
    2020-11-18T10:51:13.837+00:00

    Hi @PRADEEPCHEEKATLA-MSFT

    Thanks for such clear explanation. Really appreciate it.
    As my existing table has a different schema than the file I am trying to load. I have used the below code to convert them:

    df = df.withColumn(col,to_timestamp(df[col], 'yyyy-MM-dd HH:mm:ss'))  
    
    1 person found this answer helpful.

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.