How to convert `null` in Azure Data Factory Copy Activity with PolyBase

Fangzhou Zhang 231 Reputation points
2021-04-13T19:44:50.967+00:00

Hi, I'm trying to copy a Databricks Delta lake table to Synapse, using the Azure Data Factory Copy Activity with PolyBase.

The Delta lake table has two columns which are all nulls. Here are the sample data in the Delta lake table:
87463-screen.jpg

supplier_first_cost is defined as Double, logged_at as Timestamp.

But I got the conversion error when running Copy Activity with Poly in Azure Data Factory for these two null columns:

   ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=,Source=,''Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'supplier_first_cost' contains an invalid value ''. Cannot convert '' to type 'Double'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=Input string was not in a correct format.,Source=mscorlib,''Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'supplier_first_cost' contains an invalid value ''. Cannot convert '' to type 'Double'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=Input string was not in a correct format.,Source=mscorlib,'  

and

   ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'logged_at' contains an invalid value ''. Cannot convert '' to type 'DateTime'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'  

I can use Databricks fillna to replace null to 0 in supplier_first_cost, 1970-01-01 00:00:00 in logged_at for the Delta lake table, so the Copy Activity works without exceptions.

But is there a general solution to handle null in this PolyBase Copy Activity? For example, if I have hundreds of tables, each of them may have various data types for different columns, to replace all the null value for every column is not an efficient way for me.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,366 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
1,917 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,535 questions
{count} votes

Accepted answer
  1. Fangzhou Zhang 231 Reputation points
    2021-05-24T16:38:31.667+00:00

    Thanks for the help from ADF PG team, they provide us a quick fix for this:

    1. Open the JSON payload in the copy activity, and remove all the types except string in the “mappings”:
      99242-image.png
      99193-image.png
    2. Uncheck "Use type default" in the Sink tab:
      99202-screen-shot-2021-05-24-at-123500-pm.png

    In the Copy Activity Sink, uncheck the “Use type default”, thus Polybase will fill in “NULL” to the cell when the data is null from the source (Delta Lake). In terms of the target Synapse table, make sure the corresponding column is set to allow null value, otherwise Polybase will fail to insert “NULL” into that cell.

    2 people found this answer helpful.
    0 comments No comments

1 additional answer

Sort by: Most helpful
  1. Nicholas Moulton 1 Reputation point
    2021-05-24T16:43:29.627+00:00

    Thanks for sharing the workaround.

    I am not seeing how to make this work though in other copy types outside of polybase.

    0 comments No comments