COPY INTO with column list through Databricks Synapse Connector

Tania
2022-03-29T18:17:03.713+00:00

I have a Databricks job writing to Synapse that I'm migrating to use Polybase so that the writes are more performant.

One of the tables the job is writing to has an IDENTITY column.

Imagine that the table has the following DDL:

CREATE TABLE dbo.dummy_table (
    generated_id int IDENTITY(1,1) NOT NULL,
    other_column VARCHAR(128) NOT NULL
);

I do not want to write to this column - I want to let Synapse generate the identity. My Databricks job writes a dataframe that contains only "other_column".
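
For illustration, the dataframe being written looks roughly like this (a minimal sketch; the literal values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only the non-identity column is present; Synapse is expected to generate generated_id.
df = spark.createDataFrame([("foo",), ("bar",)], ["other_column"])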

The Polybase write, however, fails with "An explicit value for the identity column in table 'dummy_table' can only be specified when a column list is used and IDENTITY_INSERT is ON.".

The write goes like this, using the Synapse connector:

df.write \
    .format("com.databricks.spark.sqldw") \
    .mode("append") \
    .option("url", "jdbc:sqlserver://host:1433;database=db_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;") \
    .option("dbTable", "dummy_table") \
    .option("tempDir", "wasbs://container@storageaccount.blob.core.windows.net/tempdir") \
    .option("user", "jdbcuser") \
    .option("password", "jdbcPassword") \
    .option("truncate", "true") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .save()

Looking more closely, I realized that the COPY INTO query the connector executes against Synapse actually has an option to specify a column list (https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest#column_list). That option works:

COPY INTO "dummy_table" (other_column 2)  
  FROM 'wasbs://container@storageaccount.blob.core.windows.net/tempdir/2022-03-29/xxxx/yyyy/'  
  WITH  
  (    FILE_TYPE = 'PARQUET',  
    CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '')  
    ,COMPRESSION = 'Snappy',  
    MAXERRORS = 0)  
  OPTION (LABEL = 'Databricks Batch Load; Container Build c03ea50: "dummy_id"');  

Questions:

  1. How can I do this with the Databricks Synapse connector?
  2. If it is not possible, how else can I use Polybase/COPY to write to a table that has an IDENTITY column from Databricks?

1 answer

  1. Tania
    2022-03-29T23:04:32.653+00:00

    I found a way - essentially write to a staging table with df.write instead, and then use a post-action to insert into the actual destination. Something like:

    df.write \
        .format("com.databricks.spark.sqldw") \
        .mode("overwrite") \
        .option("url", "jdbc:sqlserver://host:1433;database=db;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;") \
        .option("dbTable", "dummy_table_staging") \
        .option("tempDir", "wasbs://container@storageaccount.blob.core.windows.net/tempdir") \
        .option("user", "jdbcUser") \
        .option("password", "jdbcPassword") \
        .option("postActions", "INSERT INTO dummy_table (other_column) SELECT * FROM dummy_table_staging; DROP TABLE IF EXISTS dummy_table_staging;") \
        .option("forwardSparkAzureStorageCredentials", "true") \
        .save()
    

    The key is that in the postActions, we have control over the column list, i.e. we can insert just the columns that are not IDENTITY, as sketched below.
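
    For example, the postActions string can be assembled with an explicit column list on both the INSERT and the SELECT (a minimal sketch using the placeholder names from this thread):

    # Hedged sketch: build the post-action SQL, leaving the IDENTITY column out
    # so that Synapse generates it on insert.
    staging_table = "dummy_table_staging"
    target_table = "dummy_table"
    non_identity_columns = ["other_column"]

    column_list = ", ".join(non_identity_columns)
    post_actions = (
        f"INSERT INTO {target_table} ({column_list}) "
        f"SELECT {column_list} FROM {staging_table}; "
        f"DROP TABLE IF EXISTS {staging_table};"
    )
    # then pass it to the writer: .option("postActions", post_actions)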

    Things to note:

    • mode is overwrite for the staging table - so we always drop and recreate
    • we clean up the staging table in the post-action as well if needed (feels cleaner to me to do that)
    • the INSERT INTO in the post-action essentially means we always append to the destination table