question

Tania-6701 asked ShaikMaheer-MSFT commented

COPY INTO with column list through Databricks Synapse Connector

I have a Databricks job writing to Synapse that I'm migrating to use PolyBase so that the writes are more performant.

One of the tables the job is writing to has an IDENTITY column.

Imagine that the table has the following DDL:

 CREATE TABLE dbo.dummy_table (
         generated_id int IDENTITY(1,1) NOT NULL,
         other_column VARCHAR(128) NOT NULL
     );

I do not want to write to this column; I want to let Synapse generate the identity. My Databricks job writes a dataframe that contains only "other_column".

The PolyBase write, however, fails with "An explicit value for the identity column in table 'dummy_table' can only be specified when a column list is used and IDENTITY_INSERT is ON.".

The write uses the Synapse connector like this:

 df.write \
     .format("com.databricks.spark.sqldw") \
     .mode("append") \
     .option("url", "jdbc:sqlserver://host:1433;database=db_name;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;") \
     .option("dbTable", "dummy_table") \
     .option("tempDir", "wasbs://container@storageaccount.blob.core.windows.net/tempdir") \
     .option("user", "jdbcuser") \
     .option("password", "jdbcPassword") \
     .option("truncate", "true") \
     .option("forwardSparkAzureStorageCredentials", "true") \
     .save()

On closer inspection, I realized that the COPY INTO statement the connector executes against Synapse does have an option to specify a column list (https://docs.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest#column_list), and that option works:

 COPY INTO "dummy_table" (other_column 2)
 FROM 'wasbs://container@storageaccount.blob.core.windows.net/tempdir/2022-03-29/xxxx/yyyy/'
 WITH (
     FILE_TYPE = 'PARQUET',
     CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = ''),
     COMPRESSION = 'Snappy',
     MAXERRORS = 0
 )
 OPTION (LABEL = 'Databricks Batch Load; Container Build c03ea50: "dummy_id"');
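For reference, the `2` after `other_column` appears to be the field number from the COPY INTO column_list syntax, mapping the target column to the second field of the source file. The connector does not expose an option to inject such a list, but as a hedged sketch, a statement like the one above could be assembled by hand; the `build_copy_into` helper below is hypothetical and not part of any connector or Synapse API:

```python
def build_copy_into(table, column_fields, source_uri, secret):
    """Assemble a COPY INTO statement with an explicit column list.

    column_fields: list of (column_name, field_number) pairs; each field
    number maps a target column to a field in the source Parquet file,
    so IDENTITY columns can simply be left out of the list.
    Illustrative only -- the Databricks Synapse connector generates its
    own statement and offers no hook to supply a column list.
    """
    cols = ", ".join(f"{name} {field}" for name, field in column_fields)
    return (
        f'COPY INTO "{table}" ({cols})\n'
        f"FROM '{source_uri}'\n"
        "WITH (FILE_TYPE = 'PARQUET',\n"
        f"  CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '{secret}'),\n"
        "  COMPRESSION = 'Snappy', MAXERRORS = 0)"
    )

print(build_copy_into(
    "dummy_table",
    [("other_column", 2)],
    "wasbs://container@storageaccount.blob.core.windows.net/tempdir/",
    "***",
))
```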

Questions:

  1. How can I do this with the Databricks Synapse connector?

  2. If it is not possible, how else can I use Polybase/COPY to write to a table that has an IDENTITY column from Databricks?


Tags: azure-synapse-analytics, azure-databricks

1 Answer

Tania-6701 answered ShaikMaheer-MSFT commented

I found a way: essentially, use df.write to write to a staging table instead, then use a post-action to insert into the actual destination. Something like:


 df.write \
     .format("com.databricks.spark.sqldw") \
     .mode("overwrite") \
     .option("url", "jdbc:sqlserver://host:1433;database=db;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;") \
     .option("dbTable", "dummy_table_staging") \
     .option("tempDir", "wasbs://container@storageaccount.blob.core.windows.net/tempdir") \
     .option("user", "jdbcUser") \
     .option("password", "jdbcPassword") \
     .option("postActions", "INSERT INTO dummy_table (other_column) SELECT * FROM dummy_table_staging; DROP TABLE IF EXISTS dummy_table_staging;") \
     .option("forwardSparkAzureStorageCredentials", "true") \
     .save()

The key is that in the post-action we have control over the column list, i.e. we can insert only the columns that are not IDENTITY.

Things to note:
- the mode is overwrite for the staging table, so we always drop and recreate it
- we also clean up the staging table in the post-action (that feels cleaner to me)
- the INSERT INTO in the post-action means we always append to the destination table
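As a hedged sketch, the postActions string above could be assembled with a small helper; the `staging_post_actions` function below is hypothetical convenience code, not part of the connector API. It also names the columns in the SELECT rather than using SELECT *, which is a slightly safer variant if the staging table's column order ever differs:

```python
def staging_post_actions(target, staging, columns):
    """Build a postActions SQL string for the Databricks Synapse connector:
    insert only the listed (non-IDENTITY) columns from the staging table
    into the target, then drop the staging table.

    Table and column names here come from the question; the helper itself
    is an illustrative sketch, not an official API.
    """
    col_list = ", ".join(columns)
    return (
        f"INSERT INTO {target} ({col_list}) "
        f"SELECT {col_list} FROM {staging}; "
        f"DROP TABLE IF EXISTS {staging};"
    )

print(staging_post_actions("dummy_table", "dummy_table_staging", ["other_column"]))
```

The resulting string would be passed as the value of `.option("postActions", ...)` in the write above.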



Hi @Tania-6701,

Awesome job. Thanks for sharing your resolution.
