I found a way: essentially, use df.write to write to a staging table instead, and then use the post-action to insert into the actual destination. Something like:
# Write to a staging table; postActions then moves the rows into the real table.
df.write \
    .format("com.databricks.spark.sqldw") \
    .mode("overwrite") \
    .option("url", "jdbc:sqlserver://host:1433;database=db;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;") \
    .option("dbTable", "dummy_table_staging") \
    .option("tempDir", "wasbs://container@storageaccount.blob.core.windows.net/tempdir") \
    .option("user", "jdbcUser") \
    .option("password", "jdbcPassword") \
    .option("postActions", "INSERT INTO dummy_table (other_column) SELECT * FROM dummy_table_staging; DROP TABLE IF EXISTS dummy_table_staging;") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .save()
The key is that in the post-action we control the column list, i.e. we can insert only the columns that are not IDENTITY.
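If the staging table has exactly the DataFrame's columns, the post-action SQL can be built from df.columns, so the IDENTITY column (which is not in the DataFrame) never appears in the insert list. A minimal sketch, reusing the placeholder table names from the example above:

# Build the post-action from the DataFrame's columns; the IDENTITY column
# is not in the DataFrame, so it never ends up in the insert list.
# "dummy_table" and "dummy_table_staging" are the placeholder names used above.
column_list = ", ".join(df.columns)
post_actions = (
    f"INSERT INTO dummy_table ({column_list}) "
    f"SELECT {column_list} FROM dummy_table_staging; "
    "DROP TABLE IF EXISTS dummy_table_staging;"
)
# ...then pass it via .option("postActions", post_actions)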
Things to note:
- mode is overwrite for the staging table, so it is always dropped and recreated
- the post-action also cleans up the staging table (feels cleaner to me to do that)
- the INSERT INTO in the post-action means we always append to the destination table (see the sketch below if you need overwrite semantics on the destination)
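If you do want the destination overwritten rather than appended to, one option is to empty it in the same post-action before inserting. This is just a sketch, assuming the JDBC user is allowed to TRUNCATE the destination and reusing the placeholder names above; everything else in the write stays the same, only the postActions string changes:

# Overwrite variant: empty the destination before copying the staged rows.
# Assumes the JDBC user has TRUNCATE permission on dummy_table (placeholder name).
post_actions = (
    "TRUNCATE TABLE dummy_table; "
    "INSERT INTO dummy_table (other_column) SELECT * FROM dummy_table_staging; "
    "DROP TABLE IF EXISTS dummy_table_staging;"
)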