Azure Databricks JDBC write to Azure SQL results in enormous audit logs
We are using Azure Databricks to write data to an Azure SQL database. Last week we switched from runtime 9.1 to the newer 14.3; however, when we write data, it appears that Spark JDBC now creates an "insert into" statement for each row, which results in large DB overhead (especially for large tables), and the audit log we have turned on in the Azure SQL database grows enormously.
For example, when we insert 10k rows / 3 columns (id int, id2 varchar(2), id3 varchar(10)), it creates 10k insert statements, which turns out to be approx. 8 MB of audit log files on blob storage.
It also appears that setting batchsize has no effect on the insert process at all.
When we write large tables (30M rows / 17 columns) we must cache and repartition the DataFrame, otherwise the write often fails with a Spark INTERNAL_ERROR (the Spark logs show timeouts), most likely because the DB cannot handle that many single-row inserts; a sketch of this workaround is below.
Is there a way for the Spark JDBC write to do something like a bulk insert? Why is there a batchsize option on write if it has no effect?
Details:
- tested runtimes: 14.3 and 15.4
- tested JDBC configurations: we tried all available drivers and different option combinations:
- native JDBC as described at https://learn.microsoft.com/en-us/azure/databricks/connect/external-systems/jdbc, or com.microsoft.sqlserver.jdbc.SQLServerDriver with the mssql-jdbc-12.8.1.jre8.jar library installed on the cluster from Maven (https://mvnrepository.com/artifact/com.microsoft.sqlserver/mssql-jdbc/12.8.1.jre8):
```python
if SPARK_DB_FORMAT in ("jdbc", "com.microsoft.sqlserver.jdbc.SQLServerDriver"):
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .option("url", connString)
        # .option("driver", SPARK_DB_FORMAT)
        .mode("append")
        .option("dbtable", tableName)
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save()
    )
```
- sqlserver as described at https://learn.microsoft.com/en-us/azure/databricks/connect/external-systems/sql-server:
```python
if SPARK_DB_FORMAT == "sqlserver":
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .mode("append")
        .option("host", "llll.database.windows.net")
        .option("port", "1433")  # optional, default port 1433 is used if omitted
        .option("user", "YYY")
        .option("password", "ZZZK")
        .option("database", "YYY")
        .option("dbtable", tableName)  # "schemaName.tableName"
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save()
    )
```

Once we insert data, I check how many times the statement was executed in the SQL database via sys.dm_exec_query_stats: for every 10k rows written, it shows execution_count 10k for the query INSERT INTO sandbox.testbulkwrite ("id","id2","id3") VALUES (@P0,@P1,@P2), i.e. one execution per row.
- if we have SQL auditing turned on, this results in huge logs: after 13 Spark writes (130k rows inserted), sys.dm_exec_query_stats shows 130k executions, i.e. 130k individual INSERT INTO statements in the audit log.