Azure Databricks JDBC write to Azure SQL results in enormous audit logs
We are using Azure Databricks to write data to an Azure SQL database. Last week we switched from runtime 9.1 to the newer 14.3; however, when we write data, it appears that Spark JDBC now creates an "insert into" statement for each row, which results in large DB overhead (especially for large tables), and the audit log we have turned on in the Azure SQL database grows enormously.
For example, when we insert 10k rows / 3 columns (id int, id2 varchar(2), id3 varchar(10)), it creates 10k insert statements, which turns out to be approx. 8 MB of audit log files on blob storage.
It also appears that setting batchsize has no effect on the insert process at all.
When we write large tables (30M rows / 17 columns) we must cache and repartition the DataFrame, otherwise the write often fails with a Spark INTERNAL_ERROR (the Spark logs show timeouts), most likely because the DB cannot handle that many single-row inserts; a sketch of this workaround is below.
Is there a way for the Spark JDBC write to do something like a bulk insert? Why is there a batchsize option on write if it has no effect?
Details:
- tested runtimes: 14.3 and 15.4
- tested JDBC configurations: we tried all available drivers and different option combinations:
- native JDBC as described at https://learn.microsoft.com/en-us/azure/databricks/connect/external-systems/jdbc, or com.microsoft.sqlserver.jdbc.SQLServerDriver with the mssql-jdbc-12.8.1.jre8.jar library installed on the cluster from Maven (https://mvnrepository.com/artifact/com.microsoft.sqlserver/mssql-jdbc/12.8.1.jre8):
```python
if SPARK_DB_FORMAT in ("jdbc", "com.microsoft.sqlserver.jdbc.SQLServerDriver"):
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .option("url", connString)
        # .option("driver", SPARK_DB_FORMAT)
        .mode("append")
        .option("dbtable", tableName)
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save()
    )
```
- sqlserver as described at https://learn.microsoft.com/en-us/azure/databricks/connect/external-systems/sql-server:
```python
if SPARK_DB_FORMAT == "sqlserver":
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .mode("append")
        .option("host", "llll.database.windows.net")
        .option("port", "1433")  # optional, default port 1433 is used if omitted
        .option("user", "YYY")
        .option("password", "ZZZK")
        .option("database", "YYY")
        .option("dbtable", tableName)  # "schemaName.tableName"
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save()
    )
```

Once we insert data, I check how many times the statement was executed in the SQL database via sys.dm_exec_query_stats: for every 10k rows written, it shows execution_count 10k for the query INSERT INTO sandbox.testbulkwrite ("id","id2","id3") VALUES (@P0,@P1,@P2), i.e. one execution per row.
- if we have SQL auditing turned on, this results in huge logs: after 13 Spark writes (130k rows inserted), sys.dm_exec_query_stats shows 130k executions, i.e. 130k individual INSERT INTO statements in the audit log.