Performance Issue with UC Read from Federated SQL Table vs JDBC Read from SQL Server

Simonas Solovjovas 0 Reputation points
2024-08-29T11:31:44.7033333+00:00

Hi everyone,

I'm currently facing a significant performance issue when comparing the execution times of a query sent through JDBC versus a similar query executed through Databricks SQL (using Unity Catalog to access a federated SQL table).

JDBC Query:
jdbc_query = f"""
SELECT TOP 1 *
FROM db.schema.table
WHERE id = (
SELECT TOP 1 id
FROM db.schema.table2 )
AND model_id = {model_id}"""

*Execution Time: ~2 seconds
*
Databricks SQL Query (UC):
Since Databricks SQL does not support TOP, I used LIMIT:
uc_query = f"""
SELECT *
FROM db.schema.table
WHERE id = (
SELECT id FROM db.schema.table2
LIMIT 1 )
AND model_id = {model_id}
LIMIT 1
"""
*Execution Time: 6-7 minutes
*
Additional Observations: When I load and display each individual table (without applying any filters or subqueries), the time difference between JDBC and Databricks SQL is only 1-2 seconds.

The Question: Given the significant time difference when running the combined query via Databricks SQL compared to JDBC, I'm trying to understand where these 6-7 minutes are lost.

Is this related to the conversion process from Databricks SQL to SQL Server SQL? Could it be that the subquery or the overall optimization differs between how Databricks SQL and JDBC handle these queries? Any insights, similar experiences, or suggestions on how to improve the performance of the Databricks SQL query would be greatly appreciated!

Thanks in advance!

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,175 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Bhargava-MSFT 31,011 Reputation points Microsoft Employee
    2024-08-29T23:29:37.68+00:00

    Hello Simonas Solovjovas ,

    Based on the information provided, it is difficult to determine the exact cause of the performance difference between the JDBC query and the Databricks SQL query. However, there are a few potential factors that could be contributing to the slower execution time of the Databricks SQL query:

    • Databricks SQL may not be optimizing the subquery as efficiently as JDBC. You can try to analyze the execution plans for both queries to identify any differences
    • It seems like in Databricks SQL query, the subquery might be causing a full table scan, leading to longer execution times. Consider rewriting the query to optimize the subquery execution.
    • Another potential factor is the transfer of data between Databricks and SQL Server. Depending on the size of the data being transferred, this could be a significant bottleneck in the query execution.

    Please check these and let me know for any additional questions.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.