Joining huge delta tables in Databricks

Question

Alok Thampi 151

Hello,

I am trying to join a few Delta tables using the query below.

select <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO

The PurchaseOrder table contains approximately 2 billion records and the EKBE table contains ~500 million records. The last join (LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO) causes a huge performance hit and the query runs seemingly forever. Both tables contain duplicate EBELN and PO_NO values, which multiplies the matched rows and makes the join even heavier.

I have run OPTIMIZE with ZORDER on both tables based on the joining keys, as below, but it still doesn't seem to help.

EKBETable : OPTIMIZE EKBETable ZORDER BY (BELNR)

PurchaseOrder : OPTIMIZE PurchaseOrder ZORDER BY (PO_NO)

What would be the best way to optimize this join? I am using the cluster configuration below.

[Image: cluster configuration]

Accepted answer

Answer 1

PRADEEPCHEEKATLA 90,641 Moderator

@Alok Thampi - Thanks for the question and for using the MS Q&A platform.

When joining large tables in Databricks, there are a few things you can do to optimize performance:

Partitioning: Consider partitioning both tables on a column that is used in the join or in filters. Partitioning lets Spark skip files that cannot match. It does not by itself place matching rows on the same worker nodes, and it only pays off for low-cardinality columns. You can use the PARTITIONED BY clause when creating the Delta tables.

Z-Ordering: Z-ordering can help queries that filter or join on specific columns. It co-locates related values of the chosen columns in the same files, so data skipping can reduce the amount of data that needs to be read. You have already Z-ordered the tables, which is a good step. Note, though, that EKBETable is Z-ordered by BELNR while the slow join uses EBELN.

Cluster configuration: Make sure that your cluster is properly configured for the size of your data and the complexity of your queries. You can try increasing the number of worker nodes or using a more powerful instance type to improve performance.

Caching: If you are running the same query multiple times, you can cache the tables in memory to improve performance. This will reduce the amount of data that needs to be read from disk each time the query is run.

Reduce data size: If possible, try to reduce the size of the data by filtering out unnecessary columns or rows before joining the tables. This can significantly reduce the amount of data that needs to be processed.
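Since the question mentions duplicate EBELN and PO_NO values on both sides, it may also be worth estimating how many rows the join will produce before running it. The sketch below is plain Python, not Spark, and only illustrates the arithmetic: every left row is repeated once per matching right row, and an unmatched left row still yields one row. The function name and the sample keys are hypothetical. On the real tables, the per-key counts would come from a GROUP BY on each join column.

```python
from collections import Counter


def estimate_left_join_rows(left_keys, right_keys):
    """Return the number of rows a LEFT JOIN on these keys would produce.

    Each left row matches every right row with the same key. A left row
    with no match, or with a NULL (None) key, still yields one output row.
    """
    right_counts = Counter(k for k in right_keys if k is not None)
    total = 0
    for key in left_keys:
        matches = right_counts.get(key, 0) if key is not None else 0
        total += max(matches, 1)
    return total


# Hypothetical sample: EBELN values from EKBETable, PO_NO values from PurchaseOrder
ekbe_ebeln = ["A", "A", "B", "C", None]
po_no = ["A", "A", "A", "B", "D"]

print(estimate_left_join_rows(ekbe_ebeln, po_no))  # prints 9
```

Here key "A" alone contributes 2 x 3 = 6 rows. If a few keys repeat thousands of times on both sides, the output can be far larger than either input, which would explain a join that never seems to finish.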

In your case, since the PurchaseOrder table contains approximately 2 billion records, the join strategy matters. A broadcast join is useful when one of the tables is small enough to fit in memory, while a shuffle join (sort-merge or shuffle hash) is what Spark uses when both tables are large. You can use the BROADCAST hint to request a broadcast join, or the MERGE or SHUFFLE_HASH hint to request a specific shuffle join.

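Here's an example of the broadcast hint syntax. It is shown on POL only to illustrate where the hint goes; at roughly 2 billion rows, PurchaseOrder is far too large to broadcast in practice: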

SELECT /*+ BROADCAST(POL) */
  <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO

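And here's an example of how to use a shuffle hint. Spark's hint names are SHUFFLE_HASH and MERGE; there is no plain SHUFFLE hint:

SELECT /*+ SHUFFLE_HASH(POL) */
  <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO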

For more details, refer to "When to partition tables on Azure Databricks" and "Spark Optimization: Reducing Shuffle".

Hope this helps. Do let us know if you have any further queries.

Alok Thampi 151 Reputation points

2024-10-08T11:46:21.49+00:00
Hello Pradeep,

Thanks for your response on this. A few more points below.

Partitioning unfortunately won't work because the joining keys are high-cardinality columns; that was the reason Z-ordering was introduced.

Caching might not be needed either, as I am running the join just once. Do you think caching would still help?

The data size cannot be reduced either, as we need to do a full load.

A broadcast join would not work either, as both tables are huge.

I will try changing the cluster and introducing the shuffle hint.

What would be the ideal cluster size to be used in this scenario?

Is there a way to identify the data size in GB after executing the query? Something like writing the output to a dataframe and looking for its size?

Thanks again for your support!

Thanks,

Alok
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2024-10-09T03:26:13.1933333+00:00
@Alok Thampi - Thank you for the additional information.

Based on the information you have provided, it seems like you have already optimized the tables using ZORDER. However, there are a few more things you can try to optimize the join:

Broadcast the smaller table: If any table in the join is small enough to fit in executor memory, broadcasting it avoids shuffling the larger side across the network. With EKBETable at roughly 500 million records this is unlikely to apply to the slow join, and in a LEFT JOIN Spark can only broadcast the right-hand table, so check the size of ReportTable and the join order before relying on it.

Increase cluster size: If the above step does not help, you can try increasing the cluster size to improve the performance of the join.

Use Delta Lake's Auto Optimize feature: Auto Optimize (optimized writes and auto compaction) produces better-sized files when the tables are written, which can speed up the scan side of the query. You can enable optimized writes by setting the spark.databricks.delta.optimizeWrite.enabled configuration to true.

Use join hints: Spark SQL join hints let you request a join strategy for a specific table. The available hints include BROADCAST, MERGE, and SHUFFLE_HASH. For example, use the BROADCAST hint only for a table small enough to fit in memory, and the MERGE hint (sort-merge join) when both sides are large.

Cluster size: The ideal cluster size depends on various factors such as the size of the data, the complexity of the query, and the available resources. In general, you can start with a larger cluster size and then scale up or down based on the query performance. You can also monitor the cluster metrics such as CPU usage, memory usage, and network usage to determine the optimal cluster size.
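One knob that goes together with cluster size is the number of shuffle partitions (spark.sql.shuffle.partitions). A rough way to pick a starting value is sketched below in plain Python. The 128 MiB target per partition is a common rule of thumb and an assumption here, not something measured on your workload, and the function name is hypothetical. The shuffle size itself can be read from the Spark UI after a trial run.

```python
def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_partition_bytes=128 * 1024**2):
    """Suggest a starting value for spark.sql.shuffle.partitions.

    Divides the shuffled data into partitions of about
    target_partition_bytes (assumed: 128 MiB), then rounds up to a
    multiple of total_cores so each wave of tasks uses every core.
    """
    if shuffle_bytes <= 0 or total_cores <= 0 or target_partition_bytes <= 0:
        raise ValueError("all arguments must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes
    by_size = -(-shuffle_bytes // target_partition_bytes)
    waves = -(-by_size // total_cores)
    return waves * total_cores


# Hypothetical example: 1 TiB shuffled on a cluster with 64 cores in total
print(suggest_shuffle_partitions(1024**4, 64))  # prints 8192
```

The result would then be applied with spark.conf.set("spark.sql.shuffle.partitions", ...) before running the join. Treat it as a starting point and adjust based on the task durations and spill you see in the Spark UI.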

Data size: One way to measure the output size is to write the join result to a Delta table and read its size with DESCRIBE DETAIL, which reports sizeInBytes and numFiles. The SQL tab of the Spark UI also shows the data size of each scan and shuffle in the query. Here is an example code snippet (JoinResult is a placeholder table name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("QueryMetrics").getOrCreate()

# Execute the query. Replace * with an explicit column list if the tables
# share column names, because Delta rejects duplicate column names.
df = spark.sql("""
    SELECT *
    FROM ReportTable G
    LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
    LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO
""")

# Write the result to a Delta table (placeholder name)
df.write.format("delta").mode("overwrite").saveAsTable("JoinResult")

# Read the on-disk (compressed) size of the table
detail = spark.sql("DESCRIBE DETAIL JoinResult").first()
print("Data size: {:.2f} GB".format(detail["sizeInBytes"] / (1024 * 1024 * 1024)))

This code snippet writes the join result to a Delta table and prints its on-disk size in GB.

I hope this helps. Let me know if you have any further questions.
