@Alok Thampi - Thanks for the question and using MS Q&A platform.
When joining large tables in Databricks, there are a few things you can do to optimize performance:
Partitioning: Make sure that both tables are partitioned on the join key. This will ensure that the data is co-located on the same worker nodes, which can significantly improve performance. You can use the PARTITION BY
clause when creating the Delta tables to partition on the join key.
Z-Ordering: Z-Ordering is a technique that can be used to optimize queries that filter or join on specific columns. It reorders the data in each partition based on the values of one or more columns, which can improve query performance by reducing the amount of data that needs to be read. You have already optimized the tables using Z-Ordering, which is a good step.
Cluster configuration: Make sure that your cluster is properly configured for the size of your data and the complexity of your queries. You can try increasing the number of worker nodes or using a more powerful instance type to improve performance.
Caching: If you are running the same query multiple times, you can cache the tables in memory to improve performance. This will reduce the amount of data that needs to be read from disk each time the query is run.
Reduce data size: If possible, try to reduce the size of the data by filtering out unnecessary columns or rows before joining the tables. This can significantly reduce the amount of data that needs to be processed.
In your case, since the PurchaseOrder table contains approximately 2 Billion records, you may want to consider using a distributed join strategy such as broadcast join or shuffle join. Broadcast join is useful when one of the tables is small enough to fit in memory, while shuffle join is useful when both tables are large. You can use the broadcast
hint to force a broadcast join, or the shuffle
hint to force a shuffle join.
Here's an example of how to use the broadcast
hint:
SELECT /*+ BROADCAST(POL) */
<applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO
And here's an example of how to use the shuffle
hint:
SELECT /*+ SHUFFLE */
<applicable
For more details, refer to When to partition tables on Azure Databricks and Spark Optimization : Reducing Shuffle
Disclaimer: This response contains a reference to a third-party World Wide Web site. Microsoft is providing this information as a convenience to you. Microsoft does not control these sites and has not tested any software or information found on these sites; therefore, Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. There are inherent dangers in the use of any software found on the Internet, and Microsoft cautions you to make sure that you completely understand the risk before retrieving any software from the Internet.
Hope this helps. Do let us know if you have any further queries.
If this answers your query, do click Accept Answer
and Yes
for was this answer helpful. And, if you have any further query do let us know.