Joining huge delta tables in Databricks

Alok Thampi 151 Reputation points
2024-10-08T00:44:02.0833333+00:00

Hello,

I am trying to join few delta tables as per the code below.

select <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO

The PurchaseOrder table contains approximately 2 Billion records and the EKBE table contains ~500 million records. The last join (LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO) has a huge performance hit and the code keeps running for ever. There are duplicate EBELN and PO_NO values in both tables adding more heaviness to the join.

I have run the optimize / zorder on both the tables based on the joining keys as below but still it does't seem to work.

EKBETable : OPTIMIZE EKBETable ZORDER BY (BELNR)

PurchaseOrder : OPTIMIZE PurchaseOrder ZORDER BY (PO_NO)What would be the best way to optmize this join? I am using the below cluster configuration.

User's image

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,514 questions
0 comments No comments
{count} votes

Accepted answer
  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2024-10-08T08:12:48.7466667+00:00

    @Alok Thampi - Thanks for the question and using MS Q&A platform.

    When joining large tables in Databricks, there are a few things you can do to optimize performance:

    Partitioning: Make sure that both tables are partitioned on the join key. This will ensure that the data is co-located on the same worker nodes, which can significantly improve performance. You can use the PARTITION BY clause when creating the Delta tables to partition on the join key.

    Z-Ordering: Z-Ordering is a technique that can be used to optimize queries that filter or join on specific columns. It reorders the data in each partition based on the values of one or more columns, which can improve query performance by reducing the amount of data that needs to be read. You have already optimized the tables using Z-Ordering, which is a good step.

    Cluster configuration: Make sure that your cluster is properly configured for the size of your data and the complexity of your queries. You can try increasing the number of worker nodes or using a more powerful instance type to improve performance.

    Caching: If you are running the same query multiple times, you can cache the tables in memory to improve performance. This will reduce the amount of data that needs to be read from disk each time the query is run.

    Reduce data size: If possible, try to reduce the size of the data by filtering out unnecessary columns or rows before joining the tables. This can significantly reduce the amount of data that needs to be processed.

    In your case, since the PurchaseOrder table contains approximately 2 Billion records, you may want to consider using a distributed join strategy such as broadcast join or shuffle join. Broadcast join is useful when one of the tables is small enough to fit in memory, while shuffle join is useful when both tables are large. You can use the broadcast hint to force a broadcast join, or the shuffle hint to force a shuffle join.

    Here's an example of how to use the broadcast hint:

    SELECT /*+ BROADCAST(POL) */
      <applicable columns>
    FROM ReportTable G
    LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
    LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO
    
    
    

    And here's an example of how to use the shuffle hint:

    SELECT /*+ SHUFFLE */
      <applicable
    
    

    For more details, refer to When to partition tables on Azure Databricks and Spark Optimization : Reducing Shuffle

    Disclaimer: This response contains a reference to a third-party World Wide Web site. Microsoft is providing this information as a convenience to you. Microsoft does not control these sites and has not tested any software or information found on these sites; therefore, Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. There are inherent dangers in the use of any software found on the Internet, and Microsoft cautions you to make sure that you completely understand the risk before retrieving any software from the Internet.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.


0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.