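A broadcast hash join is a join strategy used in distributed computing environments such as Spark. It is particularly useful when one of the datasets in the join is significantly smaller than the other: the small dataset is copied to every executor, so the large dataset does not have to be shuffled across the network. Here is an example: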
# Example DataFrames
largeDF = spark.read.format("...").load("...")
smallDF = spark.read.format("...").load("...")
Then broadcast the small DataFrame and perform the join:
from pyspark.sql.functions import broadcast
# Perform broadcast hash join
joinedDF = largeDF.join(broadcast(smallDF), largeDF["key"] == smallDF["key"])
Here smallDF is broadcast to all executors, and the join is performed on the column "key".
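To make the mechanics concrete, here is a minimal pure-Python sketch of the idea behind a broadcast hash join. It is not Spark's implementation: the function `broadcast_hash_join`, the list-of-lists "partitions" and the sample rows are all hypothetical and only illustrate the build and probe phases of an inner join.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Illustrative inner join: hash the small side once, probe it per partition."""
    # Build phase: index the small dataset by join key.
    # In Spark, this hash table is what gets shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: each partition of the large dataset is scanned locally
    # against the shared hash table, so no shuffle of the large side is needed.
    joined = []
    for partition in large_partitions:
        for left in partition:
            for right in lookup.get(left[key], []):
                joined.append({**right, **left})
    return joined


# Hypothetical data: the large side is split into two partitions.
large_partitions = [
    [{"key": 1, "amount": 10}, {"key": 2, "amount": 20}],
    [{"key": 1, "amount": 30}, {"key": 3, "amount": 40}],
]
small_rows = [{"key": 1, "name": "a"}, {"key": 2, "name": "b"}]

result = broadcast_hash_join(large_partitions, small_rows, "key")
for row in result:
    print(row)
```

The row with key 3 has no match in the small dataset, so it is dropped, as in an inner join.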
Because Spark evaluates lazily, the join is only executed when an action is called on joinedDF, such as joinedDF.show() or joinedDF.write.save("...").
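If the small dataset is not small enough for a broadcast hash join, or if the data needs to be reshuffled, Spark's optimizer will typically handle this. Without an explicit hint, it broadcasts a table automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default) and otherwise falls back to a shuffle-based join such as a sort-merge join. Keep in mind that broadcast() overrides this threshold, so applying it to a dataset that is too large can exhaust memory. You can also manually repartition your DataFrame on the join key if you need more control over how the data is distributed.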