Range join optimization
A range join occurs when two relations are joined using a point in interval or interval overlap condition. The range join optimization support in Databricks Runtime can bring orders of magnitude improvement in query performance, but requires careful manual tuning.
Databricks recommends using join hints for range joins when performance is poor.
Point in interval range join
A point in interval range join is a join in which the condition contains predicates specifying that a value from one relation is between two values from the other relation. For example:
-- using BETWEEN expressions
SELECT *
FROM points JOIN ranges ON points.p BETWEEN ranges.start and ranges.end;
-- using inequality expressions
SELECT *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;
-- with fixed length interval
SELECT *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.start + 100;
-- join two sets of point values within a fixed distance from each other
SELECT *
FROM points1 p1 JOIN points2 p2 ON p1.p >= p2.p - 10 AND p1.p <= p2.p + 10;
-- a range condition together with other join conditions
SELECT *
FROM points, ranges
WHERE points.symbol = ranges.symbol
AND points.p >= ranges.start
AND points.p < ranges.end;
Interval overlap range join
An interval overlap range join is a join in which the condition contains predicates specifying an overlap of intervals between two values from each relation. For example:
-- overlap of [r1.start, r1.end] with [r2.start, r2.end]
SELECT *
FROM r1 JOIN r2 ON r1.start < r2.end AND r2.start < r1.end;
-- overlap of fixed length intervals
SELECT *
FROM r1 JOIN r2 ON r1.start < r2.start + 100 AND r2.start < r1.start + 100;
-- a range condition together with other join conditions
SELECT *
FROM r1 JOIN r2 ON r1.symbol = r2.symbol
AND r1.start <= r2.end
AND r1.end >= r2.start;
Range join optimization
The range join optimization is performed for joins that:
- Have a condition that can be interpreted as a point in interval or interval overlap range join.
- All values involved in the range join condition are of a numeric type (integral, floating point, decimal),
DATE
, orTIMESTAMP
. - All values involved in the range join condition are of the same type. In the case of the decimal type, the values also need to be of the same scale and precision.
- It is an
INNER JOIN
, or in case of point in interval range join, aLEFT OUTER JOIN
with point value on the left side, orRIGHT OUTER JOIN
with point value on the right side. - Have a bin size tuning parameter.
Bin size
The bin size is a numeric tuning parameter that splits the values domain of the range condition into multiple bins of equal size. For example, with a bin size of 10, the optimization splits the domain into bins that are intervals of length 10.
If you have a point in range condition of p BETWEEN start AND end
, and start
is 8 and end
is 22, this value interval overlaps with three bins of length 10 – the first bin from 0 to 10, the second bin from 10 to 20, and the third bin from 20 to 30. Only the points that fall within the same three bins need to be considered as possible join matches for that interval. For example, if p
is 32, it can be ruled out as falling between start
of 8 and end
of 22, because it falls in the bin from 30 to 40.
Note
- For
DATE
values, the value of the bin size is interpreted as days. For example, a bin size value of 7 represents a week. - For
TIMESTAMP
values, the value of the bin size is interpreted as seconds. If a sub-second value is required, fractional values can be used. For example, a bin size value of 60 represents a minute, and a bin size value of 0.1 represents 100 milliseconds.
You can specify the bin size either by using a range join hint in the query or by setting a session configuration parameter. The range join optimization is applied only if you manually specify the bin size. Section Choose the bin size describes how to choose an optimal bin size.
Enable range join using a range join hint
To enable the range join optimization in a SQL query, you can use a range join hint to specify the bin size. The hint must contain the relation name of one of the joined relations and the numeric bin size parameter. The relation name can be a table, a view, or a subquery.
SELECT /*+ RANGE_JOIN(points, 10) */ *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;
SELECT /*+ RANGE_JOIN(r1, 0.1) */ *
FROM (SELECT * FROM ranges WHERE ranges.amount < 100) r1, ranges r2
WHERE r1.start < r2.start + 100 AND r2.start < r1.start + 100;
SELECT /*+ RANGE_JOIN(c, 500) */ *
FROM a
JOIN b ON (a.b_key = b.id)
JOIN c ON (a.ts BETWEEN c.start_time AND c.end_time)
Note
In the third example, you must place the hint on c
.
This is because joins are left associative, so the query is interpreted as (a JOIN b) JOIN c
,
and the hint on a
applies to the join of a
with b
and not the join with c
.
#create minute table
minutes = spark.createDataFrame(
[(0, 60), (60, 120)],
"minute_start: int, minute_end: int"
)
#create events table
events = spark.createDataFrame(
[(12, 33), (0, 120), (33, 72), (65, 178)],
"event_start: int, event_end: int"
)
#Range_Join with "hint" on the from table
(events.hint("range_join", 60)
.join(minutes,
on=[events.event_start < minutes.minute_end,
minutes.minute_start < events.event_end])
.orderBy(events.event_start,
events.event_end,
minutes.minute_start)
.show()
)
#Range_Join with "hint" on the join table
(events.join(minutes.hint("range_join", 60),
on=[events.event_start < minutes.minute_end,
minutes.minute_start < events.event_end])
.orderBy(events.event_start,
events.event_end,
minutes.minute_start)
.show()
)
You can also place a range join hint on one of the joined DataFrames. In that case, the hint contains just the numeric bin size parameter.
val df1 = spark.table("ranges").as("left")
val df2 = spark.table("ranges").as("right")
val joined = df1.hint("range_join", 10)
.join(df2, $"left.type" === $"right.type" &&
$"left.end" > $"right.start" &&
$"left.start" < $"right.end")
val joined2 = df1
.join(df2.hint("range_join", 0.5), $"left.type" === $"right.type" &&
$"left.end" > $"right.start" &&
$"left.start" < $"right.end")
Enable range join using session configuration
If you don’t want to modify the query, you can specify the bin size as a configuration parameter.
SET spark.databricks.optimizer.rangeJoin.binSize=5
This configuration parameter applies to any join with a range condition. However, a different bin size set through a range join hint always overrides the one set through the parameter.
Choose the bin size
The effectiveness of the range join optimization depends on choosing the appropriate bin size.
A small bin size results in a larger number of bins, which helps in filtering the potential matches.
However, it becomes inefficient if the bin size is significantly smaller than the encountered value intervals, and the value intervals overlap multiple bin intervals. For example, with a condition p BETWEEN start AND end
, where start
is 1,000,000 and end
is 1,999,999, and a bin size of 10, the value interval overlaps with 100,000 bins.
If the length of the interval is fairly uniform and known, we recommend that you set the bin size to the typical expected length of the value interval. However, if the length of the interval is varying and skewed, a balance must be found to set a bin size that filters the short intervals efficiently, while preventing the long intervals from overlapping too many bins. Assuming a table ranges
, with intervals that are between columns start
and end
, you can determine different percentiles of the skewed interval length value with the following query:
SELECT APPROX_PERCENTILE(CAST(end - start AS DOUBLE), ARRAY(0.5, 0.9, 0.99, 0.999, 0.9999)) FROM ranges
A recommended setting of bin size would be the maximum of the value at the 90th percentile, or the value at the 99th percentile divided by 10, or the value at the 99.9th percentile divided by 100 and so on. The rationale is:
- If the value at the 90th percentile is the bin size, only 10% of the value interval lengths are longer than the bin interval, so span more than 2 adjacent bin intervals.
- If the value at the 99th percentile is the bin size, only 1% of the value interval lengths span more than 11 adjacent bin intervals.
- If the value at the 99.9th percentile is the bin size, only 0.1% of the value interval lengths span more than 101 adjacent bin intervals.
- The same can be repeated for the values at the 99.99th, the 99.999th percentile, and so on if needed.
The described method limits the amount of skewed long value intervals that overlap multiple bin intervals. The bin size value obtained this way is only a starting point for fine tuning; actual results may depend on the specific workload.