Thursday, 04 November 2021

The Adaptive Join Threshold

Introduction

First introduced in SQL Server 2017 Enterprise Edition, an adaptive join enables a runtime transition from a batch mode hash join to row mode correlated nested loops (apply) indexed join at runtime. For brevity, I will refer to a ‘correlated nested loops indexed join’ as an apply throughout the rest of this article. If you need a refresher on the difference between nested loops and apply, please see my previous article.

Whether an adaptive join transitions from hash join to apply at runtime depends on a value labelled Adaptive Threshold Rows on the Adaptive Join execution plan operator. This article shows how an adaptive join works, includes details of the threshold calculation, and covers the implications of some of the design choices made.

One thing I want you to bear in mind throughout this piece is that an adaptive join always starts executing as a batch mode hash join. This is true even if the execution plan indicates the adaptive join expects to run as a row mode apply.

Like any hash join, an adaptive join reads all rows available on its build input and copies the required data into a hash table. The batch mode flavour of hash join stores these rows in an optimized format, and partitions them using one or more hash functions. Once the build input has been consumed, the hash table is fully populated and partitioned, ready for the hash join to start checking probe-side rows for matches.

This is the point where an adaptive join makes the decision to proceed with the batch mode hash join, or to transition to a row mode apply. If the number of rows in the hash table is less than the threshold value, the join switches to an apply; otherwise, the join continues as a hash join by starting to read rows from the probe input.

If a transition to an apply join occurs, the execution plan does not re-read the rows used to populate the hash table in order to drive the apply operation. Instead, an internal component known as an adaptive buffer reader expands the rows already stored in the hash table and makes them available on demand to the outer input of the apply operator. There is a cost associated with the adaptive buffer reader, but it is much lower than rewinding the build input completely would be.

Choosing an Adaptive Join

Query optimization involves one or more stages of logical exploration and physical implementation of alternatives. At each stage, when the optimizer explores the physical options for a logical join, it might consider both batch mode hash join and row mode apply alternatives.

If one of those physical join options forms part of the cheapest solution found during the current stage, and the other type of join can deliver the same required logical properties, the optimizer marks the logical join group as potentially suitable for an adaptive join. If not, consideration of an adaptive join ends here (and no adaptive join extended event is fired).

The normal operation of the optimizer means that the cheapest solution found will only include one of the physical join options — either hash or apply — whichever had the lowest estimated cost. The next thing the optimizer does is to build and cost a fresh implementation of the type of join that was not chosen as cheapest.

Since the current optimization phase has already ended with a cheapest solution found, a special single-group exploration and implementation round is performed for the adaptive join. Finally, the optimizer calculates the adaptive threshold.

If any of the preceding work is unsuccessful, the extended event adaptive_join_skipped is fired with a reason.

If the adaptive join processing is successful, a Concat operator is added to the internal plan above the hash and apply alternatives, together with the adaptive buffer reader, and any required batch/row mode adapters. Remember, only one of the join alternatives will execute at runtime, depending on the number of rows actually encountered compared with the adaptive threshold.

The Concat operator and individual hash/apply alternatives are not normally shown in the final execution plan. We are instead presented with a single Adaptive Join operator. This is a just a presentation decision—the Concat and joins are still present in the code run by the SQL Server execution engine. You can find more details around that in the Appendix and Related Reading sections at the end of this article.

The Adaptive Threshold

An apply is generally cheaper than a hash join for a smaller number of driving rows. The hash join has an extra start-up cost to build its hash table, but a lower per-row cost when it starts probing for matches.

There is generally a point where the estimated cost of an apply and hash join will be equal. This idea was nicely illustrated by Joe Sack in his article, Introducing Batch Mode Adaptive Joins:

Adaptive join threshold graph

Calculating the threshold

At this point, the optimizer has a single estimate for the number of rows entering the build input of the hash join and apply alternatives. It also has the estimated cost of the hash and apply operators as a whole.

This gives us a single point on the extreme right edge of the orange and blue lines in the diagram above. The optimizer needs another point of reference for each join type so it can ‘draw the lines’ and find the intersection (it doesn’t literally draw lines, but you get the idea).

To find a second point for the lines, the optimizer asks the two joins to produce a new cost estimate based on a different (and hypothetical) input cardinality. If the first cardinality estimate was more than one hundred rows, it asks the joins to estimate new costs for one row. If the original cardinality was less than or equal to one hundred rows, the second point is based on an input cardinality of ten thousand rows (so there’s a decent enough range to extrapolate).

In any case, the end result is two different costs and row counts for each join type, allowing the lines to be ‘drawn’.

The intersection formula

Finding the intersection of two lines based on two points for each line is a problem with several well-known solutions. SQL Server uses one based on determinants as described on Wikipedia:

Line intersection formula from two points

where:

The first line is defined by the points (x₁, y₁) and (x₂, y₂). The second line is given by the points (x₃, y₃) and (x₄, y₄). The intersection is at (P_x, P_y).

Our scheme has the number of rows on the x-axis and the estimated cost on the y-axis. We are interested in the number of rows where the lines intersect. This is given by the formula for P_x. If we wanted to know the estimated cost at the intersection, that would be P_y.

For P_x rows, the estimated costs of the apply and hash join solutions would be equal. This is the adaptive threshold we need.

A Worked Example

Using the AdventureWorks2017 sample database and the following indexing trick by Itzik Ben-Gan to get unconditional consideration of batch mode execution:

-- Itzik's trick
CREATE NONCLUSTERED COLUMNSTORE INDEX BatchMode
ON Sales.SalesOrderHeader (SalesOrderID)
WHERE SalesOrderID = -1
AND SalesOrderID = -2;

-- Test query
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123;

The execution plan shows an adaptive join with a threshold of 1502.07 rows:

Adaptive join plan with a threshold of 1502.07 rows

The estimated number of rows driving the adaptive join is 31,465.

Join costs

In this simplified case we can find estimated subtree costs for the hash and apply join alternatives using hints:

-- Hash
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (HASH JOIN, MAXDOP 1);

Hash join subtree cost 1.05083

-- Apply
SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (LOOP JOIN, MAXDOP 1);

Apply cost 10.0552

That gives us one point on the line for each join type:

31,465 rows
	Hash cost 1.05083
	Apply cost 10.0552

The second point on the line

Since the estimated number of rows is more than a hundred, the second reference points come from special internal estimates based on one join input row. Unfortunately, there is no easy way to obtain the exact cost numbers for this internal calculation (I will talk more about that shortly).

For now, I will just show you the cost numbers (using the full internal precision rather than the six significant figures presented in execution plans):

1 row (internal calculation)
	Hash cost 0.999027422729
	Apply cost 0.547927305023

31,465 rows
	Hash cost 1.05082787359
	Apply cost 10.0552890166

As expected, the apply join is cheaper than the hash for a small input cardinality, but much more expensive for the expected cardinality of 31,465 rows.

The intersection calculation

Plugging those cardinality and cost numbers into the line intersection formula gives:

-- Hash points (x = cardinality; y = cost)
DECLARE 
    @x1 float = 1, 
    @y1 float = 0.999027422729,
    @x2 float = 31465, 
    @y2 float = 1.05082787359;

-- Apply points (x = cardinality; y = cost)
DECLARE
    @x3 float = 1, 
    @y3 float = 0.547927305023,
    @x4 float = 31465, 
    @y4 float = 10.0552890166;

-- Formula:
SELECT Threshold =
    (
        (@x1 * @y2 - @y1 * @x2) * (@x3 - @x4) - 
        (@x1 - @x2) * (@x3 * @y4 - @y3 * @x4)
    )
    /
    (
        (@x1 - @x2) * (@y3 - @y4) - 
        (@y1 - @y2) * (@x3 - @x4)
    );

-- Returns 1502.06521571273

Rounded to six significant figures, that result matches the 1502.07 rows shown in the adaptive join execution plan:

Adaptive join plan with a threshold of 1502.07 rows

Defect or Design?

Recall that SQL Server needs four points to ‘draw’ the row count versus cost lines to find the adaptive join threshold. In the present case, that means finding cost estimations for the one-row and 31,465-row cardinalities, for both apply and hash join implementations.

The optimizer calls a routine named sqllang!CuNewJoinEstimate to calculate these four costs for an adaptive join. Sadly, there are no trace flags or extended events to provide a handy overview of this activity. The normal trace flags used to investigate optimizer behaviour and display costs do not function here (see the Appendix if you are interested in more detail).

The only way to obtain the one-row cost estimates is to attach a debugger and set a breakpoint after the fourth call to CuNewJoinEstimate in the code for sqllang!CardSolveForSwitch. I used WinDbg to obtain this call stack on SQL Server 2019 CU12:

Call stack

At this point in the code, double-precision floating points costs are stored in four memory locations pointed to by addresses at rsp+b0, rsp+d0, rsp+30, and rsp+28 (where rsp is a CPU register and offsets are in hexadecimal):

Debugger output

The operator subtree cost numbers shown match those used in the adaptive join threshold calculation formula.

About those one-row cost estimates

You may have noticed that the estimated subtree costs for the one-row joins seem quite high for the amount of work involved in joining one row:

1 row
	Hash cost 0.999027422729
	Apply cost 0.547927305023

If you try to produce one-row input execution plans for the hash join and apply examples, you will see much lower estimated subtree costs at the join than those shown above. Likewise, running the original query with a row goal of one (or the number of join output rows expected for an input of one row) will also produce an estimated cost that is way lower than shown.

The reason is the CuNewJoinEstimate routine estimates the one-row case in a way that I think most people would not find intuitive.

The final cost is made up of three main components:

The build input subtree cost.
The local cost of the join.
The probe input subtree cost.

Items 2 and 3 depend on the type of join. For a hash join, they account for the cost of reading all the rows from the probe input, matching them (or not) with the one row in the hash table, and passing the results on to the next operator. For an apply, the costs cover one seek on the lower input to the join, the internal cost of the join itself, and returning the matched rows to the parent operator.

None of that is unusual or surprising.

The cost surprise

The surprise comes on the build side of the join (item 1 in the list). One might expect the optimizer to do some fancy calculation to scale the already-calculated subtree cost for 31,465 rows down to one average row, or something like that.

In fact, both hash and apply one-row join estimates simply use the whole subtree cost for the original cardinality estimate of 31,465 rows. In our running example, this ‘subtree’ is the 0.54456 cost of the batch mode clustered index seek on the header table:

Adaptive join build input cost

To be clear: The build-side estimated costs for the one-row join alternatives use an input cost calculated for 31,465 rows. That should strike you as a bit odd.

As a reminder, the one-row costs computed by CuNewJoinEstimate were:

1 row
	Hash cost 0.999027422729
	Apply cost 0.547927305023

You can see that the total apply cost (~0.54793) is dominated by the 0.54456 build-side subtree cost, with a tiny extra amount for the single inner-side seek, processing the small number of resulting rows within the join, and passing them on to the parent operator.

The estimated one-row hash join cost is higher because the probe side of the plan consists of a full index scan, where all resulting rows must pass through the join. The total cost of the one-row hash join is a little lower than the original 1.05095 cost for the 31,465-row example because there is now only one row in the hash table.

Implications

One would expect a one-row join estimate to be based, in part, on the cost of delivering one row to the driving join input. As we have seen, that is not the case for an adaptive join: Both apply and hash alternatives are saddled with the full estimated cost for 31,465 rows. The rest of the join is costed pretty much as one would expect for a one-row build input.

This intuitively-strange arrangement is why it is difficult (perhaps impossible) to show an execution plan mirroring the calculated costs. We would need to construct a plan that delivers 31,465 rows to the upper join input, but costs the join itself and its inner input as if only one row were present. A tough ask.

The effect of all this is to raise the leftmost point on our intersecting-lines diagram up the y-axis. This in turn affects the slope of the line, and so the intersection point.

Another practical effect is that the calculated adaptive join threshold now depends on the original cardinality estimate at the hash build input, as noted by Joe Obbish in his 2017 blog post. For example, if we change the WHERE clause in the test query to SOH.SalesOrderID <= 55000, the adaptive threshold reduces from 1502.07 to 1259.8 without changing the query plan hash. Same plan, different threshold.

This arises because, as we have seen, the internal one-row cost estimate depends on the build input cost for the original cardinality estimate. This means that different initial build-side estimates will give a different y-axis ‘boost’ to the one-row estimate. This in turn means the line will have a different slope, and so a different intersection point.

Intuition would suggest that the one-row estimate for the same join should always give the same value, regardless of the other cardinality estimate on the line (given that the exact same join with the same properties and row sizes has a close-to-linear relationship between driving rows and cost). This is not the case for an adaptive join.

By design?

I can tell you with some confidence what SQL Server does when calculating the adaptive join threshold. I do not have any special insight as to why it does it this way. That said, there are some reasons to think this arrangement is deliberate and came about after due consideration and feedback from testing. The remainder of this section covers some of my thoughts on this aspect.

An adaptive join is not a straight choice between a normal apply and batch mode hash join. An adaptive join always starts by fully populating the hash table. Only once that work is complete is the decision made to switch to an apply implementation or not.

By that time, we have already incurred potentially significant cost by populating and partitioning the hash join in memory. This may not matter much for the one row case, but it becomes progressively more important as cardinality increases. The unexpected ‘boost’ may be a way to incorporate these realities into the calculation, while retaining a reasonable computation cost.

The SQL Server cost model has long been a bit biased against nested loops join, arguably with some justification. Even the ideal indexed apply case can be slow in practice if the data needed is not already in memory, and the I/O subsystem is not particularly flash, especially with a somewhat random access pattern. Limited amounts of memory and sluggish I/O will not be entirely unfamiliar to users of lower-end cloud-based database engines, for example.

It is possible that practical testing in such environments revealed that an intuitively-costed adaptive join was too quick to transition to an apply. Theory is sometimes only great in theory.

Still, the current situation is not ideal: Caching a plan based on an unusually-low cardinality estimate will produce an adaptive join that is much more reluctant to switch to an apply than it would have been with a larger initial estimate. This is a variety of the parameter-sensitivity problem, but it will be a new consideration of that type for many of us.

Now, it is also possible that using the full build input subtree cost for the leftmost point of the intersecting cost lines is simply an error or oversight that has gone uncorrected. My feeling is that the current implementation is probably a deliberate practical compromise, but you would need someone with access to the design documents and source code to know for sure.

Summary

An adaptive join allows SQL Server to transition from a batch mode hash join to an apply after the hash table has been fully populated. It makes this decision by comparing the number of rows in the hash table with a pre-calculated adaptive threshold.

The threshold is computed by predicting where apply and hash join costs would be equal. To find this point, SQL Server produces a second internal join cost estimate for a different build input cardinality—normally one row.

Surprisingly, the estimated cost for the one-row estimate includes the full build-side subtree cost for the original cardinality estimate (not scaled to one row). This means the threshold value depends on the original cardinality estimate at the build input.

As a consequence, an adaptive join may have a threshold value that is unexpectedly low, meaning the adaptive join is much less likely to transition away from a hash join. It is unclear if this behaviour is by design.

Introducing Batch Mode Adaptive Joins by Joe Sack.
Understanding Adaptive Joins in the product documentation.
Adaptive Join Internals by Dima Piliugin.
How do Batch Mode Adaptive Joins work? on Database Administrators Stack Exchange by Erik Darling.
An Adaptive Join Regression by Joe Obbish.
If You Want Adaptive Joins, You Need Wider Indexes and Is Bigger Better? by Erik Darling.
Parameter Sniffing: Adaptive Joins by Brent Ozar.
Intelligent Query Processing Q&A by Joe Sack.

Appendix

This section covers a couple of adaptive join aspects that were difficult to include in the main text in a natural way.

The expanded adaptive plan

You might try looking at a visual representation of the internal plan using undocumented trace flag 9415 as provided by Dima Piliugin in his excellent adaptive join internals article linked just above. With that flag active, the adaptive join plan for our running example becomes:

Execution plan shown with TF 9415

This is a useful representation to aid understanding, but it is not entirely accurate, complete, or consistent. For example, the Table Spool doesn’t exist — it is a default representation for the adaptive buffer reader that reads rows directly from the batch mode hash table.

The operator properties and cardinality estimates are also a bit all over the place. The output from the adaptive buffer reader (‘spool’) should be 31,465 rows not 121,317. The subtree cost of the apply is incorrectly capped by the parent operator cost. This is normal for showplan, but it makes no sense in an adaptive join context.

There are other inconsistencies as well—too many to usefully list—but that can happen with undocumented trace flags. The expanded plan shown above is not intended for use by end-users, so perhaps it is not entirely surprising. The message here is not to rely too heavily on the numbers and properties shown in this undocumented form.

I should also mention in passing that the finished standard adaptive join plan operator isn’t entirely without its own consistency issues. These stem pretty much exclusively from the hidden details.

For example, the displayed adaptive join properties come from a mixture of the underlying Concat, Hash Join, and Apply operators. You can see an adaptive join reporting batch mode execution for nested loops join (which is impossible) and the elapsed time shown is actually copied from the hidden Concat, not the particular join that executed at runtime.

The usual suspects

We can get some useful information from the sorts of undocumented trace flags normally used to look at optimizer output. For example:

SELECT SOH.SubTotal
FROM Sales.SalesOrderHeader AS SOH
JOIN Sales.SalesOrderDetail AS SOD
    ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE SOH.SalesOrderID <= 75123
OPTION (
    QUERYTRACEON 3604,
    QUERYTRACEON 8607,
    QUERYTRACEON 8612);

Output heavily edited for readability:

*** Output Tree: ***

PhyOp_ExecutionModeAdapter(BatchToRow) Card=121317 Cost=1.05095

    PhyOp_Concat (batch) Card=121317 Cost=1.05325

    PhyOp_HashJoinx_jtInner (batch) Card=121317 Cost=1.05083

        PhyOp_Range Sales.SalesOrderHeader Card=31465 Cost=0.54456

        PhyOp_Filter(batch) Card=121317 Cost=0.397185
            PhyOp_Range Sales.SalesOrderDetail Card=121317 Cost=0.338953

    PhyOp_ExecutionModeAdapter(RowToBatch) Card=121317 Cost=10.0798

        PhyOp_Apply Card=121317 Cost=10.0553

            PhyOp_ExecutionModeAdapter(BatchToRow) Card=31465 Cost=0.544623
                PhyOp_Range Sales.SalesOrderHeader Card=31465 Cost=0.54456

            PhyOp_Filter Card=3.85562 Cost=9.00356
                PhyOp_Range Sales.SalesOrderDetail Card=3.85562 Cost=8.94533

This gives some insight to the estimated costs for the full-cardinality case with hash and apply alternatives without writing separate queries and using hints. As mentioned in the main text, these trace flags are not effective within CuNewJoinEstimate, so we cannot directly see the repeat calculations for the 31,465 row case, nor any of the details for the one-row estimates this way.

Merge join and row mode hash join

Adaptive joins only offer a transition from batch mode hash join to row mode apply. For the reasons why row mode hash join is not supported, see the Intelligent Query Processing Q & A in the Related Reading section. In short, it is thought that row mode hash joins would be too prone to performance regressions.

Switching to a row mode merge join would be another option, but the optimizer does not currently consider this. As I understand it, the feature is unlikely to be expanded in this direction in future.

Some of the considerations are the same as for row mode hash join. In addition, merge join plans tend to be less easily interchangeable with hash join, even if we limit ourselves to indexed merge join (no explicit sort). There is also a much greater distinction between hash and apply than there is between hash and merge. Both hash and merge are suitable for larger inputs while apply is better suited to a smaller driving input. Merge join is not as easily parallelized as hash join, and does not scale as well with increasing thread counts.

Given the motivation for adaptive joins is to cope better with significantly varying input sizes, and that only hash join supports batch mode processing, the choice of batch hash versus row apply is the more natural one. Finally, having three adaptive join choices would significantly complicate the threshold calculation, for potentially little gain.

Thanks for reading.