Sunday, 09 June 2019

Apply versus Nested Loops Join

SQL is a declarative language. We use SQL to write a logical query specification that defines the results we want. For example, we might write a query using either APPLY or JOIN that logically describes exactly the same results.

It is up to the query optimizer to find an efficient physical implementation of that logical requirement. SQL Server is free to choose any plan it likes, so long as the results are guaranteed to be the same as specified in the original SQL.

The optimizer is capable of transforming an apply to a join and vice versa. It generally tries to rewrite apply to join during initial compilation to maximize the searchable plan space during cost-based optimization. Having transformed an apply to a join early on, it may also consider a transformation back to an apply shape later on to assess the merits of e.g. an index loops join.

Physical Operators

The optimizer’s output may contain both apply and nested loops join physical operations. Both are shown in execution plans as a Nested Loops Join operator, but they have different properties:

Apply : The Nested Loops Join operator has Outer References. These describe parameter values passed from the outer (upper) side of the join to operators on the inner (lower) side of the join. The value of the each parameter may change on each iteration of the loop. The join predicate is evaluated (given the current parameter values) by one or more operators on the inner side of the join. The join predicate is not evaluated at the join itself.

Join : The Nested Loops Join operator has a Predicate (unless it is a cross join). It does not have any Outer References. The join predicate is always evaluated at the join operator.

Nested Loops Join Example

The following queries both produce the same Nested Loops Join plan whether APPLY or JOIN syntax is used in the SQL query specification:

DECLARE @T1 table (c1 integer NULL);
DECLARE @T2 table (c2 integer NULL);

INSERT @T1 (c1)
VALUES (1), (2), (3);

INSERT @T2 (c2)
VALUES (1), (2), (3);

-- Join syntax
SELECT
    T1.c1,
    T2.c2
FROM @T1 AS T1
JOIN @T2 AS T2
    ON T2.c2 > T1.c1;

-- Apply syntax
SELECT 
    T1.c1, 
    CA.c2 
FROM @T1 AS T1
CROSS APPLY 
(
    SELECT T2.c2
    FROM @T2 AS T2
    WHERE T2.c2 > T1.c1
) AS CA;

The queries both describe the same result set regardless of the contents of the tables. The execution plans are identical, and have the same QueryPlanHash value.

In both cases, the query optimizer chose a Nested Loops Join rather than an Apply — the join operator has a Predicate, and no Outer References:

Nested Loops Join plan

This plan executes as follows:

For each row from table @T2:
- For each row from table @T1:
  - Test the predicate T2.c2 > T1.c1 at the join

Join syntax optimization

Starting with the SQL JOIN form of the query:

-- Join syntax
SELECT
    T1.c1,
    T2.c2
FROM @T1 AS T1
JOIN @T2 AS T2
    ON T2.c2 > T1.c1;

We can see the input tree passed to the optimizer using trace flag 8606 (TF 3604 will also need to be on to direct the debug output shown to the SSMS messages tab):

*** Input Tree: ***
LogOp_Project QCOL: [T1].c1 QCOL: [T2].c2
    LogOp_Select
    LogOp_Join
        LogOp_Get TBL: @T1
        LogOp_Get TBL: @T2
        ScaOp_Const TI(bit,ML=1) XVAR(bit,Value=1)
    ScaOp_Comp x_cmpGt
        ScaOp_Identifier QCOL: [T2].c2
        ScaOp_Identifier QCOL: [T1].c1
    AncOp_PrjList

This is logically a cross join of the two tables followed by a relational selection (filter) on the join predicate. The optimizer simplifies this via rule SELonJN (relational selection on join) to an inner join:

*** Simplified Tree: ***
LogOp_Join
    LogOp_Get TBL: @T1
    LogOp_Get TBL: @T2
    ScaOp_Comp x_cmpGt
    ScaOp_Identifier QCOL: [T2].c2
    ScaOp_Identifier QCOL: [T1].c1

The optimizer considers rewriting this logical join to a logical apply during cost-based optimization via the exploration rule JoinToIndexOnTheFly. The replacement candidate features an inner apply instead of a join, and an inner-side Eager Index Spool (extract of trace flag 8619 output shown):

LogOp_Apply (x_jtInner)
    Leaf-Op 3
    LogOp_Spool Index on fly Eager
        Leaf-Op 4
        ScaOp_Comp x_cmpGt
            ScaOp_Identifier QCOL: [T2].c2
            ScaOp_Identifier QCOL: [T1].c1

It also considers swapping the inner join inputs via JoinCommute:

Apply Rule: JoinCommute - A JOIN B -> B JOIN A

LogOp_Join
    Leaf-Op 4
    Leaf-Op 3
    Leaf-Op 2

Having generated a number of logical alternatives, the optimizer moves on to implementation using physical operators. The apply alternatives generated via ApplyToNL and BuildSpool include options for an eager spool, a lazy spool, and no spool at all:

-- Eager spool
PhyOp_Spool Index-on-fly EAGER
      Leaf-Op 4
      Leaf-Op 2

-- Logical Lazy spool
PhyOp_Apply (x_jtInner)
    Leaf-Op 3
    LogOp_Spool Lazy
        Leaf-Op 6
        
-- Physical Lazy spool
PhyOp_Spool LAZY
    Leaf-Op 6

-- No spool
PhyOp_Apply (x_jtInner)
    Leaf-Op 3
    Leaf-Op 6

The optimizer also considers implementing the logical join as a physical Nested Loops Join via implementation rule JNtoNL (join to nested loops):

PhyOp_LoopsJoin x_jtInner 
    Leaf-Op 3
    Leaf-Op 4
    ScaOp_Comp x_cmpGt
        ScaOp_Identifier QCOL: [T2].c2
        ScaOp_Identifier QCOL: [T1].c1

Notice the physical operator is PhyOp_LoopsJoin rather than the PhyOp_Apply seen previously. The optimizer is considering both options.

After costing the promising combinations of join type, join order, and spool type, the optimizer decides that the Nested Loops Join option is cheapest, with a total cost of 0.00657068 units.

The final optimizer output can be seen with TF 8607:

*** Output Tree: ***
PhyOp_LoopsJoin x_jtInner
    PhyOp_Range TBL: @T2
    PhyOp_Range TBL: @T1
    ScaOp_Comp x_cmpGt
    ScaOp_Identifier QCOL: [T2].c2
    ScaOp_Identifier QCOL: [T1].c1

After conversion to a form usable by the execution engine (TF 7352):

Inner Join Nested Loops (0) 
    [QCOL: [T1].c1 TI(int,Null,ML=4)]
    [QCOL: [T2].c2 TI(int,Null,ML=4)]
  Table Scan Table Scan (1) 
        [QCOL: [T2].c2 TI(int,Null,ML=4)]
  Table Scan Table Scan (2) 
        [QCOL: [T1].c1 TI(int,Null,ML=4)]

Apply syntax optimization

Starting now with the APPLY query form:

SELECT 
    T1.c1, 
    CA.c2 
FROM @T1 AS T1
CROSS APPLY 
(
    SELECT T2.c2
    FROM @T2 AS T2
    WHERE T2.c2 > T1.c1
) AS CA;

The input tree now features a logical apply instead of the join we saw before:

LogOp_Project QCOL: [T1].c1 QCOL: [T2].c2
    LogOp_Apply (x_jtInner)
    LogOp_Get TBL: @T1
    LogOp_Project
        LogOp_Select
            LogOp_Get TBL: @T2
            ScaOp_Comp x_cmpGt
                ScaOp_Identifier QCOL: [T2].c2
                ScaOp_Identifier QCOL: [T1].c1
        AncOp_PrjList 
    AncOp_PrjList

The ApplyHandler simplification rule transforms this tree into a join (output from TF 8621 shown):

Rule applied: APPLY stack -> Join stack

*** Simplified Tree: ***
LogOp_Join
    LogOp_Get TBL: @T1
    LogOp_Get TBL: @T2
    ScaOp_Comp x_cmpGt
    ScaOp_Identifier QCOL: [T2].c2
    ScaOp_Identifier QCOL: [T1].c1

This is exactly the same logical tree as when the query was written using JOIN SQL syntax. Optimization continues the same as before with the same resulting Nested Loops Join plan, even though we started with APPLY syntax.

Apply Example

We can modify the test script to produce an Apply plan (whether APPLY or JOIN syntax is used in the query specification) by adding an index to one of the tables:

DECLARE @T1 table (c1 integer NULL);
DECLARE @T2 table (c2 integer NULL INDEX i); -- New index

INSERT @T1 (c1)
VALUES (1), (2), (3);

INSERT @T2 (c2)
VALUES (1), (2), (3);

-- Apply syntax
SELECT T1.c1, CA.c2 
FROM @T1 AS T1
CROSS APPLY 
(
    SELECT * 
    FROM @T2 AS T2
    WHERE T2.c2 > T1.c1
) AS CA;

-- Join syntax
SELECT
    T1.c1,
    T2.c2
FROM @T1 AS T1
JOIN @T2 AS T2
    ON T2.c2 > T1.c1;

Again, these queries describe the same result set and the execution plans are the same for both query forms.

This time, the query optimizer chooses Apply rather than Nested Loops Join, so the join has Outer References and no Predicate:

Apply plan

This plan executes as follows:

For each row from table @T1
- Return each row from table @T2 that matches the predicate T2.c2 > T1.c1. The predicate is evaluated by the Index Seek operator, not at the join.

The important difference is that the current value of column c1 is passed as an outer reference to the inner side of the join so the Index Seek can efficiently locate matching rows:

Index Seek predicate

This is more efficient than testing all rows at the join operator.

Apply Syntax Optimization

Starting our analysis with the APPLY query form:

SELECT T1.c1, CA.c2 
FROM @T1 AS T1
CROSS APPLY 
(
    SELECT * 
    FROM @T2 AS T2
    WHERE T2.c2 > T1.c1
) AS CA;

The input tree features a logical apply:

*** Input Tree: ***
LogOp_Project QCOL: [T1].c1 QCOL: [T2].c2
    LogOp_Apply (x_jtInner)
    LogOp_Get TBL: @T1
    LogOp_Project
        LogOp_Select
            LogOp_Get TBL: @T2
            ScaOp_Comp x_cmpGt
                ScaOp_Identifier QCOL: [T2].c2
                ScaOp_Identifier QCOL: [T1].c1
        AncOp_PrjList 
    AncOp_PrjList

It is transformed to a join by ApplyHandler as before:

Rule applied: APPLY stack -> Join stack

*** Simplified Tree: ***
LogOp_Join
    LogOp_Get TBL: @T1
    LogOp_Get TBL: @T2
    ScaOp_Comp x_cmpGt
    ScaOp_Identifier QCOL: [T2].c2
    ScaOp_Identifier QCOL: [T1].c1

During cost-based optimization, the optimizer considers the same options as before, including:

JoinCommute to swap the join input order.
JoinToIndexOnTheFly to convert the join to an apply with inner-side eager index spool.

The changed schema (new index on table @T2) means a new rule JNtoIdxLookup now matches the optimization tree. As the name suggests, this new exploration rule generates a logically-equivalent subtree by replacing a join with an apply plus inner-side index lookup:

Apply Rule: JNtoIdxLookup - JN -> IDX LOOKUP

PhyOp_Apply lookup TBL: @T2 (1) (x_jtInner)
    Leaf-Op 3
    LogOp_SelectIdx(1)
        LogOp_GetIdx TBL: @T2 index 1
        ScaOp_Comp x_cmpGt
            ScaOp_Identifier QCOL: [T2].c2
            ScaOp_Identifier QCOL: [T1].c1

The optimizer also considers lazy spool and no spool options for the apply alternatives generated by JoinToIndexOnTheFly, but the apply option generated by JNtoIdxLookup turns out to be cheapest, with a final plan cost of 0.00657038 units:

*** Output Tree: ***
PhyOp_Apply lookup TBL: @T2 (1) (x_jtInner)
    PhyOp_Range TBL: @T1
    PhyOp_Range TBL: @T2
    ScaOp_Comp x_cmpGt
        ScaOp_Identifier QCOL: [T2].c2
        ScaOp_Identifier QCOL: [T1].c1

The copy out tree is:

Inner Join Nested Loops (0) 
    [QCOL: [T1].c1 TI(int,Null,ML=4)]
    [QCOL: [T2].c2 TI(int,Null,ML=4)]
  Table Scan Table Scan (1) 
        [QCOL: [T1].c1 TI(int,Null,ML=4)]
  Index Seek Index Seek (2)  CP 
        [QCOL: [T2].c2 TI(int,Null,ML=4)]

Note that we started with SQL APPLY and had a logical apply in the input tree. This was transformed into a join, and later converted to a physical apply with index lookup during cost-based optimization.

Join Syntax Optimization

Starting instead with the JOIN SQL syntax:

SELECT
    T1.c1,
    T2.c2
FROM @T1 AS T1
JOIN @T2 AS T2
    ON T2.c2 > T1.c1;

The input tree features a cross join and relational selection, which is simplified to an inner join by SELonJN:

*** Input Tree: ***
LogOp_Project QCOL: [T1].c1 QCOL: [T2].c2
    LogOp_Select
    LogOp_Join
        LogOp_Get TBL: @T1
        LogOp_Get TBL: @T2
        ScaOp_Const TI(bit,ML=1) XVAR(bit,Value=1)
    ScaOp_Comp x_cmpGt
        ScaOp_Identifier QCOL: [T2].c2
        ScaOp_Identifier QCOL: [T1].c1
    AncOp_PrjList 

Rule applied: SEL JN -> JN

*** Simplified Tree: ***
LogOp_Join
    LogOp_Get TBL: @T1
    LogOp_Get TBL: @T2
    ScaOp_Comp x_cmpGt
    ScaOp_Identifier QCOL: [T2].c2
    ScaOp_Identifier QCOL: [T1].c1

This is exactly the same as the simplified tree for the APPLY syntax after its apply was transformed to a join by ApplyHandler. The remainder of the optimization process continues exactly as before, with the same physical apply plan chosen via the JNtoIdxLookup implementation rule.

Summary

The written form of the SQL query can have an impact on the initial logical tree passed into the compilation and optimization process, but the optimizer can often convert an apply into a join, and can pretty much always consider an apply alternative to a join.

Whether APPLY or JOIN syntax is used in the source query does not necessarily determine whether the execution plan uses an apply (nested loops join with outer references) or a regular loops join (predicate evaluated at the join).

An apply plan may be more intuitive, but the correlations (outer references) mean the only available physical implementation has to be based on nested loops. A join without correlations can be implemented by nested loops, hash match, and sort-merge. The optimizer explores the relative merits of apply and join as it tries to find a reasonable execution plan in a sensible amount of time.