Sunday, 31 May 2020

Pulling Group By Above a Join

One of the transformations available to the SQL Server query optimizer is pulling a logical Group By (and any associated aggregates) above a Join.

Visually, this means transforming a tree of logical operations from:

Group By Below Join

…to this:

Group By Above Join

The above diagrams are logical representations. They need to be implemented as physical operators to appear in an execution plan. The options are:

Group By
- Hash Match Aggregate
- Stream Aggregate
- Distinct Sort
Join
- Nested Loops Join
- Nested Loops Apply
- Hash Match Join
- Merge Join

When the optimizer moves a Group By above a Join it has to preserve the semantics. The new sequence of operations must be guaranteed to return the same results as the original in all possible circumstances.

One cannot just pick up a Group By and arbitrarily move it around the query tree without risking incorrect results.

Conditions

The optimizer can move a Group By above a Join using the rule GbAggAfterJoin. The only conditions that must be met to consider this transformation safely are:

The other input to the join must have a key.
The join predicate must not use any aggregates computed by the moving Group By.

When the optimizer pulls a Group By above a Join, it will choose between the alternatives using estimated costs.

Demo

Using the AdventureWorks OLTP sample database, the following query joins the Product table to a derived table containing a Group By aggregate:

SELECT P.[Name], J.*
FROM Production.Product AS P
JOIN
(
    SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
    FROM Production.TransactionHistory AS TH
    GROUP BY TH.ProductID, TH.Quantity
) AS J
    ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%'
OPTION (FORCE ORDER);

The FORCE ORDER hint has two effects. First, it ensures the tables are physically joined in the order they appear in the query text. Second, it prevents the optimizer from reordering aggregates.

The resulting execution plan has an estimated cost of 3.64687 units:

Execution plan with FORCE ORDER

The Group By is below the join, implemented in this case with a Hash Match Aggregate physical operator. The FORCE ORDER hint prevented the optimizer from considering moving the Group By above the Join.

Without the hint

Removing the FORCE ORDER hint:

SELECT P.[Name], J.*
FROM Production.Product AS P
JOIN
(
    SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
    FROM Production.TransactionHistory AS TH
    GROUP BY TH.ProductID, TH.Quantity
) AS J
    ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%';

…gives a plan with an estimated cost of 1.3409 units:

Natural execution plan

Analysis

The optimizer has pulled the Group By above the Join. Aggregating after the join is assessed to be cheaper because the join eliminates many rows. The join is slightly more expensive than before because it processes more rows on the probe side, but this difference is overwhelmed by the cheaper aggregate.

The optimizer chose to implement the logical Group By as a Stream Aggregate, which requires input sorted on the grouping keys. A Sort operator is introduced to meet that requirement. The combination of Sort plus Stream Aggregate is estimated to be slightly cheaper than a Hash Match Aggregate.

The optimizer was able to consider pulling the Group By above the Join because the Product table has a key, and the join predicate does not use any aggregates computed by the moving Group By.

Grouping Keys

The grouping keys used by the Stream Aggregate in the prior example are ProductID and Quantity. This matches the GROUP BY clause of the derived table in this case, but that is because I chose a deliberately simple example to get us started.

In general, pulling a Group By above a join requires grouping by a key from the other join input.

To see this in action, let’s use a temporary table to join from instead of the Product table. We will make the Name column the only key using a UNIQUE constraint:

CREATE TABLE #P 
(
    ProductID integer NOT NULL, 
    [Name] nvarchar(50) 
        COLLATE DATABASE_DEFAULT 
        NOT NULL 
        UNIQUE
);
GO
INSERT #P 
(
    ProductID, 
    [Name]
)
SELECT 
    P.ProductID, 
    P.[Name]
FROM Production.Product AS P;

Our query is now:

SELECT P.[Name], J.*
FROM #P AS P
JOIN
(
    SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
    FROM Production.TransactionHistory AS TH
    GROUP BY TH.ProductID, TH.Quantity
) AS J
    ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%';

The execution plan is:

Temporary table keyed on Product Name

The Stream Aggregate now computes its COUNT(*) aggregate grouped by (Name, Quantity) instead of (ProductID, Quantity) as in the query text and previous plans.

This has the same semantics as the original query because the ProductID column is functionally determined by the Name key.

Getting every nuance of this sort of relational transformation correct can be tricky. It is very handy that the optimizer team put the effort in so we do not have to explore these tricky rewrites manually (e.g. by changing the query text). If nothing else, it would be extremely tedious to write all the different query forms out by hand just to see which one performed better in practice. Never mind choosing a different version depending on current statistics and the number of changes to the table.

No Key

What happens if the (non-grouping) table does not have a key? Will the optimizer still be able to pull the Group By above the inner join?

Let’s find out by recreating and repopulating the temporary table without any keys:

DROP TABLE IF EXISTS #P;
GO
CREATE TABLE #P 
(
    ProductID integer NOT NULL, 
    [Name] nvarchar(50) 
        COLLATE DATABASE_DEFAULT 
        NOT NULL
        -- No longer UNIQUE!
);
GO
INSERT #P 
(
    ProductID, 
    [Name]
)
SELECT 
    P.ProductID, 
    P.[Name]
FROM Production.Product AS P;

Our query:

SELECT P.[Name], J.*
FROM #P AS P
JOIN
(
    SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
    FROM Production.TransactionHistory AS TH
    GROUP BY TH.ProductID, TH.Quantity
) AS J
    ON J.ProductID = P.ProductID
WHERE P.[Name] LIKE N'A%';

…produces a very similar plan:

Temporary table with no key

With no key column(s) available from the table, the optimizer has chosen the only unique identifier available — the heap row locator (RID or “bookmark”). This is projected from the Table Scan and labelled Bmk1000 in this case. A bookmark must uniquely identify each row, and is therefore a suitable key to group by.

No Table

What if there isn’t even a table? What will SQL Server do for a key then?

The following rewrite of the original query replaces the Product table and LIKE N'A%' predicate with a hard-coded list of values:

SELECT P.[Name], J.*
FROM
(
    VALUES
        (1, N'Adjustable Race'),
        (879, N'All-Purpose Bike Stand'),
        (712, N'AWC Logo Cap')
)
AS P (ProductID, [Name])
JOIN
(
    SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
    FROM Production.TransactionHistory AS TH
    GROUP BY TH.ProductID, TH.Quantity
) AS J
    ON J.ProductID = P.ProductID;

The execution plan has a familiar shape:

Plan with VALUES list

The Constant Scan is an in-memory “table” of values, but it has no row locator since it is not a physical table.

The optimizer found a key by inspecting the VALUES clause! The two columns are projected as Union1006 and Union1007. The optimizer can see that the literal values we entered do not contain any duplicates, so it could choose either column (ProductID or Name) as a key. Neat!

No Table and No Key

Let’s see what happens if we introduce a duplicate in the VALUES clause so there is no longer a key of any kind:

SELECT P.[Name], J.*
FROM
(
    VALUES
        (1, N'Adjustable Race'),
        (879, N'All-Purpose Bike Stand'),
        (879, N'All-Purpose Bike Stand'), -- DUPLICATE!
        (712, N'AWC Logo Cap')

)
AS P (ProductID, [Name])
JOIN
(
    SELECT TH.ProductID, TH.Quantity, NumRows = COUNT_BIG(*)
    FROM Production.TransactionHistory AS TH
    GROUP BY TH.ProductID, TH.Quantity
) AS J
    ON J.ProductID = P.ProductID;

The plan is:

VALUES list with duplicates

With both columns in the VALUES clause containing duplicates, the optimizer has one final card to play. It resorts to manufacturing a key as part of the GbAggAfterJoin exploration.

The manufactured key is named I8Rank1013 in this instance. It refers to the result of a ROW_NUMBER calculation performed by the Segment and Sequence Project operators.

Since we deprived it of any other key source, the optimizer uniquely numbered the rows defined in the VALUES clause. How cool is that?

Thanks for reading.