Monday, 12 March 2012

Fun with Scalar and Vector Aggregates

Introduction

There are interesting things to be learned from even the simplest queries.

For example, imagine you are asked to write a query that lists AdventureWorks product names, where the product has at least one entry in the transaction history table, but fewer than ten. One query that meets that specification is:

SELECT
    P.[Name]
FROM Production.Product AS P
JOIN Production.TransactionHistory AS TH
    ON P.ProductID = TH.ProductID
GROUP BY
    P.ProductID,
    P.[Name]
HAVING
    COUNT_BIG(*) < 10;

The query correctly returns 23 rows. The execution plan and a sample of the data returned is shown below:

Product name query

Execution Plan

The execution plan looks a bit different from the written form of the query. The base tables are accessed in reverse order, and the aggregation is performed before the join.

The general strategy in the plan is to read all rows from the TransactionHistory table, compute the count of rows grouped by ProductID, merge join that result with the Product table on ProductID, and finally Filter to only return rows where the count is less than ten.

This ‘fully-optimized’ plan has an estimated cost of around 0.365 units.

The reason for the quote marks there is that this plan is not quite as optimal as it could be—surely it would make sense to push the Filter down past the join too?

Alternate Formulations

To answer that, let’s look at some other ways to formulate this query. This being SQL, there are any number of ways to write a logically-equivalent query, so we’ll just look at a couple of interesting ones.

Pre-aggregation

The first query is an attempt to reverse-engineer T-SQL from the execution plan shown previously. It joins the result of pre-aggregating the TransactionHistory table to the Product table before filtering:

SELECT
    P.[Name]
FROM
(
    SELECT 
        TH.ProductID, 
        cnt = COUNT_BIG(*) 
    FROM Production.TransactionHistory AS TH
    GROUP BY
        TH.ProductID
) AS Q1
JOIN Production.Product AS P
    ON P.ProductID = Q1.ProductID
WHERE
    Q1.cnt < 10;

Perhaps surprisingly, we get a different execution plan:

Different plan

The results are the same (23 rows) but this time the Filter is pushed below the join!

The optimizer chooses Nested Loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual). Even so, forcing a Merge Join with an OPTION (MERGE JOIN) hint still gives a plan with the Filter below the join.

Using `HAVING`

In another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the Q1 sub-query instead of in the WHERE clause:

SELECT
    P.[Name]
FROM
(
    SELECT 
        TH.ProductID, 
        cnt = COUNT_BIG(*) 
    FROM Production.TransactionHistory AS TH
    GROUP BY
        TH.ProductID
    HAVING COUNT_BIG(*) < 10
) AS Q1
JOIN Production.Product AS P
    ON P.ProductID = Q1.ProductID;

The reason this predicate can be pushed past the join in this query form, but not in the original formulation, is simply an optimizer limitation—it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not comprehensive.

Example Two

The new specification is:

List the products where there exists fewer than ten correlated rows in the history table”:

A first attempt:

SELECT
    P.[Name]
FROM Production.Product AS P
WHERE EXISTS 
(
    SELECT
        *
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.ProductID = P.ProductID 
    HAVING 
        COUNT_BIG(*) < 10
);

Unfortunately, this query produces an incorrect result (86 rows):

EXISTS plan

The problem with this query is it lists products with no history rows, though the reasons behind that are interesting.

The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set. For the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL when presented with an empty set).

Aggregating an empty set

To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist:

-- Scalar aggregate
SELECT
    COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE
    TH.ProductID = 709;

-- Vector aggregate
SELECT
    COUNT_BIG(*)
FROM Production.TransactionHistory AS TH
WHERE
    TH.ProductID = 709
GROUP BY
    TH.ProductID;

The estimated execution plans for these two statements are almost identical:

Count of history rows for product 709

You might expect the Stream Aggregate to have a Group By property for the second statement, but this is not the case. The query includes ProductID = 709, so all matching rows are guaranteed to have the same value for ProductID. The optimizer recognises this, and the GROUP BY TH.ProductID is optimized away.

There are some other minor differences between the two plans. The first qualifies for trivial plan compilation, and is subject to simple parameterization. The second plan requires cost-based optimization, and is not parameterized by SQL Server.

Still, there is nothing to indicate that one query uses a scalar aggregate and the other a vector aggregate. This is something I would like to see exposed in executions plan so I suggested it on Connect (archive link). It was closed as Won’t Fix, though they did say:

Thank you for the suggestion regarding show plan. We’ll consider it for a future release.

Anyway, running the two queries shows the difference quite clearly:

Scalar aggregate versus vector aggregate

The scalar aggregate (no GROUP BY) returns a result of zero, while the vector aggregate (with a GROUP BY clause) returns nothing at all.

Fixing the `EXISTS` Query

Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero:

SELECT
    P.[Name]
FROM Production.Product AS P
WHERE EXISTS 
(
    SELECT
        *
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.ProductID = P.ProductID 
    HAVING
        COUNT_BIG(*) BETWEEN 1 AND 9
);

The query now returns the correct 23 rows:

Fixed execution plan

Unfortunately, the execution plan is less efficient now—it has an estimated cost of 0.885 compared to 0.365 for the earlier plan.

Redundant grouping

Let’s try adding a redundant GROUP BY instead of changing the HAVING clause:

SELECT
    P.[Name]
FROM Production.Product AS P
WHERE EXISTS 
(
    SELECT
        *
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.ProductID = P.ProductID
    GROUP BY
        TH.ProductID
    HAVING 
        COUNT_BIG(*) < 10
);

We now get correct results (23 rows), and this is the execution plan:

Execution plan

I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly! The addition of a redundant GROUP BY clause has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier.

Group by nothing

What’s more (in SQL Server 2008 and later), we can replace the odd-looking redundant GROUP BY with an explicit GROUP BY on the empty set:

SELECT
    P.[Name]
FROM Production.Product AS P
WHERE EXISTS 
(
    SELECT
        *
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.ProductID = P.ProductID
    GROUP BY
        ()
    HAVING 
        COUNT_BIG(*) < 10
);

I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too). Whichever way you prefer, it’s rather satisfying to note that the result of the subquery does not exist for a particular correlated value where a vector aggregate is used. Remember, a scalar COUNT aggregate always returns a value, even if that value is zero, so the subquery always produces a result, regardless of the ProductID being evaluated.

The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more):

`WHERE` Clause

SELECT
    P.[Name]
FROM Production.Product AS P
WHERE 
(
    SELECT
        COUNT_BIG(*) 
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.ProductID = P.ProductID
    GROUP BY 
        ()
) < 10;

`APPLY`

SELECT
    P.[Name]
FROM Production.Product AS P
CROSS APPLY
(
    SELECT
        NULL
    FROM Production.TransactionHistory AS TH
    WHERE
        TH.ProductID = P.ProductID
    GROUP BY
        ()
    HAVING
        COUNT_BIG(*) < 10
) AS CA (Dummy);

Subquery

SELECT
    Q1.[Name]
FROM
(
    SELECT
        P.[Name],
        cnt = 
        (
            SELECT 
                COUNT_BIG(*) 
            FROM Production.TransactionHistory AS TH
            WHERE
                TH.ProductID = P.ProductID 
            GROUP BY
                ()
        )
    FROM Production.Product AS P
) AS Q1
WHERE
    Q1.cnt < 10;

`SUM`

This last example uses SUM(1) instead of COUNT and does not require a vector aggregate:

SELECT
    Q.[Name]
FROM
(
    SELECT
        P.[Name],
        cnt = 
        (
            SELECT
                SUM(1)
            FROM Production.TransactionHistory AS TH
            WHERE
                TH.ProductID = P.ProductID
        )
    FROM Production.Product AS P
) AS Q
WHERE
    Q.cnt < 10;

The execution plan is:

SUM(1) plan

Remember the SUM of no items (the empty set) is NULL. When the cnt subquery evaluates to NULL, the predicate Q.cnt < 10 will return UNKNOWN, not TRUE. The net result is that products with no history rows will not be returned.

Final thoughts

The semantics of SQL Server aggregates are rather odd in places. It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates.

As we have seen, query plans do not show in which ‘mode’ an aggregate is running. Getting it wrong can cause poor performance, wrong results, or both.

Thanks for reading.