Wednesday, 28 July 2010

Partitioning and the Common Subexpression Spool

Introduction

SQL Server 2005 introduced the OVER clause to enable partitioning of rowsets before applying a window function. This post looks at how this feature may require a query plan containing a ‘common subexpression spool’. This query plan construction is required whenever an aggregate window function or the NTILE ranking window function is used.

Example

To illustrate, here is a simple query based on the AdventureWorks sample database.

The AdventureWorks product warehouse is organised into shelves, with multiple bins per shelf. Each bin can hold several different products. We have been asked to produce a report with the following (partial) output:

Sample output

Notice that the total_in_bin column contains the sum of product quantities, partitioned by shelf and bin. We can meet the requirements using a query featuring the OVER clause:

SELECT
    INV.Shelf,
    INV.Bin,
    INV.ProductID,
    INV.Quantity,
    total_in_bin = 
        SUM(INV.Quantity) OVER (
            PARTITION BY INV.Shelf, INV.Bin)
FROM Production.ProductInventory AS INV
WHERE
    INV.Shelf BETWEEN 'A' AND 'C'
ORDER BY
    INV.Shelf,
    INV.Bin;

That produces the correct results, with this interesting query plan:

Execution plan

A Special Kind of Spool

The spools shown in the above plan are very similar to those you will find in wide (per-index) update plans, where a set of data changes is saved once, and replayed to multiple single-index-updating consumers. This kind of spool is known as a Common Subexpression Spool, and the particular variety here is called a Segment Spool. A Segment Spool co-operates with its child Segment iterator to save and replay one group at a time, as defined by the PARTITION BY clause.

The Segment Spool iterator always appears as the immediate parent of a Segment iterator. The two leaf-level Table Spool iterators shown in the plan are secondary spools, which only replay rows saved by the primary spool. The way this query plan produces its results is perhaps not entirely intuitive (until you are familiar with the pattern) so we’ll explore it a step at a time.

The Execution Plan

The Segment iterator (see my previous post) detects when a row arriving from the Index Seek belongs to a new group (as specified in the PARTITION BY clause).

The Segment adds a new column to the row that is used as a flag: true if the current row represents the first in a group, false otherwise. The new column has an auto-generated name; in my case it was named [Segment1003] and I’ll refer to it by that label from here on in.

Like all spools, the Segment Spool saves a copy of the rows it receives to a worktable backed by tempdb. Unlike a regular spool though, the saved rows are available to multiple consumers, and it processes a group at a time, clearing its worktable between groups.

The Segment Spool lazily writes rows into its worktable, until the start of a new group is signalled. Once the Segment Spool has a complete group in its worktable, one row (not the whole group) is returned to its parent (the top-level Nested Loops operator in this example).

The data values stored in this one row are not important; they do not contribute to the final result. The point is that this single row is received on the outer input of the parent Nested Loops iterator. This causes the iterator to execute its inner input once per group.

Per Group Processing

Executing the inner input causes the spooled rows (one complete group, remember) to be replayed into the Stream Aggregate, which computes the SUM of values in the Quantity column for the group.

This gives us the SUM(Quantity) value we need for the current group. The problem is that the Stream Aggregate is destructive; the rows that contributed to the sum are no longer present in the flow, so we can’t just add the result column directly.

The solution is to replay the saved group rows for a second time, from the secondary Table Spool at the lowest level of the plan. The replayed rows are joined with the single column result of the Stream Aggregate, producing the original rows plus a column containing the group sum (in every row) - exactly what is needed.

This set of rows is passed back up to the top-level Nested Loops operator, which joins it with the ‘dummy’ row produced by the Segment Spool to begin with.

This process is repeated for every group, building up the final output a group at a time. It’s important to realise that it is the top-level Nested Loops iterator that builds the final output—the Segment Spool clears its worktable between groups.

The Final Group

You might be wondering how the Segment Spool ensures that the final group is processed. There is no first row of a following group to set the [Segment1003] flag, so how does this last group end up in the query output?

The answer is that the Segment Spool emits an extra dummy row when it reaches the end of its input. This extra row triggers the Nested Loops iterator to execute its inner input one last time, so picking up the final group.

There are some of interesting side-effects to this arrangement:

In general, the Index Seek produces rows which partition into n groups, so the extra dummy row means that the Segment Spool will pass n+1 rows to its parent.
The iterators on the second level of the plan (the level that contains the Stream Aggregate) will also execute n+1 times.
The Stream Aggregate will produce n rows, and the lowest secondary spool will also execute n times.

This is particularly interesting when the Index Seek produces no rows at all (n = 0). The Segment iterator also produces no rows, but the Segment Spool produces one row, and the second-level plan iterators also execute once as a result.

You can demonstrate this for yourself by running the sample query for shelf ‘Z’ and examining the actual row counts between iterators.

RANK, DENSE_RANK & NTILE and
Optimised per-index maintenance plans both by Craig Freedman