Tuesday, 19 October 2010

Sequence Tables

Introduction

It is frequently useful to generate sequences of values within SQL Server, perhaps for use as surrogate keys. Using the IDENTITY property on a column is the easiest way to automatically generate such sequences:

CREATE TABLE dbo.SomeTable
(
    row_id integer IDENTITY PRIMARY KEY,
    [data] sql_variant NOT NULL,
);

Sometimes though, the database designer needs a more flexible scheme than that provided by the IDENTITY property. One alternative is to use a Sequence Table.

What is a Sequence Table?

A Sequence Table is a regular table that stores the next available value for one or more sequences––somewhat like a manual identity value in many respects.

The definition of a Sequence Table might look like this:

CREATE TABLE dbo.SequenceTable
(
    sequence_name nvarchar(20) PRIMARY KEY,
    next_value integer NOT NULL,
);

The table contains one row for each sequence range under management. One Sequence Table can maintain any number of independent sequences. Typically, though, the table only contains a few tens of rows.

The first column represents the sequence name, and the second represents the next unallocated value in the range. A sequence is initialized with a statement like:

INSERT dbo.SequenceTable
    (sequence_name, next_value)
VALUES
    (N'Test Sequence', 1);

The values obtained from the sequence table can be used for any purpose, but the most common is to use them as key values for rows in other tables.

Values are allocated from the sequence through a stored procedure that encapsulates the logic required to extend the sequence, and provides an easy way to manage permissions. Generally, users have no permissions on the Sequence Table itself, to protect the integrity of the sequence.

The Allocation Stored Procedure

The most common use of a Sequence Table is in pre-allocating a range of keys, to make a batch process more efficient.

A process that needs to add a large number of new rows to a table can be made to perform better by batching the requests, and submitting each batch in a single transaction. To make this work, we need to know the key values in advance.

Say we want to reserve a contiguous range of 250 keys from the Test Sequence defined above. We might call a stored procedure like this:

DECLARE
    -- Procedure return code
    @RC integer,
    -- The first key value allocated to the caller
    @Value integer;

-- Call the procedure to allocate
-- a range of 250 contiguous keys
EXECUTE @RC = dbo.Allocate_TSQL
    @SequenceName = N'Test Sequence',
    @RangeSize = 250,
    @FirstAllocated = @Value OUTPUT;

The procedure returns the first sequence value reserved, so if we receive the value 1 back, we can safely use key values from 1 to 250 inclusive, without calling the procedure again. No other process can obtain key values from our reserved range.

After this procedure call, the Sequence Table looks like this:

An implementation of the procedure itself might look like this:

CREATE PROCEDURE dbo.Allocate_TSQL
    -- The name of the sequence to allocate keys from
    @SequenceName   nvarchar(20),
    -- The number of keys to allocate
    @RangeSize      integer = 1,
    -- The first key allocated (output)
    @FirstAllocated integer OUTPUT
AS
BEGIN
    -- Most errors will abort the batch
    SET XACT_ABORT ON;
    -- Supress 'x row(s) affected' messages
    SET NOCOUNT ON;
    -- Reset rowcount
    SET ROWCOUNT 0;

    -- Validate the range size requested
    IF (@RangeSize IS NULL OR @RangeSize < 1)
    BEGIN
            RAISERROR
            (
                '@RangeSize must be a positive integer (found %i)',
                16, 1,
                @RangeSize
            );
            RETURN  999;
    END;

    -- Initialize the output parameter
    SET @FirstAllocated = NULL;

    -- Update the row associated with @SequenceName, returning the 
    -- current value, and then incrementing it by @RangeSize

    UPDATE dbo.SequenceTable
        WITH (READCOMMITTEDLOCK)
    SET @FirstAllocated = next_value,
        next_value = next_value + @RangeSize
    WHERE
        sequence_name = @SequenceName;

    -- If @Allocated has a non-NULL value, we know we successfully updated a row
    RETURN
        CASE
            WHEN (@FirstAllocated IS NOT NULL) 
            THEN 0 
            ELSE -999 
        END;
END;

The important part of the procedure is the UPDATE statement:

The output parameter @FirstAllocated is set to the pre-update value of the next_value column.
The next_value column then has @RangeSize added to its value, thereby reserving a contiguous range of values for the procedure caller.

As a reminder, recall that the caller checks the value returned in @FirstAllocated knowing that the next @RangeSize keys after that have been reserved exclusively for their use.

Where to Use a Sequence Table

Generally, you will want to use an identity column, if it fits the application requirements. Only consider a Sequence Table where some aspect of the IDENTITY property makes it unsuitable.

A Sequence Table might be the right choice where you need:

To generate a custom sequence of a type not directly supported by identity columns
To maintain a sequence of key values over more than one table
A guarantee that a multi-row insert will receive a contiguous block of sequence values
To pre-allocate a range of sequence values for performance reasons

Cases 1 and 2 are perhaps obvious. Between them, they cover custom sequences that cannot conveniently be implemented with an identity column.

Case 3 is quite interesting. If concurrent queries insert multiple rows to the same table with an identity column, SQL Server does not guarantee that the identity values allocated will be contiguous for each caller. In fact, the asynchronous nature of identity value allocations means that the values assigned will typically not be contiguous. This can be important to some applications, and there is more about this later on, in the section on concurrency.

Case 4 covers a class of question often seen on the forums. The identity value assigned to a row is only available after the insert has completed. Frequently, it would be more efficient to know the value before the insert is performed.

Consider a process which must add a number of related objects to the database. One option is to insert one object at a time, retrieve the identity value, and then use that value in subsequent inserts. If the application were able to pre-allocate the values, the insert operations could all be performed in a single batch, and a single transaction. This could prove to have very significant performance benefits.

This article is concerned primarily with case 4 (pre-allocation of ranges). The Sequence Table method can be used to solve problems in all four cases.

Concurrency and Deadlocking

A common objection to using the Sequence Table design is that it might encounter concurrency problems.

Although the process of allocating a new range is atomic and pretty fast, like any data modification it acquires an exclusive row-level lock on the Sequence Table, which will be held until the containing transaction completes.

Imagine a process (A) starting a transaction and requesting a new range of key values from the database via the Sequence Table allocation stored procedure:

The allocation completes very quickly (typically in a fraction of a millisecond), but process (A) goes on to perform another operation before committing its transaction.
As luck would have it, this second operation encounters a delay of some sort. Process (A) continues to hold an exclusive lock on a Sequence Table row until it completes its transaction.
Meanwhile, a second process (B) also requests a range of key values, for the same sequence name as process (A). This request needs to update the same row that process (A) has locked exclusively.
Process B is therefore blocked until process A completes its transaction. This limitation means that solutions based on a Sequence Table historically often suffered from an unacceptable level of deadlocks.

In the example above, if process B already held a lock which process A must wait to acquire, we have a deadlock—neither process can make progress since it is waiting for a lock which is incompatible with one already held by the opposing process.

How SQL Server does it

At this point, you might be wondering how SQL Server avoids this sort of problem when working with identity columns. How does SQL Server avoid these kinds of concurrency and deadlocking problems when users add multiple rows concurrently to the same table?

The answer is that SQL Server allocates identity values using an internal function that does not participate in user transactions. This is the reason that the current identity value for a table is not reset when a user transaction rolls back – the change in identity value was not part of the aborted transaction. Another way to look at this is to say that the allocation of identity values is performed using an ‘autonomous’ transaction – one that exists independently of any user transaction.

SQL Server does briefly lock an internal resource while updating the current identity value, but the lock is transient and held just long enough to prevent data corruption. The autonomous transaction is therefore very fast, and does not cause concurrency problems.

This autonomous behaviour is also the reason that two users concurrently executing a multi-row insert to the same table typically do not receive a contiguous range of identity values. Because identity allocations do not block, a single INSERT statement can be given multiple small ranges of identity values, which do not form a single contiguous sequence.

It appears that the way to avoid concurrency problems is to use an autonomous transaction. Unfortunately, SQL Server does not currently offer this facility – but it is possible to simulate it. The next section examines this in detail, since it is essential for a robust implementation of a Sequence Table.

Simulating Autonomous Transactions

The goal here is to find a way to allocate a range from the Sequence Table without holding long-term locks. This implies that we need to acquire our range in the context of a separate transaction. The most obvious way to achieve this is to use a separate database connection.

In SQL Server 2000, we could write an extended stored procedure to make a new connection to the database and update the Sequence Table on that connection. The exclusive lock on the Sequence Table row is held for a very short time on the second connection, and is unaffected by the open transaction on the first connection. The downside to this is that extended stored procedures are difficult to write, and may affect the stability of the SQL Server instance. Extended stored procedures were deprecated with the release of SQL Server 2005.

From SQL Server 2005, a CLR stored procedure can be used as a direct replacement for an extended stored procedure. The CLR hosted environment makes this option very safe to use within SQL Server, and accessible to anyone with some .NET coding skills.

In SQL Server 2008, there is a new way to achieve autonomous transactions in pure T-SQL. The method is fully documented and supported, and slightly outperforms the equivalent CLR implementation.

SQL Server 2005

A full implementation of a CLR stored procedure is included in the script at the end of this article. Here, we will just take a quick look at the main features of this solution.

The CLR stored procedure needs to make a new connection to the SQL Server instance, call the T-SQL allocation stored procedure on that connection, and then return the results. The first problem is that the CLR procedure needs to know the name of the SQL Server instance it was called from, in order to create a new connection to it.

We can find the information we need using the context connection—a built-in feature, which allows the CLR procedure to talk back to the database via the connection it was called on. It is important to note that we cannot use the context connection to call the T-SQL allocation stored procedure, since the context connection is associated with the still-open original transaction. If we did that, we would not solve any of the concurrency or deadlocking problems.

The code to get the name of the SQL Server instance we need to connect to looks like this:

// Use the context connection to discover the name of the calling SQL Server
using (SqlConnection conn = new SqlConnection("context connection=true"))
{
    conn.Open();
    using (SqlCommand cmd =
        new SqlCommand("SELECT SERVERPROPERTY(N'ServerName')", conn))
        {
        serverName = (string)cmd.ExecuteScalar();
        }
}

Now that we have the name of the SQL Server instance, we can make a new connection to it:

// Construct the connection string for the non-context connection
SqlConnectionStringBuilder csb = new SqlConnectionStringBuilder();
csb.DataSource = serverName;
csb.IntegratedSecurity = true;
csb.Enlist = false;

// Use a separate transaction scope to allocate the range
using (SqlConnection conn = new SqlConnection(csb.ConnectionString))

It is important to specify (enlist = false) in the connection string, to ensure that the new connection has no link to the original transaction. The rest of the code in the CLR stored procedure is just concerned with calling the original allocation routine and returning the results.

At first look, it seems as if this method might be inefficient. Each allocation from the Sequence Table requires a call back to the server to find its name, and the creation of a new connection, in addition to the cost of calling the Sequence Table’s allocation procedure. In practice though, the process is quite fast. The context connection already exists, and the new connection created is usually allocated from a pool. The whole process typically takes around 1ms to complete.

This method works well, but does require that CLR integration be enabled.

SQL Server 2008

In this release, SQL Server introduced a new option for linked servers, which allows a purely T-SQL approach to autonomous transactions. To use this new feature, we need to create a loopback linked-server – a linked server that points back to the same server it is created on.

We can then use the sp_serveroption system stored procedure, to set the new option remote proc transaction promotion to FALSE (or OFF). This setting prevents a local transaction from being promoted to a distributed transaction when making a call to the loopback linked-server.

The effect is that of an autonomous transaction – exactly what we need. Further details on this method can be found in a blog entry by the SQL Programmability & API Development Team. Again, a full implementation of this method can be found at the end of this article.

We will now look at the essential features of this approach. First, we need to create a loopback linked-server, and set the options. The following code snippet takes care of that:

-- Create the loopback linked server
DECLARE @servername sysname;

SET @servername = CONVERT(sysname, SERVERPROPERTY(N'ServerName'));

EXECUTE sys.sp_addlinkedserver
    @server = N'loopback', 
    @srvproduct = N'', 
    @provider = N'SQLNCLI', 
    @datasrc = @servername;

EXECUTE sys.sp_serveroption 
    @server = N'loopback', 
    @optname = 'RPC OUT', 
    @optvalue = 'ON';

-- Prevent a local transaction esclating to a distributed transaction (SQL Server 2008 only)
EXECUTE sys.sp_serveroption 
    @server = N'loopback', 
    @optname = 'remote proc transaction promotion',
    @optvalue = 'OFF';

Calling the allocation procedure is now very simple:

-- Call the procedure to allocate a range of 250 contiguous keys
EXECUTE @RC = loopback.tempdb.dbo.Allocate_TSQL
    @SequenceName = N'Test Sequence',
    @RangeSize = 250,
    @FirstAllocated = @Value OUTPUT;

Using this method, we can take full advantage of the flexibility the Sequence Table offers, while avoiding concurrency issues and deadlocks. Calling the allocation procedure in this way is typically a little faster than using the CLR implementation.

Testing and Results

A full test rig is included at the end. The script contains all the code necessary to test the behaviour of each of the three methods presented in this article: the original stored procedure, the CLR stored procedure, and the method using a loopback linked server. You will need a copy of SQL Server 2008 or later for best results. The test rig is written so that the SQL Server 2008 method will still work when run on SQL Server 2005, but you will not get the benefits of an autonomous transaction.

When run on SQL Server 2005, the loopback linked server method will hold long-term locks, and will be very much slower. The CLR stored procedure works well for both SQL Server 2005 and 2008 users.

The tests create a single test sequence and allocate a number of key values using each method, displaying the locks held by each open transaction.

The transaction wrapping each test is rolled back at the end, to show the effect on the Sequence Table. It is important to note that the non-blocking (CLR and loopback server) methods behave like an identity column in that the next value in the sequence is not affected by a roll back.

The output is reproduced and discussed below.

Original Allocation Stored Procedure

This implementation takes no special steps to avoid holding long-term locks. The locks held after allocating values from the test sequence are:

Locks held after allocation

An exclusive row-level lock is held on the Sequence Table, together with the normal higher-level intent-exclusive locks. This row-level lock will prevent other concurrent sessions allocating from the same sequence until the transaction commits or is rolled back.

This method is the fastest, but will exhibit poor concurrency, and may result in a high level of deadlocks, depending on the design of the calling application. This method is therefore not recommended.

2005 CLR Stored Procedure

This implementation uses a separate external connection to the database to avoid holding long-term locks. The locks held by this procedure are:

Locks held after CLR allocation

The only locks held are schema-stability locks associated with the CLR assembly. These locks prevent the assembly from being dropped or modified while it is in use, and do not affect Sequence Table concurrency or deadlocking.

The important fact is that there are no locks held on the Sequence Table itself and therefore no concurrency or deadlocking problems can occur.

This method is well suited to pre-allocating ranges of keys for an application.

2008 Loopback Linked Server

This method holds no locks at all after the allocation is made, confirming the autonomous transaction behaviour we were seeking. This method is slightly faster than the CLR solution.

Performance

On a test run of 10,000 (arbitrarily large) range allocations, the CLR method averages around 2ms per call, and the loopback server method averages around 1ms per call.

It is important to note though, that both methods are slower than using a column with the IDENTITY property. If you can use a simple identity column, you probably should.

Other Considerations

Page Splits: Pre-allocating key values can be an excellent performance optimization. If, however, you were relying on an identity column to control index page splits, you should be careful to consider the implications of switching to a Sequence Table method.

In general, the potential changes in page-splitting behaviour are not a great concern. As always though, it is important to test any proposed solution thoroughly, and the Sequence Table method is no exception to that rule.

Alternatives

Other potential solutions to the wider problem include using sequential GUIDs, the OUTPUT clause, or maintaining key sequences outside SQL Server.

GUIDs are relatively wide for use as keys, and some people find them less convenient to work with, compared to integers. This method also requires a good way of producing them in sequential order. Of course, there are scenarios where it makes sense to use GUIDs, sequential or otherwise.

Using the OUTPUT clause to return allocated identity values is can be problematic in batch scenarios. Using OUTPUT does not guarantee that the range of values assigned will be contiguous. SQL Server also does not typically return rows from the OUTPUT clause in any defined order. This leads on to problems mapping the returned identity values to the rows submitted for insertion. Nevertheless, there are occasions where this method can be made to work well.

You might also consider maintaining key values outside of SQL Server, possibly in a mid-tier component. One disadvantage is that the key values are then not available to any process running inside SQL Server. It might also be argued that key allocation, as a data entity, logically belongs in the database layer.

Conclusion

Properly implemented, a Sequence Table is a high-performing and robust solution for appropriate applications running on SQL Server 2005 or later. Anyone with a need to pre-allocate key values or to maintain a custom sequence should give a Sequence Table serious consideration.

Thanks for reading.

Scripts

These are now available on GitHub.