Wednesday 2 August 2023

Importing a File in Batches

There are a million ways to import data into SQL Server. Most of the time, we want to ingest the new data as quickly and efficiently as possible, but that’s not always the case.

Sometimes, we need to accept data at a rate that will not dominate resource usage on the target system or cause excessive transaction log growth. In other cases, each row from the data source needs specific server-side processing to validate and persist the data across multiple relational tables, perhaps involving foreign keys and identity columns.

All this can be achieved with client-side tools and programming. It can also be done server-side by importing the raw data into a staging table before processing using T-SQL procedures.

Other times, the need arises to ingest data without using client-side tools and without making a complete copy of the raw data on the server. This article describes one possible approach in that situation.

Example Data Export

I’m going to use the Product and TransactionHistory tables from the AdventureWorks sample database. Each product and its associated transaction history will be serialized in JSON format.

To make the data extract easier, create the following helper view:

CREATE VIEW dbo.ProductTransactionExport
WITH SCHEMABINDING
AS
SELECT
    Product =
    (
        SELECT
            P.[Name],
            P.ProductNumber,
            P.Color,
            P.Size
        FOR JSON PATH
    ),
    History =
    (
        SELECT
            TH.ReferenceOrderID,
            TH.ReferenceOrderLineID,
            TH.TransactionDate,
            TH.TransactionType,
            TH.Quantity,
            TH.ActualCost
        FROM Production.TransactionHistory AS TH
        WHERE
            TH.ProductID = P.ProductID
        FOR JSON PATH
    )
FROM Production.Product AS P;

Export the data to a convenient location using bcp:

bcp AdventureWorks2022.dbo.ProductTransactionExport out
    ProductTransactionExport.dat -w -q -S .\SQL2022 -T

This creates a 32.1 MB wide character export file in a second or two.

Target Tables

I’m going to import the extracted data into two new tables. The target Product table has an identity column missing from the source data. The target TransactionHistory table has a foreign key to that identity column:

CREATE TABLE dbo.Product
(
    ProductID integer IDENTITY NOT NULL PRIMARY KEY, -- new
    [Name] nvarchar(50) NOT NULL, 
    ProductNumber nvarchar(25) NOT NULL, 
    Color nvarchar(15) NULL, 
    Size nvarchar(5) NULL
);
GO
CREATE TABLE dbo.TransactionHistory
(
    ProductID integer NOT NULL
        REFERENCES dbo.Product, -- foreign key
    ReferenceOrderID integer NOT NULL,
    ReferenceOrderLineID integer NOT NULL,
    TransactionDate datetime NOT NULL,
    TransactionType nchar(1) NOT NULL,
    Quantity integer NOT NULL,
    ActualCost money NOT NULL
);

The new tables are in the dbo schema. The original tables remain as they were in the Production schema.

Data Import

The export file contains two columns of JSON-formatted data. The first step of the import process is to create a view with a compatible structure:

CREATE VIEW dbo.ProductTransactionImport
WITH SCHEMABINDING
AS
SELECT
    Product = CONVERT(nvarchar(max), N'')
        COLLATE Latin1_General_100_BIN2,
    History = CONVERT(nvarchar(max), NULL)
        COLLATE Latin1_General_100_BIN2;

Data can’t be inserted directly into this view, of course. If we try, SQL Server returns an error message:

Msg 4406, Level 16, State 1
Update or insert of view or function 'dbo.ProductTransactionImport' 
failed because it contains a derived or constant field.

To insert data via the view, we need an INSTEAD OF INSERT trigger.

The Trigger

To keep things simple to start with, the following trigger processes the import one row at a time.

For each row, the code:

  • Unpacks the JSON-formatted Product data and writes it to the target table.
  • If related TransactionHistory records are present:
    • Saves the product identity value.
    • Unpacks the JSON-formatted history records.
    • Stores the history records using the saved identity value.

The code is quite straightforward:

CREATE OR ALTER TRIGGER ProductTransactionImport_IOI
ON dbo.ProductTransactionImport
INSTEAD OF INSERT
AS
IF ROWCOUNT_BIG() = 0 RETURN;
SET NOCOUNT, XACT_ABORT ON;
SET STATISTICS XML OFF;

DECLARE 
    @ProductID integer,
    @Product nvarchar(max),
    @History nvarchar(max);

-- Top (2) allows us to detect multi-row inserts efficiently
SELECT TOP (2)
    @Product = I.Product, 
    @History = I.History 
FROM Inserted AS I;

IF ROWCOUNT_BIG() > 1 
BEGIN 
    THROW 50000, 'Only single row inserts are allowed.', 1;
END;

-- Add the product record
INSERT dbo.Product 
    ([Name], ProductNumber, Color, Size)
SELECT
    P.*
FROM OPENJSON(@Product COLLATE Latin1_General_100_BIN2)
    WITH 
    (
        [Name] nvarchar(50), 
        ProductNumber nvarchar(25), 
        Color nvarchar(15), 
        Size nvarchar(5)
    ) AS P;

IF @History IS NOT NULL
BEGIN
    SET @ProductID = SCOPE_IDENTITY();

    -- Add the related history rows
    INSERT dbo.TransactionHistory
    (
        ProductID, 
        ReferenceOrderID, 
        ReferenceOrderLineID, 
        TransactionDate, 
        TransactionType, 
        Quantity, 
        ActualCost
    )
    SELECT
        @ProductID,
        H.*
    FROM OPENJSON(@History COLLATE Latin1_General_100_BIN2)
        WITH 
        (
            ReferenceOrderID integer,
            ReferenceOrderLineID integer,
            TransactionDate datetime,
            TransactionType nchar(1),
            Quantity integer,
            ActualCost money
        ) AS H;
END;

Performing the Import

We can import the file in batches using BULK INSERT. The trigger is currently only set up to handle a single row at a time, so we’ll start with single-row batches:

SET NOCOUNT ON;
SET STATISTICS XML OFF;

BULK INSERT dbo.ProductTransactionImport
FROM 'C:\Temp\ProductTransactionExport.dat'
WITH
(
    DATAFILETYPE = 'widechar',
    FIRE_TRIGGERS,
    BATCHSIZE = 1
);

The FIRE_TRIGGERS option is required to fire the INSTEAD OF INSERT trigger. Without it, SQL Server attempts to insert into the view directly, so the command fails with the error message shown previously. That error is a useful defence against forgetting to specify the option.

The BATCHSIZE = 1 parameter tells SQL Server to process each line of the file using a separate transaction.

The net effect is to stream the file into SQL Server, firing the import processing trigger for each import line. A hidden temporary rowset is created to support the inserted trigger alias, but it only ever contains a single row.

The BULK INSERT command above successfully imports 504 Product rows and 113,443 TransactionHistory rows in just under two seconds. Each product entry receives a new identity value, the history rows are correctly linked, and the foreign key is marked as trusted.
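As a quick sanity check after a run like this, something along the following lines confirms the row counts and that the foreign key is trusted (the expected counts are from the run described above):

```sql
-- Sanity check: row counts in the target tables
SELECT
    Products = (SELECT COUNT_BIG(*) FROM dbo.Product),
    HistoryRows = (SELECT COUNT_BIG(*) FROM dbo.TransactionHistory);

-- Confirm the foreign key is trusted (is_not_trusted = 0)
SELECT
    FK.[name],
    FK.is_not_trusted
FROM sys.foreign_keys AS FK
WHERE FK.parent_object_id = OBJECT_ID(N'dbo.TransactionHistory');
```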

This solution can be expanded to allow arbitrarily complex processing of per-row data, including adding a WAITFOR delay or other throttling mechanism to deliver trickle inserts to a highly contended system.
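For example, a simple throttle could be added at the end of the trigger body. The half-second delay below is purely illustrative; tune it (or substitute a more sophisticated mechanism) to suit the target system:

```sql
-- Illustrative throttle: pause after processing each import row
-- to trickle inserts into a busy system. Place at the end of the
-- trigger body and adjust the delay as needed.
WAITFOR DELAY '00:00:00.500';
```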

Larger Batch Sizes

The single-row batch size makes for a simple trigger implementation but may add a fair amount of overhead to the task as a whole. In the example, the effect isn’t too pronounced because each import row is quite large and may contain a decent number of target history rows in the JSON array.

In other circumstances, you may prefer to use a larger batch size in the BULK INSERT statement. This means using a cursor or set-based approach inside the trigger.

The cursor implementation is a straightforward extension of the single row code, driven from the inserted alias. Sadly, SQL Server doesn’t support a dynamic cursor on inserted so the cursor will make a copy of the current batch of rows in tempdb. This is the second copy of the batch since one was created for the rowset supporting the inserted alias, which is a little unfortunate. Still, the batch will normally not be too large. It might still be better than making a copy of the entire imported data all at once in a staging table, depending on your goals and priorities.
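A sketch of that cursor-driven variant is shown below. The loop body would contain the same per-row logic as the single-row trigger; only the outline is shown here:

```sql
-- Sketch: per-row processing driven by a cursor over inserted.
-- The loop body would repeat the single-row trigger logic.
DECLARE
    @Product nvarchar(max),
    @History nvarchar(max);

DECLARE BatchCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT I.Product, I.History
    FROM Inserted AS I;

OPEN BatchCursor;

FETCH NEXT FROM BatchCursor INTO @Product, @History;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Per-row processing here, as in the single-row trigger:
    -- insert the product, capture SCOPE_IDENTITY(),
    -- then insert any related history rows.

    FETCH NEXT FROM BatchCursor INTO @Product, @History;
END;

CLOSE BatchCursor;
DEALLOCATE BatchCursor;
```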

A set-based approach would work the same as a single large import, which normally involves a temporary table and a MERGE statement to capture assigned identity values in a mapping table. Again, the benefit is this processing occurs per batch rather than for the entire import all at once. I’ll show an example of this next.

Multi-row Trigger

An implementation of a multi-row trigger using MERGE and a mapping table:

-- Trigger capable of handling multi-row batches
CREATE OR ALTER TRIGGER ProductTransactionImport_IOI
ON dbo.ProductTransactionImport
INSTEAD OF INSERT
AS
IF ROWCOUNT_BIG() = 0 RETURN;
SET NOCOUNT, XACT_ABORT ON;
SET STATISTICS XML OFF;

-- Capture current batch with a surrogate key
-- This step can be skipped if the import contains
-- a *reliable* key
SELECT
    RowID = IDENTITY(integer, 1, 1),
    I.Product, 
    I.History 
INTO #Batch
FROM Inserted AS I;

-- Used to map surrogates to assigned identity values
CREATE TABLE #Map 
(
    RowID integer NOT NULL PRIMARY KEY,
    ProductID integer NOT NULL
);

-- Add a batch of products
WITH I AS
(
    SELECT 
        B.RowID,
        P.*
    FROM #Batch AS B
    CROSS APPLY OPENJSON(B.Product COLLATE Latin1_General_100_BIN2)
    WITH 
    (
        [Name] nvarchar(50), 
        ProductNumber nvarchar(25), 
        Color nvarchar(15), 
        Size nvarchar(5)
    ) AS P
)
MERGE dbo.Product AS P
USING I ON 0 = 1
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Name], ProductNumber, Color, Size)
    VALUES (I.[Name], I.ProductNumber, I.Color, I.Size)
OUTPUT 
    -- Capture surrogate to identity mapping
    I.RowID, Inserted.ProductID 
    INTO #Map (RowID, ProductID);

-- Add a batch of history rows
INSERT dbo.TransactionHistory 
    WITH (TABLOCK)
    (
        ProductID, 
        ReferenceOrderID, 
        ReferenceOrderLineID, 
        TransactionDate, 
        TransactionType, 
        Quantity, 
        ActualCost
    )
SELECT
    M.ProductID,
    H.*
FROM #Batch AS B
CROSS APPLY OPENJSON(B.History COLLATE Latin1_General_100_BIN2)
    WITH 
    (
        ReferenceOrderID integer,
        ReferenceOrderLineID integer,
        TransactionDate datetime,
        TransactionType nchar(1),
        Quantity integer,
        ActualCost money
    ) AS H
JOIN #Map AS M
    ON M.RowID = B.RowID;

The following BULK INSERT processes the import in batches of 100 rows:

-- Reset target tables
TRUNCATE TABLE dbo.TransactionHistory;
DELETE dbo.Product;

SET NOCOUNT ON;
SET STATISTICS XML OFF;

BULK INSERT dbo.ProductTransactionImport
FROM 'C:\Temp\ProductTransactionExport.dat'
WITH
(
    DATAFILETYPE = 'widechar',
    FIRE_TRIGGERS,
    BATCHSIZE = 100
);

This code also correctly imports and links all data from the file.

Clean up by dropping the new tables and views:

DROP VIEW IF EXISTS 
    dbo.ProductTransactionExport,
    dbo.ProductTransactionImport;

DROP TABLE IF EXISTS 
    dbo.TransactionHistory, 
    dbo.Product;
