Improving Incremental Loads with Change Data Capture
In SQL Server, change data capture offers an effective solution to the challenge of efficiently performing incremental loads from source tables to data marts and data warehouses.
What is Change Data Capture?
Source tables change over time. A data mart or data warehouse that is based on those tables needs to reflect these changes. However, a process that periodically copies a snapshot of the entire source consumes too much time and resources. Alternate approaches that include timestamp columns, triggers, or complex queries often hurt performance and increase complexity. What is needed is a reliable stream of change data that is structured so that it can easily be applied by consumers to target representations of the data. Change data capture in SQL Server provides this solution.
The change data capture feature of the Database Engine captures insert, update, and delete activity applied to SQL Server tables, and makes the details of the changes available in an easily consumed, relational format. The change tables used by change data capture contain columns that mirror the column structure of the tracked source tables, along with the metadata needed to understand the changes that have occurred on a row-by-row basis.
Note
Change data capture is available only in SQL Server 2008 Enterprise, Developer, and Evaluation editions.
For more information: Change Data Capture
How Change Data Capture Works in Integration Services
An Integration Services package can easily harvest the change data in SQL Server databases to perform efficient incremental loads to a data warehouse. However, before you can use Integration Services to load change data, an administrator must enable change data capture on the database and on the tables from which you want to capture changes. For more information on how to configure change data capture on a database, see Configuring Change Data Capture.
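For reference, enabling change data capture is a Transact-SQL task that an administrator performs once per database and once per tracked table. The following is a minimal sketch, assuming an illustrative source table named dbo.Customer and a gating role named cdc_reader (both names are placeholders for your own objects):

-- Enable change data capture at the database level (run in the source database).
EXEC sys.sp_cdc_enable_db;
GO

-- Enable change data capture on the source table. Setting
-- @supports_net_changes = 1 also creates the net changes query function,
-- which requires a primary key or unique index on the table.
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Customer',
    @role_name            = N'cdc_reader',
    @supports_net_changes = 1;
GO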
Once an administrator has enabled change data capture on the database, you can create a package that performs an incremental load of the change data. The following diagram shows the steps for creating a package that performs an incremental load from a single table:
As shown in the previous diagram, creating a package that performs an incremental load of the change data involves the following steps:
Step 1: Designing the Control Flow
In the control flow of the package, the following tasks need to be defined:
Calculate the starting and ending datetime values for the interval of changes to the source data that you want to retrieve.
To calculate these values, use an Execute SQL task or Integration Services expressions with datetime functions. You then store these endpoints in package variables for use later in the package.
For more information: Specifying an Interval of Change Data
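As a sketch, an Execute SQL task might compute the endpoints with a query like the following and map the single-row result set to two package variables. The variable names ExtractStartTime and ExtractEndTime are illustrative; in practice, the starting value usually comes from the ending value that the previous run saved.

-- Compute an illustrative 24-hour interval that ends at the current time.
SELECT DATEADD(hour, -24, GETDATE()) AS ExtractStartTime,
       GETDATE()                     AS ExtractEndTime;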
Determine whether the change data for the selected interval is ready. This step is necessary because the asynchronous capture process might not yet have reached the selected endpoint.
To determine whether the data is ready, use a For Loop container to delay execution, if necessary, until the change data for the selected interval is ready. Inside the loop container, use an Execute SQL task to query the time mapping tables maintained by change data capture. Then, use a Script task that calls the Thread.Sleep method, or another Execute SQL task with a WAITFOR statement, to delay the execution of the package temporarily, if necessary. Optionally, use another Script task to log an error condition or a timeout.
For more information: Determining Whether the Change Data Is Ready
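A minimal sketch of the readiness check that the Execute SQL task inside the loop might run is shown below. The ? parameter is mapped to the interval end time stored in a package variable, and the single-column result is mapped back to a variable that the For Loop container evaluates.

-- cdc.lsn_time_mapping is maintained by the capture process. If no transaction
-- has yet been captured with an end time later than the interval endpoint, the
-- change data for the interval may still be incomplete, so the package should
-- wait (for example, with WAITFOR DELAY '00:00:30') and check again.
IF NOT EXISTS (SELECT tran_end_time
               FROM cdc.lsn_time_mapping
               WHERE tran_end_time > ?)
    SELECT 0 AS DataReady;   -- not ready: delay and loop again
ELSE
    SELECT 1 AS DataReady;   -- ready: exit the loop and continue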
Prepare the query string that will be used to query for the change data.
Use a Script task or an Execute SQL task to assemble the SQL statement that will be used to query for changes.
For more information: Preparing to Query for the Change Data
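The assembled statement simply passes the two interval endpoints to the table-valued function that is created in the next step. As a sketch, the string stored in the package variable might resolve to something like the following, where dbo.uf_Customer and the literal datetime values are illustrative:

SELECT * FROM dbo.uf_Customer('2008-01-01 00:00:00', '2008-01-02 00:00:00');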
Step 2: Setting Up the Query for Change Data
Create the table-valued function that will query for the data.
Use SQL Server Management Studio to develop and save the query.
For more information: Retrieving and Understanding the Change Data
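The following is a minimal sketch of such a function. It assumes that the capture instance created for the source table is named dbo_Customer and that net changes were enabled when the table was configured for change data capture; the function maps the datetime endpoints to log sequence numbers and returns the net changes for the interval.

CREATE FUNCTION dbo.uf_Customer (
    @start_time datetime,
    @end_time   datetime
)
RETURNS TABLE
AS
RETURN (
    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_Customer(
             -- Map the datetime endpoints to the LSN range for the interval.
             sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @start_time),
             sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time),
             N'all')
);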
Step 3: Designing the Data Flow
In the data flow of the package, the following tasks need to be defined:
Retrieve the change data from the change tables.
To retrieve the data, use a source component to query the change tables for the changes that fall within the selected interval. The source calls a Transact-SQL table-valued function that you must have previously created.
For more information: Retrieving and Understanding the Change Data
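Each row that the source retrieves carries change data capture metadata columns alongside the mirrored source columns. A sketch of the shape of the result, using hypothetical source columns, is shown below; the interval endpoints are supplied from the package variables.

-- __$operation identifies the kind of change, and __$update_mask indicates which
-- source columns changed (it is populated only with the 'all with mask' row
-- filter option). CustomerID, CustomerName, and EmailAddress stand in for the
-- columns of the tracked source table.
SELECT __$operation, __$update_mask, CustomerID, CustomerName, EmailAddress
FROM dbo.uf_Customer(?, ?);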
Split the changes into inserts, updates, and deletes for processing.
To split the changes, use a Conditional Split transformation to direct inserts, updates, and deletes to different outputs for appropriate processing.
For more information: Processing Inserts, Updates, and Deletes
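In a net changes query that uses the 'all' row filter option, the __$operation column distinguishes the change types: rows with a value of 1 are deletes, 2 are inserts, and 4 are updates. The Conditional Split conditions therefore typically test the __$operation column and route each value to its own output.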
Apply the inserts, deletes, and updates to the destination.
To apply the inserts, use a destination component. To apply the updates and deletes, use OLE DB Command transformations with parameterized UPDATE and DELETE statements. Alternatively, use destination components to save the update and delete rows to temporary tables, and then use Execute SQL tasks to perform bulk update and bulk delete operations against the destination from those temporary tables.
For more information: Applying the Changes to the Destination
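As a sketch, the two OLE DB Command transformations might run parameterized statements like the following against a hypothetical destination table named dbo.DimCustomer; each ? parameter is mapped to an input column from the corresponding Conditional Split output. Because the OLE DB Command transformation issues one statement per row, the temporary-table approach described above usually scales better for large volumes of changes.

-- Run once per row on the update output.
UPDATE dbo.DimCustomer
SET    CustomerName = ?,
       EmailAddress = ?
WHERE  CustomerID = ?;

-- Run once per row on the delete output.
DELETE FROM dbo.DimCustomer
WHERE  CustomerID = ?;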
Working with Change Data from Multiple Tables
The process outlined in the previous diagram and steps involves an incremental load from a single table. When you have to perform an incremental load from multiple tables, the overall process is the same. However, the design of the package needs to change to accommodate the processing of multiple tables. For more information on how to create a package that performs an incremental load from multiple tables, see Performing an Incremental Load of Multiple Tables.
Viewing a Complete Change Data Capture Package
Integration Services provides two samples that demonstrate how to use change data capture in packages. For more information, see the following topics:
External Resources
Blog entry, SSIS Design Pattern – Incremental Load, on sqlblog.com