Optimizing MERGE Statement Performance
In SQL Server 2008, you can perform multiple data manipulation language (DML) operations in a single statement by using the MERGE statement. For example, you may need to synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table. Typically, this is done by executing a stored procedure or batch that contains individual INSERT, UPDATE, and DELETE statements. However, this means that the data in both the source and target tables are evaluated and processed multiple times; at least once for each statement.
By using the MERGE statement, you can replace the individual DML statements with a single statement. This can improve query performance because the operations are performed within a single statement, therefore, minimizing the number of times the data in the source and target tables are processed. However, performance gains depend on having correct indexes, joins, and other considerations in place. This topic provides best practice recommendations to help you achieve optimal performance when using the MERGE statement.
Index Best Practices
To improve the performance of the MERGE statement, we recommend the following index guidelines:
Create an index on the join columns in the source table that is unique and covering.
Create a unique clustered index on the join columns in the target table.
These indexes ensure that the join keys are unique and the data in the tables is sorted. Query performance is improved because the query optimizer does not need to perform extra validation processing to locate and update duplicate rows and additional sort operations are not necessary.
For example, in the following MERGE statement the source table, dbo.Purchases, and the target table, dbo.FactBuyingHabits, are joined on the columns ProductID and CustomerID. To improve the performance of this statement, you would create a unique or primary key index (clustered or nonclustered) on the ProductID and CustomerID columns on the dbo.Purchases table and a clustered index on the ProductID and CustomerID columns on the dbo.FactBuyingHabits table. To view the code used to create these tables, see Inserting, Updating, and Deleting Data by Using MERGE.
MERGE dbo.FactBuyingHabits AS Target
USING (SELECT CustomerID, ProductID, PurchaseDate FROM dbo.Purchases) AS Source
ON (Target.ProductID = Source.ProductID AND Target.CustomerID = Source.CustomerID)
WHEN MATCHED THEN
UPDATE SET Target.LastPurchaseDate = Source.PurchaseDate
WHEN NOT MATCHED BY TARGET THEN
INSERT (CustomerID, ProductID, LastPurchaseDate)
VALUES (Source.CustomerID, Source.ProductID, Source.PurchaseDate)
OUTPUT $action, Inserted.*, Deleted.*;
JOIN Best Practices
To improve the performance of the MERGE statement and ensure correct results are obtained, we recommend the following join guidelines:
- Specify only search conditions in the ON <merge_search_condition> clause that determine the criteria for matching data in the source and target tables. That is, specify only columns from the target table that are compared to the corresponding columns of the source table. Do not include comparisons to other values such as a constant.
To filter out rows from the source or target tables, use one of the following methods.
Specify the search condition for row filtering in the appropriate WHEN clause. For example, WHEN NOT MATCHED AND S.EmployeeName LIKE 'S%' THEN INSERT....
Define a view on the source or target that returns the filtered rows and reference the view as the source or target table. If the view is defined on the target table, any actions against it must satisfy the conditions for updating views. For more information about updating data by using a view, see Modifying Data Through a View.
Use the WITH <common table expression> clause to filter out rows from the source or target tables. This method is similar to specifying additional search criteria in the ON clause and may produce incorrect results. We recommend that you avoid using this method or test thoroughly before implementing it.
For more information see, Inserting, Updating, and Deleting Data by Using MERGE.
Query Optimization of Joins
The join operation in the MERGE statement is optimized in the same way as a join in a SELECT statement. That is, when SQL Server processes joins, the query optimizer chooses the most efficient method (out of several possibilities) of processing the join. For more information about joins, see Join Fundamentals and Advanced Query Tuning Concepts. When the source and target are of similar size and the index guidelines described previously in the 'Index Best Practices' section are applied to the source and target tables, a merge join operator is the most efficient query plan. This is because both tables are scanned once and there is no need to sort the data. When the source is smaller than the target table, a nested loops operator is preferable.
You can force the use of a specific join by specifying the OPTION (<query_hint>) clause in the MERGE statement. We recommend that you do not use the hash join as a query hint for MERGE statements because this join type does not use indexes. For more information about query hints, see Query Hints (Transact-SQL). The following example specifies a nested loop join in the OPTION clause.
USE AdventureWorks;
GO
BEGIN TRAN;
MERGE Production.ProductInventory AS pi
USING (SELECT ProductID, SUM(OrderQty)
FROM Sales.SalesOrderDetail AS sod
JOIN Sales.SalesOrderHeader AS soh
ON sod.SalesOrderID = soh.SalesOrderID
AND soh.OrderDate BETWEEN '20030701' AND '20030731'
GROUP BY ProductID) AS src(ProductID, OrderQty)
ON (pi.ProductID = src.ProductID)
WHEN MATCHED AND pi.Quantity - src.OrderQty >= 0
THEN UPDATE SET pi.Quantity = pi.Quantity - src.OrderQty
WHEN MATCHED AND pi.Quantity - src.OrderQty <= 0
THEN DELETE
OUTPUT $action, Inserted.*, Deleted.*
OPTION (LOOP JOIN);
GO
ROLLBACK TRAN;
Parameterization Best Practices
If a SELECT, INSERT, UPDATE, or DELETE statement is executed without parameters, the SQL Server query optimizer may choose to parameterize the statement internally. This means that any literal values that are contained in the query are substituted with parameters. For example, the statement INSERT dbo.MyTable (Col1, Col2) VALUES (1, 10), may be implemented internally as INSERT dbo.MyTable (Col1, Col2) VALUES (@p1, @p2). This process, called simple parameterization, increases the ability of the relational engine to match new SQL statements with existing, previously-compiled execution plans. Query performance may be improved because the frequency of query compilations and recompilations are reduced. The query optimizer does not apply the simple parameterization process to MERGE statements. Therefore, MERGE statements that contain literal values may not perform as well as individual INSERT, UPDATE, or DELETE statements because a new plan is compiled each time the MERGE statement is executed.
To improve query performance, we recommend the following parameterization guidelines:
Parameterize all literal values in the ON <merge_search_condition> clause and in the the WHEN clauses of the MERGE statement. For example, you can incorporate the MERGE statement into a stored procedure replacing the literal values with appropriate input parameters.
If you cannot parameterize the statement, create a plan guide of type TEMPLATE and specify the PARAMETERIZATION FORCED query hint in the plan guide. For more information, see Specifying Query Parameterization Behavior by Using Plan Guides.
If MERGE statements are executed frequently on the database, consider setting the PARAMETERIZATION option on the database to FORCED. Use caution when setting this option. The PARAMETERIZATION option is a database-level setting and affects how all queries against the database are processed. For more information, see Forced Parameterization.
TOP Clause Best Practices
In the MERGE statement, the TOP clause specifies the number or percentage of rows that are affected after the source table and the target table are joined, and after rows that do not qualify for an insert, update, or delete action are removed. The TOP clause further reduces the number of joined rows to the specified value and the insert, update, or delete actions are applied to the remaining joined rows in an unordered fashion. That is, there is no order in which the rows are distributed among the actions defined in the WHEN clauses. For example, specifying TOP (10) affects 10 rows; of these rows, 7 may be updated and 3 inserted, or 1 may be deleted, 5 updated, and 4 inserted and so on.
It is common to use the TOP clause to perform data manipulation language (DML) operations on a large table in batches. When using the TOP clause in the MERGE statement for this purpose, it is important to understand the following implications.
I/O performance may be affected.
The MERGE statement performs a full table scan of both the source and target tables. Dividing the operation into batches reduces the number of write operations performed per batch; however, each batch will perform a full table scan of the source and target tables. The resulting read activity may affect the performance of the query.
Incorrect results can occur.
It is important to ensure that all successive batches target new rows or undesired behavior such as incorrectly inserting duplicate rows into the tartget table can occur. This can happen when the source table includes a row that was not in a target batch but was in the overall target table.
To insure correct results:
Use the ON clause to determine which source rows affect existing target rows and which are genuinely new.
Use an additional condition in the WHEN MATCHED clause to determine if the target row has already been updated by a previous batch.
Because the TOP clause is only applied after these clauses are applied, each execution either inserts one genuinely unmatched row or updates one existing row. The following example creates a source and target table and then presents the correct method of using the TOP clause to modify the target in batch operations.
CREATE TABLE dbo.inventory(item_key int NOT NULL PRIMARY KEY, amount int, is_current bit) GO CREATE TABLE dbo.net_changes(item_key int NOT NULL PRIMARY KEY, amount int) GO MERGE TOP(1) dbo.inventory USING dbo.net_changes ON inventory.item_key = net_changes.item_key WHEN MATCHED AND inventory.is_current = 0 THEN UPDATE SET amount += net_changes.amount, is_current = 1 WHEN NOT MATCHED BY TARGET THEN INSERT (item_key, amount, is_current) VALUES(item_key, amount, 1) OUTPUT deleted.*, $action, inserted.*; GO
The following example demonstrates an incorrect method of implementing the TOP clause. The check of the is_current column is specified in the join condition with the source table. This means that a source row used in one batch will be treated as "not matched" in the next batch, resulting in an unwanted insert operation.
MERGE TOP(1) dbo.inventory USING dbo.net_changes ON inventory.item_key = net_changes.item_key AND inventory.is_current = 0 WHEN MATCHED THEN UPDATE SET amount += net_changes.amount, is_current = 1 WHEN NOT MATCHED BY TARGET THEN INSERT (item_key, amount, is_current) values(item_key, amount, 1) OUTPUT deleted.*, $action, inserted.*; GO
The following example also demonstrates an incorrect method. By using a common table expression (CTE) to restrict the number of rows read for the batch, any source row which would have matched a target row other than the one selected by the TOP(1) is treated as "not matched", resulting in an unwanted insert operation. In addition, this method only limits the number of rows that can be updated; each batch will try to insert all of the "unmatched" source rows.
WITH target_batch AS ( SELECT TOP(1) * FROM dbo.inventory WHERE is_current = 0 ) MERGE target_batch USING dbo.net_changes ON target_batch.item_key = net_changes.item_key WHEN MATCHED THEN UPDATE SET amount += net_changes.amount, is_current = 1 WHEN NOT MATCHED BY TARGET THEN INSERT (item_key, amount, is_current) values(item_key, amount, 1) OUTPUT deleted.*, $action, inserted.*; GO
Bulk Load Best Practices
The MERGE statement can be used to efficiently bulk load data from a source data file into a target table by specifying the OPENROWSET(BULK…) clause as the table source. By doing so, the entire file is processed in a single batch.
To improve the performance of the bulk merge process, we recommend the following guidelines:
Create a clustered index on the join columns in the target table.
Use the ORDER and UNIQUE hints in the OPENROWSET(BULK…) clause to specify how the source data file is sorted.
By default, the bulk operation assumes the data file is unordered. Therefore, it is important that the source data is sorted according to the clustered index on the target table and that the ORDER hint is used to indicate the order so that the query optimizer can generate a more efficient query plan. Hints are validated at runtime; if the data stream does not conform to the specified hints, an error is raised.
These guidelines ensure that the join keys are unique and the sort order of the data in the source file matches the target table. Query performance is improved because additional sort operations are not necessary and unnecessary data copies are not required. The following example uses the MERGE statement to bulk load data from StockData.txt, a flat file, into the target table dbo.Stock. By defining a primary key constraint on StockName in the target table, a clustered index is created on the column used to join with the source data. The ORDER and UNIQUE hints are applied to the Stock column in the data source, which maps to the clustered index key column in the target table.
Before running this example, create a text file named 'StockData.txt' in the folder C:\SQLFiles\. The file should have two columns of data separated by a comma. For example, use the following data.
Alpine mountain bike,100
Brake set,22
Cushion,5
Next, create an xml format file named 'BulkloadFormatFile.xml' in the folder C:\SQLFiles\. Use the following information.
<?xml version="1.0"?>
<BCPFORMAT xmlns="https://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="25"/>
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="5"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="Stock" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="2" NAME="Delta" xsi:type="SQLSMALLINT"/>
</ROW>
</BCPFORMAT>
USE AdventureWorks;
GO
CREATE TABLE dbo.Stock (StockName nvarchar(50) PRIMARY KEY, Qty int CHECK (Qty > 0));
GO
MERGE dbo.Stock AS s
USING OPENROWSET (
BULK 'C:\SQLFiles\StockData.txt',
FORMATFILE = 'C:\SQLFiles\BulkloadFormatFile.xml',
ROWS_PER_BATCH = 15000,
ORDER (Stock) UNIQUE) AS b
ON s.StockName = b.Stock
WHEN MATCHED AND (Qty + Delta = 0) THEN DELETE
WHEN MATCHED THEN UPDATE SET Qty += Delta
WHEN NOT MATCHED THEN INSERT VALUES (Stock, Delta);
GO
Measuring and Diagnosing MERGE Performance
The following features are available to assist you in measuring and diagnosing the performance of MERGE statements.
Use the merge stmt counter in the sys.dm_exec_query_optimizer_info dynamic management to return the number of query optimizations that are for MERGE statements.
Use the merge_action_type attribute in the sys.dm_exec_plan_attributes dynamic management function to return the type of trigger execution plan used as the result of a MERGE statement.
Use SQL Trace to gather troubleshooting data for the MERGE statement in the same way you would for other data manipulation language (DML) statements. For more information, see Introducing SQL Trace.