Transforming data in a data warehouse through SQL views
This post is an add on to the another post titled Designing an ETL process with SSIS: two approaches to extracting and transforming data where Phill Devey responded with the following question:
With regard to your statement " With staging tables, transformations are implemented as database views" Are you suggesting that your Dimension and Fact "tables" will be views or do you mean the query that populates the Dimension and Fact tables will be based on a view?
That’s a good question, and let me explain what I’ve found most practical.
Initially, when I create a new dimension or fact table, I simply create it as a view. Doing so allows me to quickly develop, test and debug the transformation in SSMS. If performance is not an issue, I might even deploy it this way into production.
In practice, once the transformation seems to work well, what I usually do is rename the view and create a stored procedure which creates a table using the original view. Because the new table has the same name as the original view, nothing breaks. Here’s what the stored procedure might look like:
1: CREATE PROCEDURE [dbo].[sp_AccountFact] 2: AS 3: BEGIN 4: SET NOCOUNT ON; 5: 6: 7: SELECT * 8: INTO #MyTempTable 9: FROM dbo.[Account Fact View] 10: 11: BEGIN TRANSACTION MY_TRANS 12: 13: IF EXISTS ( 14: SELECT [name] 15: FROM [sys].[tables] 16: WHERE name='Account Fact' 17: ) 18: BEGIN 19: DROP TABLE tabular.[Account Fact] 20: END; 21: 22: SELECT * 23: INTO tabular.[Account Fact] 24: FROM #MyTempTable 25: 26: COMMIT TRANSACTION MY_TRANS 27: 28: CREATE UNIQUE NONCLUSTERED INDEX [Account key] ON [tabular].[Account Fact] 29: ( 30: [Account key] ASC 31: )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] 32: 33: CREATE COLUMNSTORE INDEX [NonClusteredColumnStoreIndex-20130801-145157] 34: ON [tabular].[Account Fact]( 35: -- Add columns here 36: ); 37: 38: END
In this example, the stored procedure [dbo].[sp_AccountFact] does the heavy lifting of recreating the [tabular].[Account Fact] table. You would call this stored procedure during the transform phase of your ETL process.
The view with the transformation logic (which was previously called [tabular].[Account Fact] ) is now called [dbo].[Account Fact View] . The content of this view is first copied into a temporary table using the SELECT INTO statement. On line 11 we start a transaction so that the entire update is atomic. In lines 13-20 the existing table is deleted if it exists, and then in line 22 the table is recreated from the temporary table we created before, again using the SELECT INTO statement. Then the transaction is committed. Finally we recreate the indexes. Note that the second index in this example is a column store index which greatly improves the performance of most ad hoc queries.
The transaction ensures that clients performing queries on the data warehouse get valid results while the ETL is running. However, those clients might get blocked until the transaction completes. Therefore it’s important to minimize the duration of the transaction. This is why we first copy the view into a temporary table before the transaction starts. We also recreate the indexes outside the transaction for this reason.
Another design decision is whether to drop the table and recreate it – as I do here – or to just truncate the table. If we would just truncate the table, the stored procedure would become even simpler because we don’t need to check if the table already exists before dropping/truncating it, and we don't need to recreate the indexes. On the other hand, a benefit of using the drop-and-recreate method is that we only need to maintain the table’s schema in one place, i.e. in the view. In the truncate scenario we would need to update the schema of the table every time we modified the view. A second benefit is that it simplifies the usage of column store indexes. In SQL Server 2012, you cannot modify the contents of a table which has a column store index. (I believe this constraint no longer exists in SQL Server 2014, but I have not yet verified that myself).
Note: another good practice is to use schemas. In this data warehouse, the dimensional tables are in a schema called “tabular”. I also typically create a separate schema for each source system and its staging tables. The dbo schema has “internal” objects which are important for the ETL but are not visible to normal users. Normal users of the data warehouse only need read access to the tabular schema to perform the OLAP queries of their desire.