

Large Scale DW/BI Design Considerations

One of the most challenging aspects of large-scale data warehouse development is the sheer size of the data. All sorts of problems appear once the data warehouse grows: performance degradation, running out of storage space, long-running queries, blocking, and failing to keep up with incoming data feeds.

Over the years of working on terabyte-scale DW projects, I have compiled a list of tips that have helped me manage that size effectively. I hope they will be beneficial to you as well. Here we go:

Keep transactions small: The goal is to reduce blocking and avoid deadlocks. In my own implementations there is almost no use of BEGIN TRAN ... COMMIT TRAN. Sometimes there is an obvious place where you can split things up to keep transactions small. For example, say a data source contains both new dimension data and fact data; the two loads can usually go into two separate transactions. It is not always easy to break a transaction into individual statements, though, since doing so may affect data accuracy. For example, suppose you have an ETL workitem queue table that keeps track of which raw data has been processed. To avoid double counting, the insert into the fact table and the update that marks the workitem complete in the queue table usually belong to the same transaction. Separating them can still be achieved, and that leads to my next tip, fault-tolerant and robust design.
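For illustration, here is a minimal T-SQL sketch of the dimension/fact split; the table and column names (StagingSource, DimCustomer, FactSales) are made up. Each statement commits on its own, so a slow fact load never holds locks on the dimension table:

    -- Statement 1: add any new dimension members (its own implicit transaction).
    INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName)
    SELECT s.CustomerKey, s.CustomerName
    FROM dbo.StagingSource AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                      WHERE d.CustomerKey = s.CustomerKey);

    -- Statement 2: load the fact rows in a second, independent transaction.
    INSERT INTO dbo.FactSales (CustomerKey, SaleDate, Amount)
    SELECT s.CustomerKey, s.SaleDate, s.Amount
    FROM dbo.StagingSource AS s;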

Fault-tolerant and robust design: This is a design principle that calls for a different approach to schema design and implementation. With the goal of keeping transactions small and short, the idea is to touch only one table at a time. Using the workitem queue table and fact table example again, without a transaction the logic goes like this (a T-SQL sketch follows below):

1. Mark the workitem processing start time

2. Delete any rows in the fact table associated with the workitem

3. Process the raw data

4. Insert the new rows from the workitem into the fact table

5. Mark the workitem processing end time

The trick is in step 2, where we delete any leftover, half-written rows associated with the workitem in the fact table. In a normal run there is nothing to delete and the step is a no-op. If there is a failure between steps 4 and 5, however, reprocessing the same workitem lets step 2 remove the previous insertion, so we can reprocess and insert into the fact table without double counting.
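Here is a minimal sketch of the pattern as a single procedure. Everything in it is hypothetical: a WorkItemQueue table keyed by WorkItemId, a RawStaging table holding the raw rows, and a FactSales table that records which workitem each row came from.

    CREATE PROCEDURE dbo.ProcessWorkItem
        @WorkItemId int
    AS
    BEGIN
        -- 1. Mark the workitem processing start time.
        UPDATE dbo.WorkItemQueue
        SET ProcessStartTime = GETUTCDATE()
        WHERE WorkItemId = @WorkItemId;

        -- 2. Remove any half-finished rows from a previous, failed attempt.
        --    On a clean run this deletes nothing and is effectively a no-op.
        DELETE FROM dbo.FactSales
        WHERE WorkItemId = @WorkItemId;

        -- 3 and 4. Process the raw data and insert the new fact rows.
        INSERT INTO dbo.FactSales (WorkItemId, CustomerKey, SaleDate, Amount)
        SELECT @WorkItemId, r.CustomerKey, r.SaleDate, r.Amount
        FROM dbo.RawStaging AS r
        WHERE r.WorkItemId = @WorkItemId;

        -- 5. Mark the workitem processing end time.
        UPDATE dbo.WorkItemQueue
        SET ProcessEndTime = GETUTCDATE()
        WHERE WorkItemId = @WorkItemId;
    END;

If the procedure dies anywhere in the middle, you can simply run it again for the same workitem and end up with exactly one copy of the data.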

Putting everything in one transaction is an easy implementation, and it might pass test, integration, and PPE and run smoothly in production. But one day, after several months in production, you begin to see long blocking and even deadlocks. The time you spend troubleshooting can be several times more expensive than implementing it differently up front would have been. I will explain why this can happen in the next tip, load as much data as possible in the test phase.

Load as much data as possible in the test phase: When designed and implemented correctly, the ETL time for the same amount of raw data should remain constant. However, it is common to see this ETL time grow gradually, or suddenly fall off a cliff. Common symptoms are index fragmentation, blocking, and plan changes, but they all stem from the same root cause: the underlying data changes. Your data may be seasonal, with some values becoming rare or frequent enough to shift the statistics. The data may grow too large for the same access method, a loop join in a plan may turn into a hash or merge join, and your bottleneck can move among disk, memory, and CPU. The scenarios go on and on, so allocate more time for your load testing and use as much data as possible during the dev/test phases.
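One habit that helps here (just a sketch, and FactSales is again a made-up table name) is to watch index fragmentation on the big tables while the load test runs, so you catch gradual degradation before production does:

    -- Check average fragmentation on the indexes of the fact table.
    SELECT i.name AS IndexName,
           ps.avg_fragmentation_in_percent,
           ps.page_count
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.FactSales'), NULL, NULL, 'LIMITED') AS ps
    JOIN sys.indexes AS i
        ON i.object_id = ps.object_id
       AND i.index_id = ps.index_id;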

Watch the batch size: This is orthogonal to the first tip. While we want to keep transactions small by not touching several tables in the same transaction, we also want to make sure each transaction is limited to a manageable batch size. Besides reducing blocking, a reasonable batch size also helps relieve pressure on the log file size.
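As a sketch of what I mean by a manageable batch size (the table name is hypothetical and the 10,000-row figure is arbitrary; tune it for your own system), a large purge can be broken into small, individually committed chunks:

    -- Purge old fact rows in small batches so no single transaction
    -- holds locks for long or bloats the log.
    DECLARE @BatchSize int;
    SET @BatchSize = 10000;

    WHILE 1 = 1
    BEGIN
        DELETE TOP (@BatchSize)
        FROM dbo.FactSales
        WHERE SaleDate < '20080101';

        IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
    END;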

Use partitioned tables: This is one of the best new SQL Server features for DW development. By using a rolling sliding window, you can effectively manage a vast amount of data. Kimberly Tripp has written an excellent white paper on this topic at https://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx. The only thing I want to add is that you should make the sliding window size configurable. When your data source grows, the DW may no longer be able to accommodate the same number of partitions. Simply decreasing the partition-count parameter passed to the procedure that manages the sliding window gets the disk space back in a split second.
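To give an idea of what I mean by configurable, here is a very rough sketch of a trim procedure that takes the number of partitions to keep as a parameter. All object names (pfFactDate, FactSales, FactSalesArchive) are made up, and the real pattern (staging table constraints, aligned indexes, filegroup handling) is covered in Kimberly's paper:

    CREATE PROCEDURE dbo.TrimSlidingWindow
        @PartitionsToKeep int
    AS
    BEGIN
        DECLARE @OldestBoundary datetime;

        -- While the fact table still has more partitions than requested...
        WHILE (SELECT COUNT(*)
               FROM sys.partitions
               WHERE object_id = OBJECT_ID('dbo.FactSales')
                 AND index_id IN (0, 1)) > @PartitionsToKeep
        BEGIN
            -- Find the oldest boundary value of the partition function.
            SELECT @OldestBoundary = MIN(CONVERT(datetime, prv.value))
            FROM sys.partition_range_values AS prv
            JOIN sys.partition_functions AS pf
                ON pf.function_id = prv.function_id
            WHERE pf.name = 'pfFactDate';

            -- Empty the (pre-created, identically structured) archive table;
            -- in real life you would archive its contents first.
            TRUNCATE TABLE dbo.FactSalesArchive;

            -- Switch the oldest partition out, then merge its boundary so
            -- the space can be reclaimed.
            ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSalesArchive;
            ALTER PARTITION FUNCTION pfFactDate() MERGE RANGE (@OldestBoundary);
        END;
    END;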

Use compression: When I applied page-level compression to the fact table, I saw up to a 66% saving in disk space without observing any change in overall performance. Some people may even see an I/O performance increase, since less disk is read to retrieve the same amount of information. Read more at https://msdn.microsoft.com/en-us/library/cc280449.aspx and https://msdn.microsoft.com/en-us/library/cc280464.aspx.
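If you want to gauge the benefit before committing to it, SQL Server 2008 can estimate the savings for you; a quick sketch against a hypothetical fact table:

    -- Estimate how much space page compression would save.
    EXEC sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'FactSales',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';

    -- If the estimate looks good, rebuild the table with page compression.
    ALTER TABLE dbo.FactSales
    REBUILD WITH (DATA_COMPRESSION = PAGE);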

Use indexed views for aggregation: An indexed view is materialized on disk, and its rollups are updated whenever the underlying fact data changes. This feature keeps the rollups current without a separate process to manage or recalculate them. The only thing that needs to be done is to define and create the indexed view. Read more at https://msdn.microsoft.com/en-us/library/ms187864.aspx.
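A minimal sketch of a daily rollup as an indexed view; the table and columns are hypothetical, and note the SCHEMABINDING and COUNT_BIG(*) requirements (you also cannot SUM a nullable column, so Amount is assumed NOT NULL):

    -- The view must be schema-bound and, because it aggregates, include COUNT_BIG(*).
    CREATE VIEW dbo.vFactSalesDaily
    WITH SCHEMABINDING
    AS
    SELECT SaleDate,
           SUM(Amount)  AS TotalAmount,
           COUNT_BIG(*) AS RowCnt
    FROM dbo.FactSales
    GROUP BY SaleDate;
    GO

    -- The unique clustered index is what actually materializes the view.
    CREATE UNIQUE CLUSTERED INDEX IX_vFactSalesDaily
    ON dbo.vFactSalesDaily (SaleDate);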

Design for scale-out: The two common reasons for scaling out are processing throughput and storage space. Unfortunately, most people find out they need to scale out when it is too late, after months in production. At that awkward moment, it is very hard to relieve the pressure without a major overhaul if the system was never designed for scaling out. So always keep scale-out in mind when you architect the DW. For example, if your data source is file based, it is better to use a UNC file path than a local drive-letter path. The same idea applies to your destination fact table: it can live on different servers when you scale out. Also, try to use procedures and avoid referencing base tables or views directly. With the abstraction of a procedure interface, data processing and retrieval can come from different servers, and inserts can scale out to different locations without breaking the interface.
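One cheap way to keep that procedure-level abstraction (a sketch; the synonym, linked server, and table names are made up) is to have procedures write through a synonym, which can later be repointed at another server without touching any calling code:

    -- Today the fact table is local.
    CREATE SYNONYM dbo.FactSalesTarget FOR dbo.FactSales;
    GO

    -- Procedures insert through the synonym, never the base table directly.
    CREATE PROCEDURE dbo.InsertFactSales
        @CustomerKey int, @SaleDate datetime, @Amount money
    AS
    INSERT INTO dbo.FactSalesTarget (CustomerKey, SaleDate, Amount)
    VALUES (@CustomerKey, @SaleDate, @Amount);
    GO

    -- Later, when scaling out, repoint the synonym at a remote server
    -- (via a linked server) and the procedure interface stays the same.
    DROP SYNONYM dbo.FactSalesTarget;
    CREATE SYNONYM dbo.FactSalesTarget
    FOR [RemoteDWServer].[DW].[dbo].[FactSales];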

It's the hottest day I have ever had in Seattle, with the high around 95°F. (Don't laugh, Texans. :)) I heard it could be triple-digit, record-setting heat tomorrow. So leave a comment if I mumble too much or if you think my head is overheated. :)

Keep it cool and happy coding!

Mike Chang

 

SQL SDE,

MSDN