Business Intelligence: Building a Data Foundation for a BI Solution
In this article, I'll show you how to design and build a simple data mart to illustrate how you can use SQL Server 2008 Integration Services (SSIS) to perform ETL for your own BI solution.
Derek Comingore
At a Glance:
- Following the dimensional model to build a data mart
- Developing ETL processes and data maps
- Using Integration Services for ETL
- Creating an Integration Services project in BIDS
Code download available at: SampleCode2009_08.exe (151 KB)
Data is the most important part of any business intelligence (BI) solution. As explained in Stacia Misner's article "Planning Your First Microsoft BI Solution," getting data into your BI solution and maintaining it once it's there involves several steps. BI professionals refer to these steps as extract, transform, and load (ETL) processes. Even if you don't plan to pursue a career focused on BI, you can still take advantage of ETL techniques and tools to manage the data you need to inform the day-to-day decisions you make in your job. In this article, I'll show you how to design and build a simple data mart to illustrate how you can use SQL Server 2008 Integration Services (SSIS) to perform ETL for your own BI solution.
Understanding Requirements
As with any IT project, the best way to start an ETL project is to understand the general requirements of the BI solution you want to build and then work your way back to the data to determine how best to support those requirements. The first article in the series set the stage for the BI solution the fictional company Adventure Works needs by describing the analytical requirements as several questions the company needs to answer. These questions highlight the need for Adventure Works to understand its product sales from different perspectives: profitability by distribution channel (resellers or the Internet), the change in demand for products over time, and the difference between actual sales and sales forecasts by product, salesperson, geographic region and reseller type. The answers to these questions will help Adventure Works decide which distribution channel the company should focus on to increase profits, how to adapt its manufacturing processes to best meet demand, and how changes in sales strategy might help the company meet its sales goals. You'll see how the data supports answering these business questions when you add SQL Server Reporting Services (SSRS) to the BI solution.
Before I start designing the data mart for Adventure Works with these requirements in mind, I want to model the information needs from a business perspective. In other words, a data mart's design should be based on the way the users will ask their questions rather than the way the data comes from the data sources.
You'll need the SQL Server 2008 Adventure Works OLTP sample database for the code samples found in this article.
Using the Dimensional Model
A data mart is typically built using a dimensional model design, which is a database schema well suited for analysis. (A great resource for learning about dimensional modeling is kimballgroup.com.) A dimensional model presents data in a way that's familiar to users, and it helps you build data structures that are optimized for querying high volumes of data. You can achieve this optimization by denormalizing the data, which helps the database engine rapidly select and efficiently aggregate a vast amount of data at query time. In the denormalized schema for the Adventure Works solution, I'll include two types of tables: dimension and fact. A dimension table stores information about business entities and objects, such as resellers or products. A fact table, which I'll use for the numerical sales data I need to aggregate, contains numerical measures and keys relating facts to the dimension tables. I'll explain more about fact tables later in this article.
You can implement the tables of a dimensional model in two types of schemas: star and snowflake. In simple terms, a star schema uses one table for each dimension, which a query in turn relates to a fact table with a single join. A snowflake schema uses two or more tables for each dimension and consequently requires more joins in a query to see all the data. This collection of cascading joins means that queries often run more slowly in a snowflake schema than in a star schema. For the purposes of this article, I'll keep the design simple and use a star schema.
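To make the star schema concrete, here is a minimal sketch of the kind of query it enables, written against the simplified table and column names I use later in this article (your own schema's names may differ). Each dimension reaches the fact table through a single join:

SELECT
    DD.CalendarYear
    ,DP.ProductName
    ,SUM(F.SalesAmount) AS TotalSales
FROM dbo.FactInternetSales F
    INNER JOIN dbo.DimProduct DP ON (F.ProductKey = DP.ProductKey)   -- one join per dimension
    INNER JOIN dbo.DimDate DD ON (F.OrderDateKey = DD.DateKey)
GROUP BY
    DD.CalendarYear
    ,DP.ProductName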
Creating a Bus Matrix
As part of the dimensional modeling process, I create a bus matrix to help identify the dimensions associated with sales, the focus of the BI solution for Adventure Works. Remember that Adventure Works has two sales channels: wholesale sales to resellers and individual sales through the Internet. I also use the bus matrix to identify the relationship of each dimension to one or both types of sales. Figure 1 shows my sample bus matrix for Adventure Works Sales.
Adventure Works Sales | Date | Product | Customer | Promotion | Geography | Reseller | Sales Territory | Employee | Currency
Internet Sales        | X    | X       | X        | X         | X         | X        |                 |          |
Reseller Sales        | X    | X       | X        | X         | X         | X        | X               |          |
Figure 1 Bus Matrix for Adventure Works Sales
My next step is to determine the measures for the solution. Measures are the numerical values that are required for analysis. They can come directly from the source, such as sales dollars or product costs, or be derived through a calculation, such as multiplying a quantity by a dollar amount for an extended sales amount. I also need to decide which attributes to include with each dimension. Attributes are the individual elements in a dimension (corresponding to columns in a table) that you can use to group or filter data for analysis, such as Country in the Sales Territory dimension or Year in the Date dimension. I won't detail every identified measure or dimensional attribute in this article—just realize that the identification process needs to take place.
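For example, an extended sales amount might be derived during extraction by multiplying quantity by unit price. Here's a quick illustration against the AdventureWorks2008 source; note that the table's own computed LineTotal column additionally factors in any discount:

SELECT
    SalesOrderID
    ,OrderQty
    ,UnitPrice
    ,OrderQty * UnitPrice AS ExtendedAmount   -- derived measure
FROM [Sales].[SalesOrderDetail]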
Creating a Data Map
Before creating the physical tables of the data mart, I need to do some additional planning. Specifically, I need to craft a data map document that maps each destination column in the data mart schema to the columns in the Adventure Works OLTP source system (the AdventureWorks2008 database that you can download and install as described in Stacia Misner's article on p. 31). You can use various applications to create a data map; the format isn't as important as the content. I like to develop data maps in Microsoft Office Excel. Figure 2 shows the DimProduct tab I created in my data map. I also created DimCustomer and FactInternetSales data maps. Each sheet in the workbook represents one of the tables in my data mart. On each sheet, I simply have two columns: one for the source column and one for the destination column.
Figure 2 DimProduct Data Mapping Tab
Each dimension table (except for the Date dimension) has a primary key known as a surrogate key (usually an identity column). One benefit of using surrogate keys is that you can combine data from multiple systems without the risk of duplicate keys.
The dimension tables also have an alternate key column. These alternate keys represent natural keys, sometimes called business keys. These are the source system's identifiers. For example, the CustomerAlternateKey column in the Customer dimension maps to the Adventure Works OLTP database field AccountNumber in the Sales.Customer table. By storing these keys in the dimension table, I have a way to match up records already in the dimension with records extracted from the source whenever I run an ETL process for each dimension.
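As a sketch of how the alternate key supports that matching, a simple dimension load might insert only the source customers that aren't already in the dimension. This assumes both databases live on the same server, and the column names follow the simplified schema I create later in this article (the dimension is trimmed to its alternate key for brevity):

INSERT INTO dbo.DimCustomer (CustomerAlternateKey)
SELECT C.AccountNumber
FROM AdventureWorks2008.Sales.Customer C
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.DimCustomer D
    WHERE D.CustomerAlternateKey = C.AccountNumber   -- match on the natural key
)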
Almost every data mart includes a Date dimension because business analysis often compares changes in measures by date, week, month, quarter, or year. The Date dimension almost never comes from a source system, so the reasons for using a SQL Server IDENTITY-based key don't apply. Instead, I'll use what is called a smart key, with a format of YYYYMMDD, stored as a SQL Server integer column. A smart key is a key generated from logic or scripts, as opposed to an auto-incrementing key such as an IDENTITY column in SQL Server.
Keep in mind that the Date dimension doesn't usually map to a source table. Instead, I'll use a script to generate the data to load records into the table.
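As a hedged sketch of such a script, a recursive common table expression can generate one row per calendar day, deriving the smart key from the date itself (the script included with the download may differ; column names follow the simplified Date dimension I create later in this article):

WITH Dates AS (
    SELECT CAST('2001-01-01' AS DATE) AS FullDate
    UNION ALL
    SELECT DATEADD(DAY, 1, FullDate)
    FROM Dates
    WHERE FullDate < '2008-12-31'
)
INSERT INTO dbo.DimDate (DateKey, FullDate, CalendarYear, CalendarMonth)
SELECT
    CONVERT(INT, CONVERT(CHAR(8), FullDate, 112))   -- style 112 renders YYYYMMDD
    ,FullDate
    ,YEAR(FullDate)
    ,MONTH(FullDate)
FROM Dates
OPTION (MAXRECURSION 0)   -- the date range exceeds the default 100-level recursion limit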
Because the ETL processes required for my small schema are fairly simple, my data map is fine as is. In a real-world project, I would annotate the data map to point out when complex transformations are required.
Building the Data Mart
Now that logical modeling is complete, I need to create the physical tables that the ETL processes will load, along with a database to host them. I'll use a basic T-SQL script to create both the database and its associated dimension and fact tables. You can find the entire T-SQL script in the accompanying download for the sample BI solution (2009 Code Downloads).
For the purposes of this article, I build only a subset of the entire Sales data mart schema so I can cover the entire ETL process in SSIS. In the smaller version of the schema, I include only the OrderQuantity and SalesAmount measures for the Internet Sales fact table. Additionally, my smaller schema includes a simplified version of the Customer, Product and Date dimension tables.
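As a rough sketch, the simplified schema looks something like the following; the column names here are my own shorthand and may not match the download script exactly:

CREATE DATABASE TECHNET_AW2008SalesDataMart
GO
USE TECHNET_AW2008SalesDataMart
GO
CREATE TABLE dbo.DimProduct (
    ProductKey INT IDENTITY(1,1) PRIMARY KEY      -- surrogate key
    ,ProductAlternateKey INT NOT NULL             -- ProductID in the source
    ,ProductName NVARCHAR(50) NOT NULL
)
CREATE TABLE dbo.DimCustomer (
    CustomerKey INT IDENTITY(1,1) PRIMARY KEY     -- surrogate key
    ,CustomerAlternateKey VARCHAR(10) NOT NULL    -- AccountNumber in the source
)
CREATE TABLE dbo.DimDate (
    DateKey INT PRIMARY KEY                       -- smart key, YYYYMMDD
    ,FullDate DATE NOT NULL
    ,CalendarYear INT NOT NULL
    ,CalendarMonth INT NOT NULL
)
CREATE TABLE dbo.FactInternetSales (
    ProductKey INT NOT NULL REFERENCES dbo.DimProduct (ProductKey)
    ,CustomerKey INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey)
    ,OrderDateKey INT NOT NULL REFERENCES dbo.DimDate (DateKey)
    ,OrderQuantity INT NOT NULL
    ,SalesAmount MONEY NOT NULL
)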
Developing ETL Processes
Designing and developing ETL processes is the next step in building a BI solution. To review, ETL includes all the technological processes by which data is extracted from data sources, transformed, and then loaded into a destination repository. Commonly, ETL processes within BI solutions extract data from flat files and OLTP operational databases, alter the data to fit a dimensional model (e.g., a star schema), and then load the resulting data into a data mart.
Creating an SSIS Project in BIDS
The first step in developing an ETL process is to create a new project in Business Intelligence Development Studio (BIDS). BIDS comes with SQL Server 2008 and installs when you select the Workstation Components option during the installation process. BIDS includes project templates for SSIS, SSAS and SSRS. It also supports source control integration, just as Visual Studio does.
To start BIDS, go to Start\Programs\Microsoft SQL Server 2008\Business Intelligence Development Studio and select File\New Project. You should see the New Project template shown in Figure 3.
Figure 3 New Project Template in BIDS 2008
Select Integration Services Project in the Templates pane, type ssis_TECHNET_AW2008 in the Name text box, and then click OK. BIDS should now display an open SSIS project.
Creating Common Data Connections
A nice feature of SSIS 2008 is the ability to create data source connections outside of individual packages. You can define a data source connection once and then reference it in one or more SSIS packages within the solution. To learn more about creating BIDS data sources, refer to "How to: Define a Data Source Using the Data Source Wizard (Analysis Services)".
Create two new data source connections: one for the TECHNET_AW2008SalesDataMart database and another for the AdventureWorks2008 OLTP database. Name the data source connections AW_DM.ds and AW_OLTP.ds, respectively.
Developing the ETL for Dimensions
The ETL to load the Product dimension is very simple. I need to extract the data from the Adventure Works Production.Product table and load the data into the TECHNET_AW2008SalesDataMart database. First, I need to rename the default package that BIDS created for my SSIS project. (A package is a container for all the steps in the workflow that SSIS will execute.) Right-click the default package in Solution Explorer, and select Rename. Type DIM_PRODUCT.dtsx and then press Enter.
Next, I need to create local package connection managers using the prebuilt data sources. Create two new connection managers referencing the data sources built earlier.
Defining a Data Flow to Extract and Load
SSIS includes a data flow task that encapsulates everything I need to implement the ETL for a simple dimension. I just drag a data flow task from the Toolbox onto the Control Flow designer's surface and rename the task to EL (for extract and load). Right-click the Data Flow task in the designer, and select Edit. BIDS now displays the Data Flow designer.
The extract portion of the Product dimension package needs to query the AdventureWorks2008 Production.Product table. To set up this task, I drag an OLE DB Source component from the Toolbox onto the Data Flow designer surface and then rename the OLE DB Source component to AW_OLTP.
Next, I define the load portion of the package, which writes to the data mart. I simply drag a new instance of the OLE DB Destination component onto the Data Flow designer surface and rename it AW_DM. I then click the OLE DB Source (AW_OLTP) component and drag the green arrow that appears on it to the AW_DM OLE DB Destination component to connect the two.
At this point, I've added the necessary components to the data flow, but I still need to configure each component so that SSIS knows how I want to extract and load the data. Right-click the AW_DM OLE DB Destination component, and select Edit. With the OLE DB Destination Editor open, I make sure AW_DM is selected as the OLE DB connection manager. Then I expand the Name of the table drop-down and select the dbo.DimProduct table. Finally, I click the Mappings tab to confirm that the mappings are correct, and then I click OK. This process is much easier when you have a data map ready for reference, especially if you're working with large tables. The Product dimension's ETL package is now complete.
You can execute the package inside BIDS easily. To test the Product dimension package, open the package and press F5.
Developing Additional Packages
I create the Customer dimension package in the same way I did the Product package, so I won't reiterate the steps; try to produce it on your own. Notice that this package uses an XML data-typed column (Person.Person.Demographics) in the source, which requires you to parse out the individual demographic attributes. To parse individual values from a SQL Server XML data-typed column, you can use an XQuery expression with the XML data type's native value() method. Name the finished package DIM_CUSTOMER.dtsx.
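For example, a query along these lines extracts individual attributes from the Demographics column. The namespace shown is the IndividualSurvey namespace that the AdventureWorks sample uses, but verify it against your copy of the database:

WITH XMLNAMESPACES (
    'http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey' AS ns
)
SELECT
    BusinessEntityID
    ,Demographics.value('(/ns:IndividualSurvey/ns:TotalPurchaseYTD)[1]', 'money') AS TotalPurchaseYTD
    ,Demographics.value('(/ns:IndividualSurvey/ns:Occupation)[1]', 'nvarchar(30)') AS Occupation
FROM [Person].[Person]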
Developing an SSIS package for the Date dimension is optional. Because this dimension doesn't usually have source data, the easiest way to load it is by using a basic T-SQL script. You can find the script I used in the completed solution.
Developing the Internet Sales Fact Table Package
The Internet Sales fact table package queries for all Internet Sales and returns those sales broken down by Product, Customer and Date (i.e., order date). Unlike a dimension package, a fact table package requires an additional step to look up the surrogate and smart keys in the corresponding dimension tables prior to loading the data into the fact table. You can create a new package and name it FACT_INTERNET_SALES.dtsx.
The extract portion of the package needs to query the AdventureWorks2008 OLTP database using the T-SQL code shown in Figure 4.
SELECT
    P.ProductID
    ,CONVERT(INT, CONVERT(CHAR(4), YEAR(H.OrderDate))
        + RIGHT('0' + CONVERT(VARCHAR(3), MONTH(H.OrderDate)), 2)
        + RIGHT('0' + CONVERT(VARCHAR(3), DAY(H.OrderDate)), 2)) AS OrderDateKey
    ,C.AccountNumber
    ,SUM(D.OrderQty) AS OrderQuantity
    ,SUM(D.LineTotal) AS SalesAmount
FROM
    [Sales].[SalesOrderDetail] D
    INNER JOIN [Sales].[SalesOrderHeader] H ON (D.SalesOrderID = H.SalesOrderID)
    INNER JOIN [Production].[Product] P ON (D.ProductID = P.ProductID)
    INNER JOIN [Sales].[Customer] C ON (H.CustomerID = C.CustomerID)
WHERE
    H.OnlineOrderFlag = 1
GROUP BY
    P.ProductID
    ,H.OrderDate
    ,C.AccountNumber
Figure 4 T-SQL Code for Internet Sales by Product, Date and Customer
Create a new data flow task on the package's control flow surface. Open the Data Flow designer, and create an OLE DB Source component. Name the component AW_OLTP, and use the query in Figure 4 as its source. This query produces an aggregation (sum) of the OrderQuantity and SalesAmount measures from the Adventure Works Sales tables.
Now you need to configure two lookup transformations. Drag two new instances of the Lookup Transformation component from the Toolbox to the Data Flow designer surface and name them Product and Customer. Configure the first one (Product) to look up the ProductKey in the Product dimension table by joining the AlternateKey of the dimension table to the incoming ProductID field from the AW_OLTP source query.
Configure the second one (Customer) to look up the CustomerKey in the Customer dimension table by joining the AlternateKey of the dimension table to the incoming AccountNumber field from the AW_OLTP source query.
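Conceptually, the two lookups do the equivalent of the following joins before the load. This is only pseudo-T-SQL: SourceQuery is a placeholder standing in for the Figure 4 result set, and the column names follow my simplified schema:

SELECT
    DP.ProductKey
    ,DC.CustomerKey
    ,SRC.OrderDateKey
    ,SRC.OrderQuantity
    ,SRC.SalesAmount
FROM SourceQuery SRC   -- placeholder for the Figure 4 query's output
    INNER JOIN dbo.DimProduct DP ON (DP.ProductAlternateKey = SRC.ProductID)
    INNER JOIN dbo.DimCustomer DC ON (DC.CustomerAlternateKey = SRC.AccountNumber)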
Final Steps
The final step is to load the data into the FactInternetSales fact table, replacing the natural keys for each dimension with the surrogate keys the lookup transformation found. Drag over a new instance of the OLE DB Destination component, and name it AW_DM. Edit the OLE DB Destination component, and select the AW_DM connection manager. Next, select the dbo.FactInternetSales table and click the Mappings tab. Ensure that the mappings look like those in Figure 5. Click OK to complete the package's logic.
Figure 5 OLE DB Destination Mappings for the Fact Internet Sales Fact Table
To test the Internet Sales Fact package, open the package in BIDS and press F5.
You now understand the basics of dimensional modeling and of building ETL packages with SSIS. In the third article in this series, you'll learn how to use a populated data mart to create dimensions and cubes for an SSAS database. Once you've built a cube, you can then develop an SSIS package to update these objects in the SSAS database each time new data is added to the data mart. SSIS can even prepare data for display in an SSRS report when the report requirements can't be met with a single query. As you can see, SSIS can do a lot more to help you manage your BI solution than just ETL processing.
Derek Comingore is a Senior Architect with ComFrame. As a SQL Server MVP and a certified professional in several Microsoft technologies, including SQL Server BI, he speaks at local and national SQL Server user groups and is a published author on SQL Server topics. Derek focuses on building and delivering enterprise-class data warehousing and BI solutions with SQL Server along with the other Microsoft BI software products. You can reach Derek at dcomingore@comframe.com.