Data warehouse integration
From: Developing big data solutions on Microsoft Azure HDInsight
Having explored Report level integration and Corporate data model integration for visualizing total sales and total website hits, the BI developer at Adventure Works now needs to respond to requests from business analysts who would like to use the web server log information to analyze page hits for individual products, and include this information in managed and self-service reports.
The IIS log entries include the query string passed to the web page, which for pages that display information about a product includes a productid parameter with a value that matches the relevant product code. Details of products are already stored in the data warehouse but, because product is implemented as a slowly changing dimension, the product records in the data warehouse are uniquely identified by a surrogate key and not by the original product code. Although the product code referenced in the query string is retained as an alternate key, it is not guaranteed to be unique if details of the product have changed over time.
The requirement to match the product code to the surrogate key for the appropriate version of the product makes integration at the report or corporate data model levels problematic. It is possible, at either level, to perform complex lookups that find the appropriate surrogate key for an alternate key at the time a specific page hit occurred (assuming both the surrogate and alternate keys for the products are included in the data model or report dataset). However, it is more practical to integrate the IIS log data into the dimensional model of the data warehouse so that the relationship with the product dimension (and the date dimension) is present throughout the entire enterprise BI stack.
Typically, the most problematic task in achieving integration with BI systems at the data warehouse level is matching the keys in the source data to the correct surrogate keys in the data warehouse tables, where changes to the existing data over time prompt the use of an alternate key.
Changes to the data warehouse schema
To integrate the IIS log data into the dimensional model of the data warehouse, the BI developer creates a new fact table in the data warehouse for the IIS log data, as shown in the following Transact-SQL statement.
CREATE TABLE [dbo].[FactIISLog](
  [LogDateKey] [int] NOT NULL REFERENCES DimDate(DateKey),
  [ProductKey] [int] NOT NULL REFERENCES DimProduct(ProductKey),
  [BytesSent] decimal NULL,
  [BytesReceived] decimal NULL,
  [PageHits] int NULL);
The table will be loaded with new log data on a regular schedule as part of the ETL process for the data warehouse. In common with most data warehouse ETL processes, the solution at Adventure Works makes use of staging tables as an interim store for new data, making it easier to coordinate data loads into multiple tables and perform lookups for surrogate key values. A staging table is created in a separate staging schema using the following Transact-SQL statement.
CREATE TABLE staging.IISLog(
  [LogDate] nvarchar(50) NOT NULL,
  [ProductID] nvarchar(50) NOT NULL,
  [BytesSent] decimal NULL,
  [BytesReceived] decimal NULL,
  [PageHits] int NULL);
Good practice when regularly loading data into a data warehouse is to minimize the amount of data extracted from each data source so that only data that has been inserted or modified since the last refresh cycle is included. This minimizes extraction and load times, and reduces the impact of the ETL process on network bandwidth and storage utilization. There are many common techniques you can use to restrict extractions to only modified data, and some data sources support change tracking or change data capture (CDC) capabilities to simplify this.
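For a relational data source that supports it, change data capture can be enabled with a couple of system stored procedure calls. The following is an illustrative sketch only; the database and table names are invented, and this capability is not available for the Hive tables used in this scenario.

-- Hypothetical example: enabling change data capture (CDC) on a relational source table.
-- The database and table names are invented for illustration.
USE SourceDatabase;
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name = N'SalesOrder',
  @role_name = NULL;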
In the absence of support in Hive tables for restricting extractions to only modified data, the BI developers at Adventure Works have decided to use a common pattern that is often referred to as a high water mark technique. In this pattern the highest log date value that has been loaded into the data warehouse is recorded, and used as a filter boundary for the next extraction. To facilitate this, the following Transact-SQL statement is used to create an extraction log table and initialize it with a default value.
CREATE TABLE staging.highwater([ExtractDate] datetime DEFAULT GETDATE(),
[HighValue] nvarchar(200));
INSERT INTO staging.highwater (HighValue) VALUES ('0000-00-00');
Extracting data from HDInsight
After the modifications to the data warehouse schema and staging area have been made, a solution to extract the IIS log data from HDInsight can be designed. There are many ways to transfer data from HDInsight to a SQL Server database, including:
- Using Sqoop to export the data from HDInsight and push it to SQL Server.
- Using PolyBase to combine the data in an HDInsight cluster with a Microsoft Analytics Platform System (APS) database (PolyBase is available only in APS appliances).
- Using SSIS to extract the data from HDInsight and load it into SQL Server.
In the case of the Adventure Works scenario, the data warehouse is hosted on an on-premises server that cannot be accessed from outside the corporate firewall. Since the HDInsight cluster is hosted externally in Azure, the use of Sqoop to push the data to SQL Server is not a viable option. In addition, the SQL Server instance used to host the data warehouse is running SQL Server 2012 Enterprise Edition, not APS, and PolyBase cannot be used in this scenario.
The most appropriate option, therefore, is to use SSIS to implement a package that transfers data from HDInsight to the staging table. Since SSIS is already used as the ETL platform for loading data from other business sources to the data warehouse, this option also reduces the development and management challenges for implementing an ETL solution to extract data from HDInsight.
Note
The document Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) contains a wealth of useful information about using SSIS with HDInsight.
The control flow for an SSIS package to extract the IIS log data from HDInsight is shown in Figure 1.
Figure 1 - SSIS control flow for HDInsight data extraction
This control flow performs the following sequence of tasks (the Transact-SQL for the Execute SQL tasks is sketched after the list):
1. An Execute SQL task that extracts the HighValue value from the staging.highwater table and stores it in a variable named HighWaterMark.
2. An Execute SQL task that truncates the staging.IISLog table, removing any rows left over from the last time the extraction process was executed.
3. A Data Flow task that transfers the data from HDInsight to the staging table.
4. An Execute SQL task that updates the HighValue value in staging.highwater with the highest log date value that has been extracted.
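The Transact-SQL executed by these Execute SQL tasks is not shown in the figure. The following is a minimal sketch of what the statements for steps 1, 2, and 4 might look like, based on the staging schema described earlier and assuming the staging.highwater table holds a single row that is updated in place.

-- Step 1: read the current high water mark (the single-row result is mapped to the HighWaterMark variable).
SELECT HighValue FROM staging.highwater;

-- Step 2: remove any rows left over from the previous extraction.
TRUNCATE TABLE staging.IISLog;

-- Step 4: record the highest log date value that has just been staged.
UPDATE staging.highwater
SET HighValue = (SELECT MAX(LogDate) FROM staging.IISLog),
    ExtractDate = GETDATE();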
The Data Flow task in step 3 extracts the data from HDInsight and performs any necessary data type validation and conversion before loading it into a staging table, as shown in Figure 2.
Figure 2 - SSIS data flow to extract data from HDInsight
The data flow consists of the following sequence of components:
- An ODBC Source that extracts data from the iis_log Hive table in the HDInsight cluster.
- A Data Type Conversion transformation that ensures data type compatibility and prevents field truncation issues.
- An OLE DB Destination that loads the extracted data into the staging.IISLog table.
If there is a substantial volume of data in the Hive table the extraction might take a long time, so it might be necessary to set the timeout on the ODBC source component to a high value. Alternatively, the refresh of data into the data warehouse could be carried out on a more frequent schedule, such as every week, so that less data needs to be transferred each time. You could also create a separate Hive table for the extraction and automate running a HiveQL statement every day to copy the log entries for that day to the extraction table after they've been uploaded into the iis_page_hits table. At the end of the week you perform the extraction from the (smaller) extraction table, and then empty it ready for the next week's data, as shown in the sketch below.
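A minimal HiveQL sketch of this approach follows. The extraction table name (iis_log_extract) and the literal date value are invented for illustration, and the sketch assumes the source table is a simple, non-partitioned Hive table.

-- Hypothetical one-off setup: an extraction table with the same layout as the source table.
CREATE TABLE iis_log_extract LIKE iis_page_hits;

-- Run daily by the automation job, substituting the date of the log entries that were just uploaded.
INSERT INTO TABLE iis_log_extract
SELECT * FROM iis_page_hits WHERE logdate = '2014-03-15';

-- After the weekly extraction has completed, empty the table ready for the next week's data.
TRUNCATE TABLE iis_log_extract;

The SSIS data flow would then query the (smaller) extraction table instead of the full log table.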
Every SSIS Data Flow includes an Expressions property that can be used to dynamically assign values to properties of the data flow or the components it contains, as shown in Figure 3.
Figure 3 - Assigning a property expression
In this data flow, the SqlCommand property of the Hive Table ODBC source component is assigned using the following expression.
"SELECT logdate, REGEXP_REPLACE(cs_uri_query, 'productid=', '') productid,
SUM(sc_bytes) sc_bytes, SUM(cs_bytes) cs_bytes, COUNT(*) PageHits
FROM iis_log WHERE logdate > '" + @[User::HighWaterMark] + "'
GROUP BY logdate, REGEXP_REPLACE(cs_uri_query, 'productid=', '')"
This expression consists of a HiveQL query to extract the required data, combined with the value of the HighWaterMark SSIS variable to filter the data being extracted so that only rows with a logdate value greater than the highest ones already in the data warehouse are included.
Based on the log files for the Adventure Works e-commerce site, the cs_uri_query values in the web server log file contain either the value “-” (for requests with no query string) or “productid=product-code” (where product-code is the product code for the requested product). The HiveQL query includes a regular expression that parses the cs_uri_query value and removes the text “productid=”. The Hive query therefore generates a results set that includes a productid column, which contains either the product code value or “-”.
Note
The query string example in this scenario is deliberately simplistic in order to reduce complexity. In a real-world solution, parsing query strings in a web server log may require a significantly more complex expression, and may even require a user-defined Java function.
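As an illustration only, if product pages were requested with multi-parameter query strings such as category=bikes&productid=BK-M68B-42&lang=en, a pattern-extraction function could be used instead of a simple replacement. The query string format shown here is invented for the example.

-- Hypothetical example: extract only the productid value from a multi-parameter query string.
-- Rows with no productid parameter return an empty string rather than "-".
SELECT REGEXP_EXTRACT(cs_uri_query, 'productid=([^&]+)', 1) AS productid
FROM iis_log;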
After the data has been extracted it flows to the Data Type Conversion transformation, which converts the logdate and productid values to 50-character Unicode strings. The rows then flow to the Staging Table destination, which loads them into the staging.IISLog table.
Loading staged data into a fact table
After the data has been staged, a second SSIS package is used to load it into the fact table in the data warehouse. This package consists of a control flow that contains a single Execute SQL task to execute the following Transact-SQL code.
INSERT INTO dbo.FactIISLog
  (LogDateKey, ProductKey, BytesSent, BytesReceived, PageHits)
SELECT d.DateKey, p.ProductKey, s.BytesSent, s.BytesReceived, s.PageHits
FROM staging.IISLog s
JOIN DimDate d ON s.LogDate = d.FullDateAlternateKey
JOIN DimProduct p ON s.productid = p.ProductAlternateKey
  AND (p.StartDate <= s.LogDate AND (p.EndDate IS NULL OR p.EndDate > s.LogDate))
ORDER BY d.DateKey;
The code inserts rows from the staging.IISLog table into the dbo.FactIISLog table, looking up the appropriate dimension keys for the date and product dimensions. The surrogate key for the date dimension is an integer value derived from the year, month, and day. The LogDateKey value extracted from HDInsight is a string in the format “YYYY-MM-DD”. SQL Server can implicitly convert values in this format to the Date data type, so a join can be made to the FullDateAlternateKey column in the DimDate table to find the appropriate surrogate DateKey value. The DimProduct dimension table includes a row for “None” (with the ProductAlternateKey value “-”), and one or more rows for each product.
Each product row has a unique ProductKey value (the surrogate key) and an alternate key that matches the product code extracted by the Hive query. However, because Product is a slowly changing dimension there may be multiple rows with the same ProductAlternateKey value, each representing the same product at a different point in time. When loading the staged log data, the appropriate surrogate key for the version of the product that was current when the web page was requested must be looked up based on the alternate key and the start and end date values associated with the dimension record. The join for the DimProduct table in the Transact-SQL code therefore includes a clause that checks for a StartDate value before the log date, and an EndDate value that is either after the log date or null (for the record representing the current version of the product member).
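To make these lookups concrete, the following illustrative snippet resolves a single staged row to its surrogate keys. The sample log date and product code values are invented, but the table and column names match the data warehouse schema used above.

-- Hypothetical single-row illustration of the date and product surrogate key lookups.
DECLARE @LogDate nvarchar(50) = N'2014-03-15';    -- logdate value as staged from Hive (sample value)
DECLARE @ProductID nvarchar(50) = N'BK-M68B-42';  -- product code from the query string (sample value)
SELECT
  (SELECT d.DateKey FROM dbo.DimDate d
   WHERE d.FullDateAlternateKey = @LogDate) AS LogDateKey,           -- implicit string-to-date conversion
  (SELECT p.ProductKey FROM dbo.DimProduct p
   WHERE p.ProductAlternateKey = @ProductID
     AND p.StartDate <= @LogDate                                     -- version current when the page was requested
     AND (p.EndDate IS NULL OR p.EndDate > @LogDate)) AS ProductKey;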
Using the data
Now that the IIS log data has been summarized by HDInsight and loaded into a fact table in the data warehouse, it can be used in corporate data models and reports throughout the organization. The data has been conformed to the dimensional model of the data warehouse, and this deep integration enables business users to intuitively aggregate IIS activity across dates and products. For example, Figure 4 shows how a user can create a Power View visualization in Excel that includes sales and page view information for product categories and individual products.
Figure 4 - A Power View report showing data that is integrated in the data warehouse
The inclusion of IIS server log data from HDInsight in the data warehouse enables it to be used easily throughout the entire BI ecosystem, in managed corporate reports and self-service BI scenarios. For example, a business user can use Report Builder to create a report that includes web site activity as well as sales revenue from a single dataset, as shown in Figure 5.
Figure 5 - Creating a self-service report with Report Builder
When published to an SSRS report server, this self-service report provides an integrated view of business data, as shown in Figure 6.
Figure 6 - A self-service report based on a data warehouse that contains data from HDInsight