Scenario 4: BI integration
From: Developing big data solutions on Microsoft Azure HDInsight
This scenario explores ways in which big data batch processing with HDInsight can be integrated into a business intelligence (BI) solution in a corporate environment. The emphasis in this scenario is on the challenges and techniques associated with integrating data from HDInsight into a BI ecosystem based on Microsoft SQL Server and Office technologies. This includes integration at the report, corporate data model, and data warehouse levels of an enterprise BI solution, as well as how insights from big data analysis in HDInsight can be shared in a self-service BI solution built on Office 365 and Power BI.
The scenario includes and demonstrates:
- Introduction to Adventure Works
- The analytical goals
- The HDInsight solution
- Report level integration
- Corporate data model integration
- Data warehouse integration
- Collaborative self-service BI
The data ingestion and processing elements of the example used in this scenario have been deliberately kept simple in order to focus on the integration techniques. In a real-world solution the challenge of obtaining the source data, loading it to the HDInsight cluster, and using map/reduce code, Pig, or Hive to process it before consuming it in a BI infrastructure are likely to be more complex than described in this scenario.
Introduction to Adventure Works
Adventure Works is a fictional company that manufactures and sells bicycles and cycling equipment. Adventure Works sells its products through an international network of resellers and also through its own e-commerce site, which is hosted in an Internet Information Services (IIS)-based datacenter.
As a large-scale multinational company, Adventure Works has already made a considerable investment in data and BI technologies to support formal corporate reporting, as well as for business analysis. In addition to an enterprise BI solution that includes reporting and dashboards, Adventure Works has empowered business analysts to perform self-service analysis and reporting using Excel, and has recently added the Power BI service to the company’s Office 365 subscription.
The existing enterprise BI solution
The BI solution at Adventure Works is built on an enterprise data warehouse hosted in SQL Server Enterprise Edition. SQL Server Integration Services (SSIS) is used to refresh the data warehouse with new and updated data from line of business systems. This includes sales transactions, financial accounts, and customer profile data. The high-level architecture of the Adventure Works BI solution is shown in Figure 1.
Figure 1 - The Adventure Works enterprise BI solution
The enterprise data warehouse is based on a dimensional design in which multiple dimensions of the business are conformed across aggregated measures. The dimensions are implemented as dimension tables, and the measures are stored in fact tables at the lowest common level of grain (or granularity). A partial schema of the data warehouse is shown in Figure 2.
Figure 2 - Partial data warehouse schema
Note
Figure 2 shows only a subset of the data warehouse schema. Some tables and columns have been omitted for clarity.
The data warehouse supports a corporate data model, which is implemented as a SQL Server Analysis Services (SSAS) tabular database. This data model provides a cube for analysis in Excel and reporting in SQL Server Reporting Services (SSRS), and shown in Figure 3.
Figure 3 - An SSAS tabular data model
The SSAS data model supports corporate reporting, including a sales performance dashboard that shows a scorecard of key performance indicators (KPIs), as shown in Figure 4. Additionally, the data model is used by Adventure Works business users to create PivotTables and PivotCharts in Excel.
Figure 4 - An SSRS report based on the SSAS data model
In addition to the managed reporting and analysis supported by the corporate data model, Adventure Works has empowered business analysts to engage in self-service BI activities, including the creation of SSRS reports with Report Builder and the use of PowerPivot in Excel to create personal data models directly from tables in the data warehouse, as shown in Figure 5.
Figure 5 - Creating a personal data model with PowerPivot
The analytical goals
The existing BI solution provides comprehensive reporting and analysis of important business data. The IIS web server farm hosting the e-commerce site generates log files that are retained and used for troubleshooting purposes, but until now these log files have never been considered a viable source of business information. The web servers generate a new log file each day, and each log contains details of every request received and processed by the web site. This provides a huge volume of data that could potentially provide useful insights into customers’ activity on the e-commerce site.
A small subset of the log file data is shown in the following example.
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2008-01-01 09:15:23
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
2008-01-01 00:01:00 198.51.100.2 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
2008-01-01 00:04:00 198.51.100.5 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:05:00 198.51.100.6 - 192.0.0.1 80 GET /product.aspx productid=BL-2036 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:06:00 198.51.100.7 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
/product.aspx productid=CN-6137 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
The ability to analyze the log data and summarize website activity over time would help the business to measure the amount of data transferred during web requests, and potentially correlate web activity with sales transactions to better understand trends and patterns in e-commerce sales. However, the large volume of log data that must be processed in order to extract these insights has prevented the company from attempting to include the log data in the enterprise data warehouse.
The company has recently decided to use HDInsight to process and summarize the log data so that it can be reduced to a more manageable volume, and integrated into the enterprise BI ecosystem. The developers will integrate the results of the processing at all three levels of their existing BI system, as shown in Figure 6, and also enable self-service BI through Power BI for Office 365.
Figure 6 - The three levels for integration of the results into the existing BI system.