Report level integration
From: Developing big data solutions on Microsoft Azure HDInsight
Now that the log data is encapsulated in a Hive table, HiveQL queries can be used to summarize the log entries and extract aggregated values for reporting and analysis. Initially, business analysts at Adventure Works want to try to find a relationship between the number of website hits (obtained from the web server log data in HDInsight) and the number of items sold (available from the enterprise data warehouse). Since only the business analysts require this combined data, there is no need at this stage to integrate the web log data from HDInsight into the entire enterprise BI solution. Instead, a business analyst can use PowerPivot to create a personal data model in Excel specifically for this mashup analysis.
Adding data from a Hive table to a PowerPivot model
The business analyst starts with the PowerPivot model shown in Figure 5 in the topic Scenario 4: BI integration, which includes a Date table and an Internet Sales table based on the DimDate and FactInternetSales tables in the data warehouse. To add IIS log data from HDInsight the business analyst uses an ODBC connection based on the Microsoft Hive ODBC driver, as shown in Figure 1.
Figure 1 - Creating an ODBC connection to Hive on HDInsight
The ODBC connection to the HDInsight cluster is typically defined in a data source name (DSN) on the local computer, which makes it easy to define a connection for programs that will access data in the cluster. The DSN encapsulates a connection string such as this:
DRIVER={Microsoft Hive ODBC Driver};Host=<cluster_name>.azurehdinsight.net;Port=443; Schema=default;
RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;UID=UserName;PWD=Password;DefaultStringColumnLength=4000
After the connection has been defined and tested, the business analyst uses the following HiveQL query to create a new table named Page Hits that contains aggregated log data from HDInsight.
SELECT logdate, COUNT(*) hits FROM iis_log GROUP BY logdate
This query returns a single row for each distinct date that has log entries, along with a count of the number of page hits that were recorded on that date. The logdate values in the underlying Hive table are defined as text, but the yyyy-mm-dd format of the text values means that the business analyst can simply change the data type for the column in the PowerPivot table to Date; making it possible to create a relationship that joins the logdate column in the Page Hits table to the Date column in the Date table, as shown in Figure 2.
Figure 2 - Creating a relationship in PowerPivot
Note
When you are designing Hive queries that return tables you want to integrate with your BI system, you should plan ahead by considering the appropriate format for columns that contain key values you will use in the relationships with existing tables.
The business analyst can now use Excel to analyze the data and try to correlate web page site activity in the form of page hits with sales transactions in terms of the number of units sold. Figure 3 shows how the business analyst can use Power View in Excel to visually compare page hits and sales over a six month period and by quarter, determine weekly patterns of website activity, and filter the visualizations to show comparisons for a specific weekday.
Figure 3 - Using Power View in Excel to analyze data from HDInsight and the data warehouse
By integrating IIS log data from HDInsight with enterprise BI data at the report level, business analysts can create mashup reports and analyses without impacting the BI infrastructure used for corporate reporting. However, after using this report-level integration to explore the possibilities of using IIS log data to increase understanding of the business, it has become apparent that the log data could be useful to a wider audience of users than just business analysts, and for a wider range of business processes. This can be achieved through Corporate data model integration.