Collaborative self-service BI
From: Developing big data solutions on Microsoft Azure HDInsight
In addition to the enterprise BI solution at Adventure Works, described in Scenario 4: BI integration, business analysts use Excel and SharePoint Server to create and share their own analytical models. This self-service BI approach has become increasingly useful at Adventure Works because it makes it easier for business analysts to rapidly develop custom reports that combine internal and external data, without over-burdening the IT department with requests for changes to the data warehouse. The company has therefore added the Power BI service to its corporate Office 365 subscription, and encourages business analysts to use it to share insights gained from their analysis.
Custom big data processing
The business analysts at Adventure Works are not software developers, and do not have the expertise to create custom map/reduce components using Java or C#. However, some of the senior analysts are proficient in higher-level Hadoop languages such as Hive and Pig, and they can use these languages to generate analytical datasets from source data in HDInsight. This ability to perform custom processing means that the business analysts can store analytical datasets as files in Azure blob storage without relying on the HDInsight cluster remaining available to service Hive queries.
For example, a senior business analyst can use the following Pig script to generate a result set that is saved as a file in Azure blob storage.
Logs = LOAD '/data' USING PigStorage(' ')
AS (log_date, log_time, c_ip, cs_username, s_ip, s_port, cs_method, cs_uri_stem, cs_uri_query,
sc_status, sc_bytes:int, cs_bytes:int, time_taken:int, cs_user_agent, cs_referrer);
CleanLogs = FILTER Logs BY SUBSTRING(log_date, 0, 1) != '#';
GroupedLogs = GROUP CleanLogs BY log_date;
GroupedTotals = FOREACH GroupedLogs GENERATE group, COUNT(CleanLogs) AS page_hits,
SUM(CleanLogs.sc_bytes) AS bytes_received, SUM(CleanLogs.cs_bytes) AS bytes_sent;
DailyTotals = FOREACH GroupedTotals GENERATE FLATTEN(group) as log_date, page_hits, bytes_received, bytes_sent;
SortedDailyTotals = ORDER DailyTotals BY log_date ASC;
STORE SortedDailyTotals INTO '/webtraffic';
Running this Pig script produces a file named part-r-00000 in the /webtraffic folder in the Azure blob storage container used by the HDInsight cluster. The file contains the date, total page hits, and total bytes received and sent, for each day. This file will be persisted even if the HDInsight cluster is deactivated.
Creating and sharing a query
After processing the source data in HDInsight, business analysts can use Power Query in Excel to retrieve the results as a file from the Azure storage container used by the HDInsight cluster. Power Query provides a query editing wizard that makes it easy to connect to a range of data sources (including Azure storage), filter and shape the data that is retrieved, and import it into a worksheet or data model. Additionally, business analysts can combine multiple queries that retrieve data from different sources to generate mashup datasets for analysis.
Figure 1 shows how Power Query can be used to import the output file generated by the Pig script shown previously.
Figure 1 - Using Power Query to retrieve Pig output
Having defined a query to retrieve the output generated by the Pig script, the business analyst can sign into Power BI for Office 365 and share the query with other users in the organization, as shown in Figure 2.
Figure 2 - Sharing a Query
After the query has been shared, other users in the organization can discover the results through the Power BI Online Search feature, which enables users to search public datasets that are curated by Microsoft as well as organizational datasets that are curated by data stewards within the organization. Figure 3 shows an online search for the term page hits. The results include the Adventure Works Web Traffic query that was shared previously.
Figure 3 - Discovering data with Online Search
Self-service analysis with discovered data
Having discovered the dataset, a business user can use Power Query to refine and import it into an Excel data model, combine it with other data, and include it in analysis and reports (as discussed in Report level Integration). For example, Figure 4 shows a Power View report based on data imported from a shared query based on output of the Pig script discussed previously.
Figure 4 - A Power View report based on data imported from a shared query
To share the insights gained from the data, business users can publish Excel workbooks that contain PowerPivot data models and Power View visualizations as reports in a Power BI site, as shown in Figure 5.
Figure 5 - A report in a Power BI site
Users can view and interact with reports in a Power BI site using a web browser or the Windows Store app, which makes Power BI reports available on tablets and other touch devices running Windows 8 or Windows RT. In addition, the Q&A feature of Power BI sites enables users to query data models in shared workbooks using natural language queries. For example, users can generate data visualizations by entering a query such as “How many page hits were then in April?” or “Show average bytes by month.” Power BI automatically chooses an appropriate way to visualize the data, but users can override this in queries such as “Show average page hits and average sales amount by month as line chart.” Figure 6 shows how a business user can use Q&A in a Power BI site.
Figure 6 - Using Q&A in a Power BI site
Self-Service BI offers significant advantages for organizations that want to empower users to explore big data. It minimizes the burden on IT, and increases the agility of the organization by reducing the time it takes to incorporate the results of big data processing into data models and reports. Self-service solutions for sharing and discovering data and insights through tools like Power BI for Office 365 can complement more formally managed enterprise BI infrastructure based on SQL Server technologies such as Analysis Services, Reporting Services, and Integration Services, and HDInsight can be used to integrate big data processing results into an organization’s BI in multiple ways.