Use case 4: BI integration

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

Microsoft offers a set of applications and services to support end-to-end solutions for working with data (described in Understanding Microsoft big data solutions). The Microsoft data platform includes all of the elements that make up an enterprise level business intelligence (BI) system for organizations of any size. Organizations that already use an enterprise BI solution for business analytics and reporting can extend their analytical capabilities by using a big data solution such as HDInsight to add new sources of data to their decision making processes.

The following sections of this topic provide more information:

  • Use case and model overview
  • When to choose this model
  • Data sources
  • Output targets
  • Considerations
  • Summary of integration level scenarios and considerations

Note

The information in this section will help you to understand how you can integrate HDInsight with an enterprise BI system. However, a complete discussion of enterprise BI systems is beyond the scope of this guide.

Use case and model overview

Enterprise BI is a topic in itself, and there are several factors that require special consideration when integrating a big data solution such as HDInsight with an enterprise BI system. Figure 1 shows an overview of a typical enterprise data warehouse and BI solution.

Figure 1 - Overview of a typical enterprise data warehouse and BI implementation

Figure 1 - Overview of a typical enterprise data warehouse and BI implementation

Data flows from business applications and other external sources into a data warehouse through an ETL process. Corporate data models are then used to provide shared analytical structures such as online analytical processing (OLAP) cubes or data mining models, which can be consumed by business users through analytical tools and reports. Some BI implementations also enable analysis and reporting directly from the data warehouse, enabling advanced users such as business analysts and data scientists to create their own personal analytical data models and reports.

Many technologies and techniques can be used to integrate HDInsight with existing BI technologies. However, as a general guide there are three primary levels of integration, as shown in Figure 2.

Figure 2 - Three levels of integration for big data with an enterprise BI system

Figure 2 - Three levels of integration for big data with an enterprise BI system

The integration levels shown in Figure 2 are:

  • Report level integration. Data from HDInsight is used in reporting and analytical tools to augment data from corporate BI sources, enabling the creation of reports that include data from corporate BI sources as well as from HDInsight, and also enabling individual users to combine data from both solutions into consolidated analyses. This level of integration is typically used for creating mashups, exploring datasets to discover possible queries that can find hidden information, and for generating one off reports and visualizations.
  • Corporate data model level integration. HDInsight is used to process data that is not present in the corporate data warehouse, and the results of this processing are then added to corporate data models where they can be combined with data from the data warehouse and used in multiple corporate reports and analysis tools. This level of integration is typically used for exposing the data in specific formats to information systems, and for use in reporting and visualization tools.
  • Data warehouse level integration. HDInsight is used to prepare data for inclusion in the corporate data warehouse. The data that has been loaded is then available throughout the entire enterprise BI solution. This level of integration is typically used to create standalone tables on the same database hardware as the enterprise data warehouse, which provides a single source of enterprise data for analysis, or to incorporate the data into a dimensional schema and populate dimension and fact tables for full integration into the BI solution.

Note

An example of applying this use case and model can be found in Scenario 4: BI integration.

The following sections describe the three integration levels in more detail to help you understand the implications of your choice. They also contain guidelines for implementing each one. However, keep in mind that you don’t have to use the same integration level for all of your processes. You can use a different approach for each dataset that you extract from HDInsight, depending on the scenario and the requirements for that dataset.

Report level integration

Using a big data solution in a standalone iterative and experimental way, as described in Use case 1: Iterative exploration, can unlock information from data sources that have not yet been analyzed. However, much greater business value can be obtained by integrating the results from a big data solution with data and BI activities that are already present in the organization.

In contrast to the rigidly defined reports created by BI developers, a growing trend is a “self-service” approach. In this approach the data warehouse provides some datasets based on data models and queries defined there, but the user selects the datasets and builds a custom report. The ability to combine multiple data sources in a personal data model enables a more flexible approach to data exploration that goes beyond the constraints of a formally managed corporate data warehouse. Users can augment reports and analysis of data from the corporate BI solution with additional data from a big data solution to create a mashups that bring together data from both sources into a single, consolidated report.

You can use the following techniques to integrate HDInsight with enterprise BI data at the report level:

  • Use the Power Query add-in to download the output files generated in the cluster and open them in Excel, or import them into a database for reporting.
  • Create Hive tables in the cluster and consume them directly from Excel (including using Power Query, Power Pivot, Power View, and Power Map) or from SQL Server Reporting Services (SSRS) by using the Hive ODBC driver.
  • Download the required data as a delimited file from the cluster’s Azure blob storage container, perhaps by using PowerShell, and open it in Excel or another data analysis and visualization tool.

Corporate data model level integration

Integration at the report level makes it possible for advanced users to combine data from the cluster with existing corporate data sources to create data mashups. However, there may be some scenarios where you need to deliver combined information to a wider audience, or to users who do not have the time or ability to create their own complex data analysis solutions. Alternatively, you might want to take advantage of the functionality available in corporate data modeling platforms to add value to the insights you have gained from the information contained within the source data.

By integrating data from your big data solution into corporate data models you can accomplish both of these aims, and use the data as the basis for enterprise reporting and analytics. Integrating the output from HDInsight with your corporate data models allows you to use tools such as SQL Server Analysis Services (SSAS) to analyze the data and present it in a format that is easy to use in reports, or for performing deeper analysis.

You can use the following techniques to integrate the results into a corporate data model:

  • Create Hive tables in the cluster and consume them directly from a SSAS tabular model by using the Hive ODBC driver. SSAS in tabular mode supports the creation of data models from multiple data sources and includes an OLE DB provider for ODBC, which can be used as a wrapper around the Hive ODBC driver.
  • Create Hive tables in the cluster and then create a linked server in the instance of the SQL Server database source used by an SSAS multidimensional data model so that the Hive tables can be queried through the linked server and imported into the data model. SSAS in multidimensional mode can only use a single OLE DB data source, and the OLE DB provider for ODBC is not supported.
  • Use Sqoop or SQL Server Integration Services (SSIS) to copy the data from the cluster to a SQL Server database engine instance that can then be used as a source for an SSAS tabular or multidimensional data model.

Note that you must choose between the multidimensional and the tabular data mode when you install SQL Server Analysis Services, though you can install two instances if you need both modes.

When installed in tabular mode, SSAS supports the creation of data models that include data from multiple diverse sources, including ODBC-based data sources such as Hive tables.

When installed in multidimensional mode, SSAS data models cannot be based on ODBC sources due to some restrictions in the designers for multidimensional database objects. To use Hive tables as a source for a multidimensional SSAS model, you must either extract the data from Hive into a suitable source for the multidimensional model (such as a SQL Server database), or use the Hive ODBC driver to define a linked server in a SQL Server instance that provides “pass through” access to the Hive tables, and then use the SQL Server instance as the data source for the multidimensional model. You can download a case study that describes how to create a multidimensional SSAS model that uses a linked server in a SQL Server instance to access data in Hive tables.

Data warehouse level integration

Integration at the corporate data model level enables you to include data from a big data solution in a data model that is used for analysis and reporting by multiple users, without changing the schema of the enterprise data warehouse. However, in some cases you might want to use HDInsight as a component of the overall ETL solution that is used to populate the data warehouse, in effect using a source of data queried through HDInsight just like any other business data source, and consolidating the data from all sources into an enterprise dimensional model.

In addition, as with other data sources, it’s likely that the data import process from the cluster into the database tables will occur on a schedule so that the data is as up to date as possible. The schedule will depend on the time taken to execute the queries and perform ETL tasks prior to loading the results into the database tables.

You can use the following techniques to integrate data from HDInsight into an enterprise data warehouse:

  • Use Sqoop to copy data directly into database tables. These might be tables in a SQL Server data warehouse that you do not want to integrate into the dimensional model of the data warehouse. Alternatively, they may be tables in a staging database where the data can be validated, cleansed, and conformed to the dimensional model of the data warehouse before being loaded into the fact and dimension tables. Any firewalls located between the cluster and the target database must be configured to allow the database protocols that Sqoop uses.
  • Use PolyBase for SQL Server to copy data directly into database tables. PolyBase is a component of Microsoft Analytical Platform System (APS) and is available only on APS appliances (see PolyBase on the SQL Server website). The tables might be in a SQL Server data warehouse and you do not want to integrate them into the dimensional model of the data warehouse. Alternatively, they may be tables in a staging database where the data can be validated, cleansed, and conformed to the dimensional model of the data warehouse before being loaded into the fact and dimension tables. Any firewalls located between the cluster and the target database must be configured to allow the database protocols that PolyBase uses.
  • Create an SSIS package that reads the output file from the cluster, or uses the Hive ODBC driver to extract the data, and then validates, cleanses, and transforms it before loading it into the fact and dimension tables in the data warehouse.
  • Create a Linked Server in SQL Server that links to Hive tables in HDInsight through the Hive ODBC Driver. You can then execute SQL queries that extract the data from HDInsight. However, you must be aware of some issues such as compatible data types and some language syntax limitations. For more information see How to create a SQL Server Linked Server to HDInsight HIVE using Microsoft Hive ODBC Driver.

Note

When reading data from HDInsight you must open port 1000 on the cluster—you can do this using the management portal. For more information see Configure the Windows Firewall to Allow SQL Server Access.

When to choose this model

This model is typically suited to the following scenarios:

  • You have an existing enterprise data warehouse and BI system that you want to augment with data from outside your organization.
  • You want to explore new ways to combine data in order to provide better insight into history and to predict future trends.
  • You want to give users more opportunities for self-service reporting and analysis that combines managed business data and big data from other sources.

Data sources

The input data can be almost anything, but for the BI integration model it typically includes the following:

  • Social media data, log files, sensor data, and the output from applications that generate data files.
  • Datasets obtained from Azure Marketplace and other commercial data providers.
  • Streaming data captured, filtered, and processed through a suitable tool or framework (see Collecting and loading data into HDInsight).

Output targets

The results from your HDInsight queries can be visualized using any of the wide range of tools that are available for analyzing data, combining it with other datasets, and generating reports. Typical examples for the BI integration model are:

  • SQL Server Reporting Services (SSRS).
  • SharePoint Server or other information management systems.
  • Business performance dashboards such as PerformancePoint Services in SharePoint Server.
  • Interactive analytical tools such as Excel, Power Query, Power Pivot, Power View, and Power Map.
  • Custom or third party analysis and visualization tools.

Note

You will see more details of these tools in Consuming and visualizing data from HDInsight.

Considerations

There are some important points to consider when choosing the BI integration model:

  • This model is typically used when you want to:
    • Integrate external data sources with your enterprise data warehouse.
    • Augment the data in your data warehouse with external data.
    • Update the data at scheduled intervals or on demand.
  • ETL processes in a data warehouse usually execute on a scheduled basis to add new data to the warehouse. If you intend to integrate the results from HDInsight into your data warehouse so that the information stored there is updated, you must consider how you will automate and schedule the tasks of executing the query and importing the results.
  • You must ensure that data imported from your HDInsight solution contains valid values, especially where there are typically multiple common possibilities (such as in street addresses and city names). You may need to use a data cleansing mechanism such as Data Quality Services to force such values to the correct leading value.
  • Most data warehouse implementations use slowly changing dimensions to manage the history of values that change over time. Different versions of the same dimension member have the same alternate key but unique surrogate keys, and so you must ensure that data imported into the data warehouse tables uses the correct surrogate key value. This means that you must either:
    • Use some complex logic to match the business key in the source data (which will typically be the alternate key) with the correct surrogate key in the data model when you join the tables. If you simply join on the alternate key, some loss of data accuracy may occur because the alternate key is not guaranteed to be unique.
    • Load the data into the data warehouse and conform it to the dimensional data model, including setting the correct surrogate key values.

One of the difficult tasks in full data warehouse integration is matching rows imported from a big data solution to the correct dimension members in the data warehouse dimension tables. You must use a combination of the alternate key and the point in time to which the imported row relates in order to look up the correct surrogate key. This key can differ based on the date when changes were made to the original entity. For example, a product may have more than one surrogate key over its lifetime, and the imported data must match the correct version of this key.

Summary of integration level scenarios and considerations

The following table summarizes the primary considerations for deciding whether to adopt one of the three integration approaches described in this topic. Each approach has its own benefits and challenges.

Integration Level

Typical Scenarios

Considerations

None

No enterprise BI solution currently exists.

The organization wants to evaluate HDInsight and big data analysis without affecting current BI and business operations.

Business analysts in the organization want to explore external or unmanaged data sources that are not included in the managed enterprise BI solution.

Open-ended exploration of unmanaged data could produce a lot of business information, but without rigorous validation and cleansing the data might not be completely accurate.

Continued experimental analysis may become a distraction from the day-to-day running of the business, particularly if the data being analyzed is not related to core business data.

Report Level

A small number of individuals with advanced self-service reporting and data analysis skills need to augment corporate data that is available from the enterprise BI solution with big data from other sources.

A single report combining corporate data and some external data is required for one-time analysis.

It can be difficult to find common data values on which to join data from multiple sources to gain meaningful comparisons.

Integrated data in a report can be difficult to share and reuse in different ways, though the ability to share Power Pivot workbooks in SharePoint Server may provide a solution for small to medium sized groups of users.

Corporate Data Model Level

A wide audience of business users must consume multiple reports that rely on data from both the corporate data warehouse and big data. These users may not have the skills necessary to create their own reports and data models.

Data from a big data solution is required for specific business logic in a corporate data model that is used for reports and dashboards.

It can be difficult to find common data values on which to join data from multiple sources.

Integrating data generated by HDInsight into tabular SSAS models can be accomplished easily through the Hive ODBC driver. Integration with multidimensional models is more challenging.

Data Warehouse Level

The organization wants a complete managed BI platform for reporting and analysis that includes business application data sources as well as big data sources.

The desired level of integration between data from HDInsight and existing business data, and the required tolerance for data integrity and accuracy across all data sources, necessitates a formal dimensional model with conformed dimensions.

Data warehouses typically have demanding data validation requirements.

Loading data into a data warehouse that includes slowly changing dimensions with surrogate keys can require complex ETL processes.

Next Topic | Previous Topic | Home | Community