SQL Server Analysis Services
From: Developing big data solutions on Microsoft Azure HDInsight
In many organizations, enterprise BI reporting and analytics is a driver for decision making and business processes that involve multiple users in different parts of the business. In these kinds of scenarios, analytical data is often structured into a corporate data model (often referred to colloquially as a cube) and accessed from multiple client applications and reporting tools. Corporate data models help to simplify the creation of analytical mashups and reports, and ensure consistent values are calculated for core business measures and key performance indicators across the organization.
On the Microsoft data platform, SQL Server Analysis Services (SSAS) provides a mechanism for creating and publishing corporate data models that can be used as a source for PivotTables in Excel, reports in SQL Server Reporting Services, dashboards in SharePoint Server PerformancePoint Services, and other BI reporting and analysis tools.
You can install SSAS in one of two modes; Tabular mode or Multidimensional mode. Tabular mode supports loading data with the ODBC driver, which means that you can connect to the results of a Hive query to analyze the data. However, the Multidimensional mode does not support ODBC, and so you must either import the data into SQL Server or configure a linked server to pass through the query and results.
Tabular mode
When installed in Tabular mode, SSAS can be used to host tabular data models that are based on the xVelocity in-memory analytics engine. These models use the same technology and design as PowerPivot data models in Excel but can be scaled to handle much larger volumes of data, and they can be secured using enterprise-level role based security. You create tabular data models in the Visual Studio-based SQL Server Data Tools development environment, and you can choose to create the data model from scratch or import an existing PowerPivot for Excel workbook.
Because Tabular models support ODBC data sources, you can easily include data from Hive tables in the data model. You can use HiveQL queries to pre-process the data as it is imported in order to create the schema that best suits your analytical and reporting goals. You can then use the modeling capabilities of SQL Server Analysis Services to create relationships, hierarchies, and other custom model elements to support the analysis that users need to perform.
Figure 1 shows a Tabular model for the weather data used in this section of the guide.
Figure 1 - A Tabular SQL Server Analysis Services data model
The following table describes specific considerations for using tabular SSAS data models in the HDInsight use cases and models described in this guide.
Use case |
Considerations |
---|---|
Iterative data exploration |
If the results of HDInsight data processing can be encapsulated in Hive tables, and multiple users must perform consistent analysis and reporting of the results, an SSAS tabular model is an easy way to create a corporate data model for analysis. However, if the analysis will only be performed by a small group of specialist users, you can probably achieve this by using PowerPivot in Excel. |
Data warehouse on demand |
When HDInsight is used to implement a basic data warehouse, it usually includes a schema of Hive tables that are queried over ODBC connections. A tabular data model is a good way to create a consistent analytical and reporting schema based on the Hive tables. |
ETL automation |
If the target of the HDInsight-based ETL process is a relational database, you might build a tabular data model based on the tables in the database in order to enable enterprise-level analysis and reporting. |
BI integration |
If the enterprise BI solution already uses tabular SSAS models, you can add Hive tables to these models and create any necessary relationships to support integrated analysis of corporate BI data and big data results from HDInsight. However, if the corporate BI data warehouse is based on a dimensional model that includes surrogate keys and slowly changing dimensions, it can be difficult to define relationships between tables in the two data sources. In this case, integration at the data warehouse level may be a better solution. |
Multidimensional mode
As an alternative to Tabular mode, SSAS can be installed in Multidimensional mode. Multidimensional mode provides support for a more established online analytical processing (OLAP) approach to cube creation, and is the only mode supported by releases of SSAS prior to SQL Server 2012. Additionally, if you plan to use SSAS data mining functionality, you must install SSAS in Multidimensional mode.
Multidimensional mode includes some features that are not supported or are difficult to implement in Tabular data models, such as the ability to aggregate semi-additive measures across accounting dimensions and use international translations in the cube definition. However, although Multidimensional data models can be built on OLE DB data sources, some restrictions in the way cube elements are implemented means that you cannot use an ODBC data source. Therefore, there is no way to directly connect dimensions or measure groups in the data model to Hive tables in an HDInsight cluster.
To use HDInsight data as a source for a Multidimensional data model in SSAS you must either transfer the data from HDInsight to a relational database system such as SQL Server, or define a linked server in another SQL Server instance that can act as a proxy and pass queries through to Hive tables in HDInsight. The use of linked servers to access Hive tables from SQL Server, and techniques for transferring data from HDInsight to SQL Server, are described in the topic SQL Server database.
Figure 2 shows a Multidimensional data model in Visual Studio. This data model is based on views in a SQL Server database, which are in turn based on queries against a linked server that references Hive tables in HDInsight.
Figure 2 - A Multidimensional SQL Server Analysis Services data model
The following table describes specific considerations for using Multidimensional SSAS data models in the HDInsight use cases and models described in this guide.
Use case |
Considerations |
---|---|
Iterative data exploration |
For one-time analysis, or analysis by a small group of users, the requirement to use a relational database such as SQL Server as a proxy or interim host for the HDInsight results means that this approach involves more effort than using a tabular data model or just analyzing the data in Excel. |
Data warehouse on demand |
When HDInsight is used to implement a basic data warehouse, it usually includes a schema of Hive tables that are queried over ODBC connections. To include these tables in a multidimensional SSAS data model you would need to use a linked server in another SQL Server instance to access the Hive tables on behalf of the data model, or transfer the data from HDInsight to a relational database. Unless you specifically require advanced OLAP capabilities that can only be provided in a multidimensional data model, a tabular data model is probably a better choice. |
ETL automation |
If the target of the HDInsight-based ETL process is a relational database that can be accessed through an OLE DB connection, you might build a multidimensional data model based on the tables in the database to enable enterprise-level analysis and reporting. |
BI integration |
If the enterprise BI solution already uses multidimensional SSAS models that you want to extend to include data from HDInsight, you should integrate the data at the data warehouse level and base the data model on the relational data warehouse. |
Guidelines for using SQL Server Analysis Services with HDInsight
When using SSAS with HDInsight, consider the following guidelines:
- You cannot use ODBC data sources in a Multidimensional SSAS database. If you must include Hive tables in a Multidimensional model, consider defining a linked server in a SQL Server instance and adding SQL Server views that query the Hive tables to the SSAS data source.
- If you are including data from Hive tables in a Tabular data model, use an explicit ODBC connection string instead of a DSN. This will enable the data model to be refreshed when stored on a server where the DSN is not available.
- Consider the life cycle of the HDInsight cluster when scheduling data model processing. When the model is processed, it refreshes partitions from the original data sources and so you should ensure that that the HDInsight cluster and its Hive tables will be available when partitions based on them are processed.