Power Query

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

The Power Query add-in enhances Excel by providing a comprehensive interface for querying a wide range of data sources. It can also be used to perform data enhancements such as cleansing data by replacing values, and combining data sets from different sources. Power Query includes a data source provider for HDInsight, which enables users to browse the folders in Azure blob storage that are associated with an HDInsight cluster. You can download the Power Query add-in from the Office website.

By connecting directly to blob storage, users can import data from files such as those generated by map/reduce jobs and Pig scripts, and import the underlying data files associated with Hive tables—even if the cluster is not running or has been deleted. This enables organizations to consume the results of HDInsight processing, while significantly reducing costs if no further HDInsight processing is required.

Note

Keeping an HDInsight cluster running when it is not executing queries just so that you can access the data incurs charges to your Azure account. If you are not using the cluster, you can close it down but still be able to access the data at any time using Power Query in Excel, or any other tool that can access Azure blob storage.

With the Power Query add-in installed, you can use the From Other Sources option on the Power Query tab on the Excel ribbon to import data from HDInsight. You must specify the account name and key of the Azure blob store, not the HDInsight cluster itself. After connecting to the Azure blob store you can select a data file, convert its contents to a table by specifying the appropriate delimiter, and modify the data types of the columns before importing it into a worksheet, as shown in Figure 1.

Figure 1 - Importing data from HDInsight with Power Query

Figure 1 - Importing data from HDInsight with Power Query

The imported data can be added to the workbook data model, or analyzed directly in the worksheet. The following table describes specific considerations for using Power Query in the HDInsight use cases and models described in this guide.

Use case

Considerations

Iterative data exploration

Power Query is a good choice when HDInsight data processing techniques such as map/reduce jobs or Pig scripts generate files that contain the results to be analyzed or reported. The HDInsight cluster can be deleted after the processing is complete, leaving the results in Azure blob storage ready to be consumed by business users in Excel. With the addition of a Power BI for Office 365 subscription, queries that return data from files in Azure blob storage can be shared—making big data processing results discoverable by other Excel users in the enterprise through the Online Search feature.

Data warehouse on demand

When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive tables that are queried over ODBC connections. While it is technically possible to use Power Query to import data from the files on which the Hive tables are based, more complex Hive solutions that include partitioned tables would be difficult to consume this way.

ETL automation

Most ETL scenarios are designed to transform big data into a suitable structure and volume for storage in a relational data source for further analysis and querying. It is unlikely that Power Query would be used to consume data files from the blob storage associated with the HDInsight cluster, though it may be used to consume data from the relational data store loaded by the ETL process.

BI integration

Importing data from a file in Azure blob storage and combining it with data from a BI data source (such as a relational data warehouse or corporate data model) is an effective way to accomplish report-level integration with an enterprise BI solution. Additionally, users can import the datasets retrieved by Power Query into a PowerPivot data model, and publish workbooks containing data models and Power View visualizations to Power BI for Office 365 in a self-service BI scenario.

Guidelines for using Power Query with HDInsight

When using Power Query to consume output files from HDInsight, consider the following guidelines:

  • Ensure that the big data processing jobs you use to generate data for analysis store their output in appropriately named folders. This makes it easier for Power Query users to find output files with generic names such as part-r-00000.
  • You can apply filters and sophisticated transformations to data in Power Query queries while importing the output file from a big data processing job. However, you should generally try to perform as much as possible of the required filtering and shaping within the big data processing job itself in order to simplify the query that Excel users need to create.
  • Ensure that Power Query users are familiar with the schema of output files generated by big data processing jobs. Output files generally do not include column headers.
  • When a big data processing job generates multiple output files you can use multiple Power Query queries to combine the data.

Next Topic | Previous Topic | Home | Community