Tutorial: Query Apache Hive with ODBC and PowerShell
Article
Microsoft ODBC drivers provide a flexible way to interact with different kinds of data sources, including Apache Hive. You can write code in scripting languages like PowerShell that use the ODBC drivers to open a connection to your Hive cluster, pass a query of your choosing, and display the results.
In this tutorial, you'll do the following tasks:
Download and install the Microsoft Hive ODBC driver
Create an Apache Hive ODBC data source linked to your cluster
Query sample information from your cluster using PowerShell
If you don’t have an Azure subscription, create a free account before you begin.
Prerequisites
Before you begin this tutorial, you must have the following items:
An Interactive Query cluster on HDInsight. To create one, see Get started with Azure HDInsight. Select Interactive Query as the cluster type.
The following steps show you how to create an Apache Hive ODBC data source.
From Windows, navigate to Start > Windows Administrative Tools > ODBC Data Sources (32-bit)/(64-bit). An ODBC Data Source Administrator window opens.
From the User DSN tab, select Add to open the Create New Data Source window.
Select Microsoft Hive ODBC Driver, and then select Finish to open the Microsoft Hive ODBC Driver DSN Setup window.
Type or select the following values:
Property
Description
Data Source Name
Give a name to your data source
Host(s)
Enter CLUSTERNAME.azurehdinsight.net. For example, myHDICluster.azurehdinsight.net
Port
Use 443.
Database
Use default.
Mechanism
Select Windows Azure HDInsight Service
User Name
Enter HDInsight cluster HTTP user username. The default username is admin.
Password
Enter HDInsight cluster user password. Select the checkbox Save Password (Encrypted).
Optional: Select Advanced Options.
Parameter
Description
Use Native Query
When it's selected, the ODBC driver does NOT try to convert TSQL into HiveQL. Use this option only if you're 100% sure that you're submitting pure HiveQL statements. When connecting to SQL Server or Azure SQL Database, you should leave it unchecked.
Rows fetched per block
When fetching a large number of records, tuning this parameter may be required to ensure optimal performances.
The data type lengths and precisions may affect how data is returned. They cause incorrect information to be returned because of loss of precision and truncation.
Select Test to test the data source. When the data source is configured correctly, the test result shows SUCCESS.
Select OK to close the Test window.
Select OK to close the Microsoft Hive ODBC Driver DSN Setup window.
Select OK to close the ODBC Data Source Administrator window.
Query data with PowerShell
The following PowerShell script is a function that ODBC to query a Hive cluster.
The following code snippet uses the function above to execute a query on the Interactive Query cluster that you created at the beginning of the tutorial. Replace DATASOURCENAME with the Data Source Name that you specified on the Microsoft Hive ODBC Driver DSN Setup screen. When prompted for credentials, enter the username and password that you entered under Cluster login username and Cluster login password when you created the cluster.
PowerShell
$dsn = "DATASOURCENAME"$query = "select count(distinct clientid) AS total_clients from hivesampletable"Get-ODBC-Data -query$query -dsn$dsn
Clean up resources
When no longer needed, delete the resource group, HDInsight cluster, and storage account. To do so, select the resource group where the cluster was created and click Delete.
Next steps
In this tutorial, you learned how to use the Microsoft Hive ODBC driver and PowerShell to retrieve data from your Azure HDInsight Interactive Query cluster.
Demonstrate understanding of common data engineering tasks to implement and manage data engineering workloads on Microsoft Azure, using a number of Azure services.