

Analyzing Azure Table Storage data with HDInsight

HDInsight was optimized from the start to quickly analyze data in Azure's blob storage service with Hadoop, using the WASB file system to expose that data as a native Hadoop file system. But the spirit of Hadoop has always been to analyze data wherever it lives, so in this post I'll show you how to analyze data residing in the more structured Azure table storage service. This enables you to analyze tables with millions or billions of rows in them by using Hadoop/Hive to distribute that analysis across many compute nodes in Azure.

Prerequisites

To follow along in this post, you'll need to:

  1. Have a valid Azure subscription.
  2. Install Azure PowerShell and connect to your subscription - follow this article for details on how to do that.
  3. Have a storage account in your subscription that's selected as the default in your PowerShell session (see the sketch after this list for one way to do steps 2 and 3).
  4. Have git and Maven installed and in your PATH in your PowerShell session.
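
If you haven't done steps 2 and 3 before, a minimal sketch looks something like this (the subscription and storage account names are placeholders for your own):

# Sign in and pick the subscription to work against
Add-AzureAccount
Select-AzureSubscription -SubscriptionName "My Subscription"
# Make this storage account the default for the session so the storage cmdlets below don't need an explicit context
Set-AzureSubscription -SubscriptionName "My Subscription" `
  -CurrentStorageAccountName "mystorageaccount"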

Creating a table to analyze

You can skip this step if you already have a table you want to analyze (and hey, if you're reading this, chances are good that's the case). But just to get a quick-and-dirty example going, here's a simple PowerShell script to create an example table for us to analyze:

# Inserts a single entity with the given partition key, row key and integer value
function InsertRow($table, [String]$partitionKey, [String]$rowKey, [int]$intValue)
{
  $entity = New-Object "Microsoft.WindowsAzure.Storage.Table.DynamicTableEntity" $partitionKey, $rowKey
  $entity.Properties.Add("IntValue", $intValue)
  $insertResult = $table.CloudTable.Execute(`
    [Microsoft.WindowsAzure.Storage.Table.TableOperation]::Insert($entity))
}
# Create the example table and fill it with 10 partitions of 100 rows each
$exampleTable = New-AzureStorageTable tobehadooped
for ($p = 1; $p -le 10; $p++)
{
  for ($r = 1; $r -le 100; $r++)
  {
    InsertRow $exampleTable "P$p" "R$r" $r
  }
}
        
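As a quick sanity check that the rows landed, you can ask the storage cmdlets for the table again (we'll do the actual querying from Hive):

# The table we just filled (10 partitions x 100 rows) should now show up
Get-AzureStorageTable tobehadooped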

Setting up the cluster

Now that we have data we want to analyze, let's set up an HDInsight cluster to analyze it. In this tutorial I'll be using the extensions in my github project azure-tables-hadoop to access Azure tables from Hive, so we'll need to get it and package it:

Update (6/27/2014): Since writing this post, HDInsight and Hive have moved forward in ways that aren't backwards compatible. I've updated my git project so it now works with HDInsight 3.1 (version 0.0.3 of my jar). If you want the HDInsight 2.1-compatible version, please git checkout hdp1compat before the mvn package step below; and if you want the HDInsight 3.0-compatible version, please git checkout Hive12 instead.

 git clone https://github.com/mooso/azure-tables-hadoop.git
cd .\azure-tables-hadoop
mvn package
        
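Once the build succeeds, the JAR we need should show up under the target directory; a quick way to check (the exact version number in the file name depends on which tag you built):

# List the built JAR so you can confirm the file name used in the next step
Get-ChildItem .\target\*.jar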

After that, we need to create a cluster with this JAR accessible to Hive. To do that, we'll use the AdditionalLibraries option (documented here) when creating the cluster. So we'll need to create a container for this JAR, put it there, then create the cluster pointing to it:

 # Please customize the lines below to choose your own user name, password and cluster name
$creds = New-Object System.Management.Automation.PSCredential 'myUser',`
 ('Please ch00se a different password :)' | ConvertTo-SecureString -force -asplaintext)
$clusterName = "mycluster"

# Upload the freshly built JAR to its own container so the cluster can load it
$hiveLibsContainer = New-AzureStorageContainer hiveontableslib
$azureJar = ".\target\microsoft-hadoop-azure-0.0.1.jar"  # adjust the version to match the JAR you just built
$azureJarBlob = Set-AzureStorageBlobContent -Container $hiveLibsContainer.Name `
  -Blob $(Split-Path $azureJar -Leaf) -File $azureJar -Force
# Tell the Hive configuration where that container lives so the JAR ends up on Hive's classpath
$hiveConf = New-Object $('Microsoft.WindowsAzure.Management.' + `
  'HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration')
$hiveConf.AdditionalLibraries = New-Object $('Microsoft.WindowsAzure' + `
  '.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightDefaultStorageAccount')
$storageAccount = $(Get-AzureSubscription -Current).CurrentStorageAccountName
$hiveConf.AdditionalLibraries.StorageAccountName = "$storageAccount.blob.core.windows.net"
$hiveConf.AdditionalLibraries.StorageAccountKey = (Get-AzureStorageKey $storageAccount).Primary
$hiveConf.AdditionalLibraries.StorageContainerName = $hiveLibsContainer.Name
# Create a container to serve as the cluster's default file system, then create the cluster
$clusterContainer = New-AzureStorageContainer $clusterName
$clusterConf = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 1 |
  Set-AzureHDInsightDefaultStorage -StorageAccountName $storageAccount `
  -StorageAccountKey $(Get-AzureStorageKey $storageAccount).Primary -StorageContainerName $clusterContainer.Name
$clusterConf = $clusterConf | Add-AzureHDInsightConfigValues -Hive $hiveConf
$cluster = $clusterConf | New-AzureHDInsightCluster -Credential $creds `
  -Name $clusterName -Location $(Get-AzureStorageAccount $storageAccount).Location
        

If all goes well, this should take around 10 minutes, and then you'll have a newly minted single-node HDInsight cluster to play with.
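
If you'd like to double-check the cluster before moving on, a quick sanity check (not a required step) is to list it and make sure it shows up as running:

# Sanity check: the new cluster should be listed and in a running state
Get-AzureHDInsightCluster -Name $clusterName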

Exploring the data using Hive

The easiest way to explore this data is to use Hive with its SQL-like query language (you can also use Map-Reduce directly, or use Pig or other Hadoop ecosystem components, but I'll ignore those for this post). The main trick we use to expose Azure tables to Hive is a custom Storage Handler, so we use the STORED BY clause to specify that the table data resides in an Azure table. We then use the TBLPROPERTIES clause to pass along the information needed to reach that table: its name, the account URI, and the storage key.

 $hiveQuery = "CREATE EXTERNAL TABLE ExampleTable(IntValue int)
 STORED BY 'com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler'
 TBLPROPERTIES(
  ""azure.table.name""=""$($exampleTable.Name)"",
  ""azure.table.account.uri""=""https://$storageAccount.table.core.windows.net"",
  ""azure.table.storage.key""=""$((Get-AzureStorageKey $storageAccount).Primary)"");"
# Save the query locally, upload it to the cluster's container, then run it as a Hive job
Out-File -FilePath .\HiveCreateTable.q -InputObject $hiveQuery -Encoding ascii
$hiveQueryBlob = Set-AzureStorageBlobContent -File .\HiveCreateTable.q -Blob "queries/HiveCreateTable.q" `
  -Container $clusterContainer.Name -Force
$createTableJobDefinition = New-AzureHDInsightHiveJobDefinition -QueryFile /queries/HiveCreateTable.q
$job = Start-AzureHDInsightJob -JobDefinition $createTableJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
        

This should create the Hive table for us, so now we can finally explore it! For example, to get the average of the IntValue column:

 $avgJobDefinition = New-AzureHDInsightHiveJobDefinition -Query "SELECT AVG(IntValue) FROM ExampleTable"
$job = Start-AzureHDInsightJob -JobDefinition $avgJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
Get-AzureHDInsightJobOutput -JobId $job.JobId -Cluster $cluster.Name
        
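Any other Hive query against the table runs the same way; for example, here's the same pattern with a query that counts the rows and gets the minimum and maximum of IntValue in one pass:

$statsJobDefinition = New-AzureHDInsightHiveJobDefinition `
  -Query "SELECT COUNT(*), MIN(IntValue), MAX(IntValue) FROM ExampleTable"
$job = Start-AzureHDInsightJob -JobDefinition $statsJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
Get-AzureHDInsightJobOutput -JobId $job.JobId -Cluster $cluster.Name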

Behind the scenes

Let me air out the dirty laundry and explain how this all works. The main way I expose Azure tables data to Hadoop is by having an InputFormat defined for it. An InputFormat for Hadoop has two main responsibilities:

  1. Figure out how to split the input into chunks that can be processed separately by individual mappers (thus getting the distributed processing power of Hadoop).
  2. For each split, figure out how to read the data.

The second responsibility is, for our purposes, just grunt work: we use the excellent Azure Storage SDK to read the data and pass it on as WritableEntity objects (in Hive, the StorageHandler understands those and maps them to columns). The first one though is a bit trickier: there is no universally good way to partition data to get the most out of your cluster. Ideally, if you have 100 nodes (each with 4 mapper slots) processing your data, you want at least 400 splits among your mappers. You also want each mapper's data to fit in memory, so you don't want too many rows per mapper (and if a mapper runs for an hour and then fails, it has to redo all of that work). But you don't want each mapper to have too little to do either: then your job's run time will be dominated by the overhead of starting and stopping all those mappers. So in short: tuning this partitioning step is an art, and my defaults certainly won't be right for most real-world situations. By default, I query for all the partition keys in the table (one by one, so it's a slow process if there are many keys), then split the table by assigning each mapper one of those values. If you want to customize the partitioning for your data, you can use the configuration knob 'azure.table.partitioner.class' to plug in a custom partitioner that you provide.
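
To give a feel for what that customization looks like, here's a sketch of hooking up a custom partitioner when defining the Hive table. The com.example.MyCustomPartitioner class name is hypothetical, and I'm assuming here that the knob can be passed through TBLPROPERTIES; check the project's README on github for the authoritative way to wire it up:

# Hypothetical sketch: pass the 'azure.table.partitioner.class' knob in the table
# definition so the InputFormat splits the table using your own partitioner class.
$customSplitQuery = "CREATE EXTERNAL TABLE ExampleTableCustomSplit(IntValue int)
 STORED BY 'com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler'
 TBLPROPERTIES(
  ""azure.table.name""=""$($exampleTable.Name)"",
  ""azure.table.account.uri""=""https://$storageAccount.table.core.windows.net"",
  ""azure.table.storage.key""=""$((Get-AzureStorageKey $storageAccount).Primary)"",
  ""azure.table.partitioner.class""=""com.example.MyCustomPartitioner"");"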

Hope that helps! Please leave comments here and/or create issues/pull requests on the project in github if you have any suggestions/feedback.

Comments

  • Anonymous
    April 07, 2014
    Hi there, Interesting stuff. Haven't had a go at this yet (I will do soon) but in the meantime, do you know any way of INSERTing data into Azure table Storage using Hive? Regards JT

  • Anonymous
    April 07, 2014
    Hi Jamie - thanks for your interest. Hive extensibility does allow INSERT into externally stored data like this, but unfortunately it's still a TODO in the Azure tables code I wrote up so it won't work with that (and I don't personally know of others who implemented INSERT functionality). Hopefully I'll get around to implementing it soon, but if you feel up to it feel free to dig in, code it up and send a pull request. Mostafa

  • Anonymous
    April 07, 2014
    Cool, thank you Mostafa.

  • Anonymous
    April 21, 2014
    This is really an interesting article. Just wondering how HDInsight team adds the support of Azure Storage to HIVE. Is it also through Storage Handler?

  • Anonymous
    April 21, 2014
    @Justin: Thank you! Azure Blob Storage is exposed as a file system within Hadoop, so Hive can just use it like it uses any other Hadoop file system. See this article for details on how it works: azure.microsoft.com/.../hdinsight-use-blob-storage

  • Anonymous
    May 28, 2014
    I have been having some difficulty setting this up to query my Azure Table.  I keep getting: HiveException: Error in loading storage handler.com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler I set up a new HDInsight Cluster (2.2), got the GIT code and Packaged it.  It created the: microsoft-hadoop-azure-0.0.2.jar I then copied this JAR file everywhere in my shared blob storage for Hive.   I then remote to the Azure HDInsight Cluster server.  Copied the JAR file to the local file server in Azure and modified the hive-site.xml file to add the file path property reference.  (But I don't know how to restart hive to pick up these settings?) Any help in figuring out why Hive can't find the JAR? Thanks.

  • Anonymous
    June 07, 2014
    The comment has been removed

  • Anonymous
    June 25, 2014
    works well with hdinsight 3.0 but not working with hdinsight 3.1 JVM throw method not found on entitypropertity inspector

  • Anonymous
    June 27, 2014
    @Tom: this is probably because I've since updated the project to HDI 3.0 (and now to HDI 3.1, see my response to Huan below)... @Huan: thanks for pointing it out! It turns out there were some minor API breaking changes between Hive 0.12 (in HDI 3.0) and Hive 0.13 (in HDI 3.1). I've updated my code to comply with the latest changes, and created a tag in git, Hive12, for anyone who wants the HDI 3.0 version. I'll update the post to reflect all that...

  • Anonymous
    July 29, 2014
    Thank you so much, this worked great. I was using 3.0 HDINsight, and I didn't notice a tag or a branch for it like you mentioned, but I was able to get the code at a specific version and compile it and go

  • Anonymous
    July 29, 2014
    how hard would it be to add in support for having the paritionkey and rowkey in the external table as well? currently only properties are able to be in the external table. thanks

  • Anonymous
    August 01, 2014
    @rob: glad it worked for you. Adding the PartitionKey and RowKey was pretty easy so I added it (commit d57064e). Please sync to the newer version and try it out - or if you want it on the older HDInsight versions just diff that commit and apply it to the older version.

  • Anonymous
    August 11, 2014
    Nice article :) Just curious to know how much difference we see in terms of performance while reading data from blob storage(assuming its written in delimited compressed format) vs azure table storage? As in case of blob storage, we are reading a file system but with table storage handler, we are still reading 1000 records in one call and will be receiving complete data using continuation token. It may affect the data read performance?

  • Anonymous
    September 21, 2014
    Job fails when i calculate average. UI only shows following error msg. Description: An exception occurred while processing your request. Additionally, another exception occurred while executing the custom error page for the first exception. The request has been terminated.

  • Anonymous
    October 03, 2014
    This is amazing man. However, after creating the table I get: In Powershell: Object reference not set to an instance of an object. In Hive: Number of reduce tasks is set to 0 since there's no reduce operator java.util.NoSuchElementException I think the table might be empty. It seems to not have gotten the data from my table. I'm not sure though. Any thoughts? =Balanc3

  • Anonymous
    October 10, 2014
    I am getting an error where hive cannot recognize entity property other than partitionKey and rowKey. Is any one getting the same error?

  • Anonymous
    December 08, 2014
    Is there any way to run this connector to the local Azure storage emulator from the local HdInsight emulator?  Thank You!

  • Anonymous
    January 13, 2015
    The comment has been removed

  • Anonymous
    March 31, 2015
    Hive has the "PARTITIONED BY" clause that allows one to partition the information hive caches according to the values in one or more "columns", making queries much faster.  When I try this with your (very nice) code, my table ends up with no rows (it has over 14k rows without that clause).  Do you think using a custom InputFormat/RecordReader is incompatible with "PARTITIONED BY"?  Am I missing something?

  • Anonymous
    May 12, 2015
    When creating an external table can I not select Timestamp?

  • Anonymous
    May 19, 2015
    I created a HDInsight cluster (version 3.2) and included microsoft-hadoop-azure-0.0.5.jar as additional library. I then created an external Hive table pointing to Azure table storage. This worked fine. But when I issued a Select statement against the Hive table, I got the error: Exception in thread "main" java.lang.NoClassDefFoundError: com/microsoft/windowsazure/storage/StorageException Has anyone  come across this error?

  • Anonymous
    May 27, 2015
    I haven't been able to figure out how to create an external table with Timestamp as one of the columns. Can someone please share an example. Thanks.

  • Anonymous
    June 10, 2015
    I was able to create a table with timestamp by specifying timestamp as the column type: CREATE EXTERNAL TABLE TestTableHive(PartitionKey string, RowKey string, Timestamp timestamp)

  • Anonymous
    June 18, 2015
    is the com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler   installed by default on newly created  Hdinsight nodes ?  or this is something we have to get and install ?  

  • Anonymous
    July 10, 2015
    Hi, I've got an error too: ClassNotFoundException Class com.microsoft.hadoop.azure.hive.AzureEntitySerDe. I run the cluster in HD 3.1. Has anyone  come across this error?

  • Anonymous
    July 15, 2015
    I am using Hortonworks hdp 2.2.4 on linux with microsoft-hadoop-azure-0.0.5.jar. I can create the external table fine. But I got the following error when running select. Could you help? Exception in thread "main" java.lang.NoClassDefFoundError: com/microsoft/windowsazure/storage/StorageException        at java.lang.Class.getDeclaredConstructors0(Native Method)        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)        at java.lang.Class.getConstructor0(Class.java:2885)        at java.lang.Class.getDeclaredConstructor(Class.java:2058)        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)        at org.apache.hadoop.hive.ql.exec.FetchOperator.getInputFormatFromCache(FetchOperator.java:229)        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:440)        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:588)        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:561)        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1623)        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:2

  • Anonymous
    September 28, 2015
    With the new ARM commandlets above doesn't work anymore, any pointers on how to do it for ARM?

  • Anonymous
    October 14, 2015
    Looks like the syntax for Add-AzureHDInsightConfigValues has changed. It appears to take a hashtable now for the -HiveEnv or -HiveSite parameters (-Hive no longer exists). Cannot get the syntax correct for passing in the path to the .jar file now. Any help would be appreciated. Thanks.