The HDInsight solution

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

The HDInsight solution for Scenario 4: BI integration is based on an HDInsight cluster, enabling Adventure Works to dynamically increase or reduce cluster resources as required.

Creating the Hive tables

To support the goal of summarizing the data, the BI developer intends to create a Hive table that can be queried from client applications such as Excel. However, the raw source data includes header rows that must be excluded from the analysis, and the space-delimited text format of the source data will not provide optimal query performance as the volume of data grows.

The developer therefore decided to create a temporary staging table, and to use it as the source for a query that loads the required data into a permanent table optimized for the typical analytical queries that will be used. The following HiveQL statement is used to define a staging table for the log data.

DROP TABLE log_staging;
CREATE TABLE log_staging
  (logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
   s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
   sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT, 
   cs_User_Agent STRING, cs_Referrer STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
  STORED AS TEXTFILE LOCATION '/data';

This Hive table defines a schema for the log file, making it possible to use a query that filters the rows in order to load the required data into a permanent table for analysis. Notice that the staging table is based on the /data folder but it is not defined as EXTERNAL, so dropping the staging table after the required rows have been loaded into the permanent table will delete the source files that are no longer required.
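
For comparison, a hypothetical EXTERNAL version of the same staging table (the name log_staging_ext is used here purely for illustration) would reference the same folder but would leave the source files in place when dropped, which is not the behavior required in this scenario.

-- Illustrative only: dropping an EXTERNAL table removes the metadata but leaves the files in /data intact.
CREATE EXTERNAL TABLE log_staging_ext
  (logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
   s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
   sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
   cs_User_Agent STRING, cs_Referrer STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
  STORED AS TEXTFILE LOCATION '/data';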

When designing the permanent table for analytical queries, the BI developer has decided to partition the data by year and month to improve query performance when extracting data. To achieve this, a second Hive statement is used to define a partitioned table—notice the PARTITIONED BY clause near the end of the following script. This instructs Hive to add two columns named year and month to the table, and to partition the data loaded into the table based on the values inserted into these columns.

DROP TABLE iis_log;
CREATE TABLE iis_log
  (logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
   s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
   sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT, 
   cs_User_Agent STRING, cs_Referrer STRING)
  PARTITIONED BY (year INT, month INT)
  STORED AS SEQUENCEFILE;

The Hive scripts to create the tables are saved as text files in a local folder named scripts.

Note

Storing the data in SEQUENCEFILE format can improve performance. You might also consider using the ORC file format, which provides a highly efficient way to store Hive data and can improve performance when reading, writing, and processing data. See ORC File Format for more information.
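
As a rough sketch, assuming the cluster runs a Hive version (0.11 or later) in which ORC is available, only the STORED AS clause of the table definition shown below would need to change.

CREATE TABLE iis_log
  (logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
   s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
   sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
   cs_User_Agent STRING, cs_Referrer STRING)
  PARTITIONED BY (year INT, month INT)
  STORED AS ORC;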

Next, the following Hive script is created to load data from the log_staging table into the iis_log table. This script takes the values from the columns in the log_staging Hive table, calculates the values for the year and month of each row, and inserts these rows into the partitioned iis_log Hive table.

SET mapred.output.compression.type=BLOCK; 
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM log_staging s
INSERT INTO TABLE iis_log PARTITION (year, month)
  SELECT s.logdate, s.logtime, s.c_ip, s.cs_username, s.s_ip, s.s_port, s.cs_method,
         s.cs_uri_stem, s.cs_uri_query, s.sc_status, s.sc_bytes, s.cs_bytes,
         s.time_taken, s.cs_User_Agent, s.cs_Referrer,
         SUBSTR(s.logdate, 1, 4) year, SUBSTR(s.logdate, 6, 2) month
  WHERE SUBSTR(s.logdate, 1, 1) <> '#';

The source log data includes a number of header rows that are prefixed with the “#” character, which could cause errors or add unnecessary complexity when summarizing the data. To resolve this, the HiveQL statement shown above includes a WHERE clause that ignores rows starting with “#” so that they are not loaded into the permanent table.

To maximize performance, the script includes statements that specify that the output from the query should be compressed. The code also sets the dynamic partition mode to nonstrict, enabling rows to be dynamically inserted into the appropriate partitions based on the values of the partition columns.
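
To confirm the result of the load, a simple ad-hoc check such as the following (a suggestion, not part of the Adventure Works scripts) lists the partitions that were created dynamically and counts the rows in each.

SHOW PARTITIONS iis_log;

SELECT year, month, COUNT(*) AS row_count FROM iis_log
  GROUP BY year, month;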

Indexing the Hive table

To further improve performance, the BI developer at Adventure Works decided to index the iis_log table on the date because most queries will need to select and sort the results by date. The following HiveQL statement creates an index on the logdate column of the table.

CREATE INDEX idx_logdate ON TABLE iis_log (logdate)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

When data is added to the iis_log table the index can be updated using the following HiveQL statement.

ALTER INDEX idx_logdate ON iis_log REBUILD;

The scripts to load the iis_log table and build the index are also saved in the local scripts folder.

Note

Tests revealed that indexing the tables provided only a small improvement in performance of queries, and that building the index took longer than the time saved when running the query. However, the results depend on factors such as the volume of source data, and so you should experiment to see if indexing is a useful optimization technique in your scenario.

Loading data into the Hive tables

To upload the IIS log files to HDInsight, and load the Hive tables, the team at Adventure Works created the following PowerShell script:

# Azure subscription-specific variables.
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"

# Find the local folder where this script is stored.
$thisfolder = Split-Path -parent $MyInvocation.MyCommand.Definition 

# Upload the scripts.
$localfolder = "$thisfolder\scripts"
$destfolder = "scripts"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach ($file in $files)
{
  $fileName = "$localFolder\$file"   
  $blobName = "$destfolder/$file"   
  write-host "copying $fileName to $blobName"   
  Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force 
} 
write-host "All files in $localFolder uploaded to $containerName!"

# Run scripts to create Hive tables.
write-host "Creating Hive tables..."
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/scripts/CreateTables.txt"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError

# Upload data to staging table.
$localfolder = "$thisfolder\iislogs"
$destfolder = "data"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach ($file in $files)
{
  $fileName = "$localFolder\$file"   
  $blobName = "$destfolder/$file"   
  write-host "copying $fileName to $blobName"   
  Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"

# Run scripts to load Hive tables.
write-host "Loading Hive table..."
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/scripts/LoadTables.txt"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError

# All done!
write-host "Finished!"

This script performs the following actions:

  • It uploads the contents of the local scripts folder to the /scripts folder in HDInsight.
  • It runs the CreateTables.txt Hive script to drop and recreate the log_staging and iis_log tables (any previously uploaded data will be deleted because both are internal tables).
  • It uploads the contents of the local iislogs folder to the /data folder in HDInsight (thereby loading the source data into the staging table).
  • It runs the LoadTables.txt Hive script to load the data from the log_staging table into the iis_log table and create an index (note that the text data in the staging table is implicitly converted to SEQUENCEFILE format as it is inserted into the iis_log table).
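
After replacing the three placeholder values at the beginning of the script, it can be run from an Azure PowerShell session that is connected to the subscription. The folder and file names in the following example are hypothetical; substitute the names used when the script and data were saved locally.

# Connect the session to the Azure subscription (a publish settings file can be used instead).
Add-AzureAccount

# Run the script from the folder that contains it, along with the scripts and iislogs subfolders.
cd C:\AdventureWorks\BigData
.\UploadAndLoadLogs.ps1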

Returning the required data

After running the PowerShell script that uploads the data to Azure blob storage and inserts it into the Hive tables, the data is now ready for use in the corporate BI system—and it can be consumed by querying the iis_log table. For example, the following script returns all of the data for all years and months.

SELECT * FROM iis_log

Where a more restricted dataset is required, the BI developer or business user can use a script that selects on the year and month columns, and transforms the data as required. For example, the following script extracts just the data for the first quarter of 2012, aggregates the number of hits for each day (the logdate column contains the date in the form yyyy-mm-dd), and returns a dataset with two columns: the date and the total number of page hits.

SELECT logdate, COUNT(*) pagehits FROM iis_log
  WHERE year = 2012 AND month < 4
  GROUP BY logdate
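
Because the results are typically viewed in date order (the reason the developer chose to index the logdate column), the same query can include an ORDER BY clause so that the returned dataset is sorted by date.

SELECT logdate, COUNT(*) pagehits FROM iis_log
  WHERE year = 2012 AND month < 4
  GROUP BY logdate
  ORDER BY logdate;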
