Retrieving job output files with Windows PowerShell

From: Developing big data solutions on Microsoft Azure HDInsight

Hive is the most commonly used Hadoop technology for big data processing in HDInsight. However, in some scenarios the data may be processed using a technology such as Pig or custom map/reduce code, which, unlike Hive, does not apply a queryable tabular schema to the output files. In these cases your custom PowerShell code must download the files generated by the HDInsight job and display their contents.

The Get-AzureStorageBlobContent cmdlet enables you to download an entire blob path from an Azure storage container, replicating the folder structure represented by the blob path on the local file system. To use the Get-AzureStorageBlobContent cmdlet you must first instantiate a storage context by using the New-AzureStorageContext cmdlet. This requires a valid storage key for your Azure storage account, which you can retrieve by using the Get-AzureStorageKey cmdlet.
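For example, the following snippet sketches that sequence in isolation, using placeholder account, container, and blob names for illustration:

# Retrieve the primary key for the storage account and create a storage context from it.
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

# Download a blob, recreating its folder structure under the current directory.
Get-AzureStorageBlobContent -Container $containerName -Blob "example/output/part-r-00000" -Context $blobContext -Destination (Get-Location).Path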

The Set-AzureStorageBlobContent cmdlet is used to copy local files to an Azure storage container. The Set-AzureStorageBlobContent and Get-AzureStorageBlobContent cmdlets are often used together when working with HDInsight to upload source data and scripts to Azure before initiating a data processing job, and then to download the output of the job.

As an example, the following PowerShell code uses the Set-AzureStorageBlobContent cmdlet to upload a Pig Latin script named SummarizeWeather.pig, which is then invoked using the New-AzureHDInsightPigJobDefinition and Start-AzureHDInsightJob cmdlets. The output file generated by the job is downloaded using the Get-AzureStorageBlobContent cmdlet, and its contents are displayed using the cat command.

$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"

# Find the folder where this script is saved.
$localFolder = Split-Path -Parent $MyInvocation.MyCommand.Definition

$destfolder = "weather/scripts"
$scriptFile = "SummarizeWeather.pig"
$outputFolder = "weather/output"
$outputFile = "part-r-00000"

# Upload Pig Latin script.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$blobName = "$destfolder/$scriptFile"
$filename = "$localfolder\$scriptFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force
write-host "$scriptFile uploaded to $containerName!"

# Run the Pig Latin script.
$jobDef = New-AzureHDInsightPigJobDefinition -File "wasb:///$destfolder/$scriptFile"
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Pig job submitted..."
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError

# Get the job output.
$remoteBlob = "$outputFolder/$outputFile"
Write-Host "Downloading $remoteBlob..."
Get-AzureStorageBlobContent -Container $containerName -Blob $remoteBlob -Context $blobContext -Destination $localFolder
cat $localFolder\$outputFolder\$outputFile

The SummarizeWeather.pig script in this example generates the average wind speed and the maximum temperature for each date in the source data, and stores the results in the /weather/output folder, as shown in the following code example.

Weather = LOAD '/weather/data' USING PigStorage(',') AS (obs_date:chararray, obs_time:chararray, weekday:chararray, windspeed:float, temp:float);
GroupedWeather = GROUP Weather BY obs_date;
AggWeather = FOREACH GroupedWeather GENERATE group, AVG(Weather.windspeed) AS avg_windspeed, MAX(Weather.temp) AS high_temp;
DailyWeather = FOREACH AggWeather GENERATE FLATTEN(group) AS obs_date, avg_windspeed, high_temp;
SortedWeather = ORDER DailyWeather BY obs_date ASC;
STORE SortedWeather INTO '/weather/output';

Figure 1 shows how the results of this script are displayed in the Windows PowerShell ISE.

Figure 1 - Using the Get-AzureStorageBlobContent cmdlet in the Windows PowerShell ISE

Note that the script must include the name of the output file to be downloaded. Pig jobs that include a reduce phase typically generate files named in the format part-r-0000x, map-only map/reduce operations create files named part-m-0000x, and Hive jobs that insert data into new tables generate numeric filenames such as 000000_0. In most cases you will need to determine the specific filename(s) generated by your data processing job before writing PowerShell code to download the output.
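
If you are unsure which files a job produced, you can enumerate the blobs under the output path before downloading anything. The following snippet reuses the $containerName, $blobContext, and $outputFolder variables from the script above:

# List the blobs generated by the job to discover the output filename(s).
Get-AzureStorageBlob -Container $containerName -Context $blobContext -Prefix "$outputFolder/" | Select-Object Name, Length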

The contents of downloaded files can be displayed in the console using the cat command, as in the example above, or you could open a file containing delimited text results in Excel.
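
Because the Pig script stores its results with the default PigStorage serializer, the output is tab-delimited text, so you could also parse the downloaded file into PowerShell objects for further processing. The following sketch assumes the three-column output produced by the SummarizeWeather.pig script shown above:

# Parse the tab-delimited output into objects with named columns.
Import-Csv -Path "$localFolder\$outputFolder\$outputFile" -Delimiter "`t" -Header obs_date, avg_windspeed, high_temp | Format-Table -AutoSize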
