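Submitting an HDInsight Pig Job
Pig is a commonly used component of Hadoop, and Azure lets you submit Pig jobs through PowerShell. When the Pig Latin script is short, you can pass its content directly with the -Query parameter of New-AzureHDInsightPigJobDefinition, as in the following example: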
$clusterName = "HDIDemo"
$QueryString = "intxt1 = load 'wasb://hdirawdata@teststorage.blob.core.chinacloudapi.cn/userbehavior.log' ;" +
"store intxt1 into 'wasb:///home/mytest1' ;"
$pigJobDefinition = New-AzureHDInsightPigJobDefinition -Query $QueryString
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $pigJobDefinition
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
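# The job stores its results in wasb:///home/mytest1, so print the Pig log from standard error.
Write-Host "Display the standard error output ..." -ForegroundColor Green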
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError
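When the Pig Latin script is too long, the job fails with the error "The input line is too long", because the batch submitted in a single call is too long. In that case, save the script to a file and have the job run that file instead. The steps are as follows: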
1. Save the following Pig Latin script in a file with the .pig extension (PigLatinTest.pig in this example), and upload the file to Azure Blob storage.
intxt1 = load 'wasb://hdirawdata@teststorage.blob.core.chinacloudapi.cn/userbehavior.log';
store intxt1 into 'wasb:///home/mytest1';
2. Use the following commands to call the Pig Latin script and run the Pig job:
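$clusterName = "HDIDemo"
# Hypothetical status folder for the job's status files; replace it with your own path.
$statusFolder = "/pigjob/status"
$pigJobDefinition = New-AzureHDInsightPigJobDefinition -File "wasb://hdirawdata@teststorage.blob.core.chinacloudapi.cn/PigLatinTest.pig" -StatusFolder $statusFolder -Verbose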
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $pigJobDefinition
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
# Print the Pig log (standard error) of the Pig job.
Write-Host "Display the standard error output ..." -ForegroundColor Green
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError
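Step 1 requires the .pig file to be in Azure Blob storage before the job is submitted, but the upload itself is not shown. The following is a minimal sketch of one way to do it with the classic Azure PowerShell storage cmdlets. It assumes the storage account and container named in the -File URI (teststorage and hdirawdata), an Azure China subscription already selected in the session, and a hypothetical local path for the script; adjust all of these to your environment.

```powershell
# Assumed values; replace them with your own.
$storageAccountName = "teststorage"                  # storage account taken from the wasb:// URI
$containerName      = "hdirawdata"                   # container taken from the wasb:// URI
$localScript        = "C:\Scripts\PigLatinTest.pig"  # hypothetical local path of the script
$blobName           = "PigLatinTest.pig"             # blob name that the -File URI should point to

# Retrieve the primary key of the storage account (classic Service Management cmdlet).
$storageKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary

# Build a storage context that targets the Azure China cloud endpoints.
$context = New-AzureStorageContext -StorageAccountName $storageAccountName `
                                   -StorageAccountKey $storageKey `
                                   -Environment AzureChinaCloud

# Upload the script; -Force overwrites a blob with the same name if one exists.
Set-AzureStorageBlobContent -File $localScript `
                            -Container $containerName `
                            -Blob $blobName `
                            -Context $context `
                            -Force
```

After the upload, the blob name used here has to match the file name at the end of the wasb:// URI passed to -File.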