Run MapReduce jobs with Apache Hadoop on HDInsight using REST
Learn how to use the Apache Hive WebHCat REST API to run MapReduce jobs on an Apache Hadoop on HDInsight cluster. Curl is used to demonstrate how you can interact with HDInsight by using raw HTTP requests to run MapReduce jobs.
When you use Curl or any other REST communication with WebHCat, you must authenticate the requests by providing the HDInsight cluster administrator user name and password. You must use the cluster name as part of the URI that is used to send the requests to the server.
The REST API is secured by using basic access authentication. You should always make requests by using HTTPS to ensure that your credentials are securely sent to the server.
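As a rough illustration of what basic access authentication over HTTPS involves, the following Python sketch builds the base URI from the cluster name and the `Authorization` header from a user name and password. The helper names `webhcat_base_url` and `basic_auth_header` and the sample values are hypothetical, chosen only for this example; the host pattern and the `/templeton/v1` path are taken from the PowerShell request later in this article.

```python
import base64


def webhcat_base_url(cluster_name):
    """Return the HTTPS base URI for WebHCat on the named cluster."""
    # The cluster name is part of the host name. HTTPS keeps the
    # credentials encrypted while they travel to the server.
    return f"https://{cluster_name}.azurehdinsight.net/templeton/v1"


def basic_auth_header(user, password):
    """Build the header used by HTTP basic access authentication."""
    # Basic authentication sends base64("user:password"); this is an
    # encoding, not encryption, which is why HTTPS is required.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


if __name__ == "__main__":
    # "mycluster" and "secret" are placeholders, not real values.
    print(webhcat_base_url("mycluster"))
    print(basic_auth_header("admin", "secret"))
```

Because the header is only encoded, anyone who can read the request can recover the password, so the sketch never produces an `http://` URI.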
Curl
For ease of use, set the variables below. This example is based on a Windows environment; revise as needed for your environment.
Windows Command Prompt
set CLUSTERNAME=
set PASSWORD=
From a command line, use the following command to verify that you can connect to your HDInsight cluster:
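A connectivity check of this kind can be sketched in Python as follows. The sketch assumes that WebHCat exposes a status endpoint at `/templeton/v1/status` and that a healthy server answers with a JSON body whose `status` field is `ok`; both are assumptions about the service rather than details stated in this article, and the function names are hypothetical. The code only builds the URI and interprets a response body, so it can be tried without a cluster.

```python
import json


def status_url(cluster_name):
    """Return the URI assumed to report WebHCat server status."""
    return f"https://{cluster_name}.azurehdinsight.net/templeton/v1/status"


def is_webhcat_ok(response_body):
    """Return True when the response body reports a status of 'ok'."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        # An HTML error page or an empty body means the check failed.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


if __name__ == "__main__":
    # The body below is an assumed example, not captured output.
    print(status_url("mycluster"))
    print(is_webhcat_ok('{"status": "ok", "version": "v1"}'))
```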
To submit a MapReduce job, send a request to a URI that ends in /mapreduce/jar. This ending tells WebHCat that the request starts a MapReduce job from a class in a jar file. The parameters used in this command are as follows:
-d: Because -G isn't used, the request defaults to the POST method. -d specifies the data values that are sent with the request.
user.name: The user who is running the command
jar: The location of the jar file that contains the class to be run
class: The class that contains the MapReduce logic
arg: The arguments to be passed to the MapReduce job. In this case, the input text file and the directory that is used for the output
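The parameters above travel as form fields in the body of the POST request. The Python sketch below shows one way such a body could be assembled, with `arg` repeated once per argument. The function name `build_job_form` and every path in the example are hypothetical placeholders, not values from this article.

```python
from urllib.parse import urlencode, parse_qsl


def build_job_form(user, jar, main_class, args):
    """Encode the job parameters as an application/x-www-form-urlencoded body."""
    # A list of pairs (rather than a dict) lets "arg" appear several times
    # and keeps the arguments in the order they were given.
    fields = [("user.name", user), ("jar", jar), ("class", main_class)]
    fields.extend(("arg", value) for value in args)
    return urlencode(fields)


if __name__ == "__main__":
    # All values below are illustrative placeholders.
    body = build_job_form(
        user="admin",
        jar="/example/jars/my-job.jar",
        main_class="wordcount",
        args=["/example/data/input.txt", "/example/data/output"],
    )
    print(body)
```

The order of the `arg` fields matters here because the job receives them positionally: input first, output directory second.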
This command should return a job ID that can be used to check the status of the job: job_1415651640909_0026.
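Pulling the job ID out of the response can be sketched in Python as follows. The sketch assumes that the response is a JSON object with the ID in a field named `id`; that field name is an assumption about the response shape, and `extract_job_id` is a hypothetical helper.

```python
import json


def extract_job_id(response_body):
    """Return the job ID from a job submission response."""
    payload = json.loads(response_body)
    job_id = payload.get("id")
    if not job_id:
        # Fail loudly rather than polling for a job that was never created.
        raise ValueError("response did not contain a job ID")
    return job_id


if __name__ == "__main__":
    # The body is built around the sample job ID shown in this article.
    print(extract_job_id('{"id": "job_1415651640909_0026"}'))
```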
To check the status of the job, use the following command. Replace the value for JOBID with the actual value returned in the previous step. Revise the location of jq as needed.
For ease of use, set the variables below. Replace CLUSTERNAME with your actual cluster name. Execute the command and enter the cluster login password when prompted.
PowerShell
$clusterName = "CLUSTERNAME"
$creds = Get-Credential -UserName admin -Message "Enter the cluster login password"
Use the following command to verify that you can connect to your HDInsight cluster:
To submit a MapReduce job, send a request to a URI that ends in /mapreduce/jar. This ending tells WebHCat that the request starts a MapReduce job from a class in a jar file. The parameters used in this command are as follows:
user.name: The user who is running the command
jar: The location of the jar file that contains the class to be run
class: The class that contains the MapReduce logic
arg: The arguments to be passed to the MapReduce job. In this case, the input text file and the directory that is used for the output
This command should return a job ID that can be used to check the status of the job: job_1415651640909_0026.
To check the status of the job, use the following command:
PowerShell
$reqParams = @{"user.name"="admin"}
$resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/templeton/v1/jobs/$jobID" `
   -Credential $creds `
   -Body $reqParams `
   -UseBasicParsing
# ConvertFrom-JSON can't handle duplicate names with different case
# So change one to prevent the error
$fixDup = $resp.Content.Replace("jobID","job_ID")
(ConvertFrom-Json $fixDup).status.state
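The rename in the script above is needed only because the parser treats names that differ in case as duplicates. For comparison, the Python sketch below reads the same `status.state` value with a case-sensitive JSON parser, where no rename is needed. The sample body is a hand-written illustration of a response that contains both `jobID` and `jobId`; it is not captured output, and the real response may place those fields differently.

```python
import json

# Illustrative body only: two names that differ just by letter case.
SAMPLE_RESPONSE = (
    '{"status": {"state": "SUCCEEDED",'
    ' "jobID": {"id": 26},'
    ' "jobId": "job_1415651640909_0026"}}'
)


def job_state(response_body):
    """Return the value of status.state from a job status response."""
    # Python dictionaries are case-sensitive, so "jobID" and "jobId"
    # are stored as two separate keys without any conflict.
    return json.loads(response_body)["status"]["state"]


if __name__ == "__main__":
    print(job_state(SAMPLE_RESPONSE))
```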
Both methods
If the job is complete, the state returned is SUCCEEDED.
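Because a job takes time to finish, the status check is usually repeated until a final state is reported. The Python sketch below shows one possible polling loop. It takes the status lookup as a function argument so that it can be tried without a cluster; `wait_for_job`, the interval, the attempt limit, and the `FAILED` and `KILLED` states are assumptions added for this example, while `SUCCEEDED` comes from this article.

```python
import time

# SUCCEEDED is named in this article; FAILED and KILLED are assumed
# to be the other final states a job can report.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


def wait_for_job(fetch_state, interval_seconds=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch_state() until it returns a final state, then return it."""
    for _ in range(max_attempts):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Pause between checks so the server isn't flooded with requests.
        sleep(interval_seconds)
    raise TimeoutError(f"job did not finish after {max_attempts} checks")


if __name__ == "__main__":
    # Simulated sequence of states in place of real HTTP requests.
    states = iter(["PREP", "RUNNING", "SUCCEEDED"])
    print(wait_for_job(lambda: next(states), interval_seconds=0))
```

Passing the lookup in as `fetch_state` keeps the loop independent of how the request is made, so the same loop works with either of the methods described above.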
When the state of the job has changed to SUCCEEDED, you can retrieve the results of the job from Azure Blob storage. The statusdir parameter that is passed with the query contains the location of the output file. In this example, the location is /example/curl, so the output of the job is stored in the cluster's default storage at /example/curl.