Creating workflows with Oozie
From: Developing big data solutions on Microsoft Azure HDInsight
Oozie is the most commonly used mechanism for workflow development in Hadoop. It is a tool that enables you to create repeatable, dynamic workflows for tasks to be performed in an HDInsight cluster. The tasks themselves are specified in a control dependency directed acyclic graph (DAG), which is stored in a Hadoop Process Definition Language (hPDL) file named workflow.xml in a folder in the HDFS file system on the cluster. For example, the following hPDL document contains a DAG in which a single step (or action) is defined.
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>default</value>
        </property>
        <property>
          <name>oozie.hive.defaults</name>
          <value>hive-default.xml</value>
        </property>
      </configuration>
      <script>script.q</script>
      <param>INPUT_TABLE=HiveSampleTable</param>
      <param>OUTPUT=/results/sampledata</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
The action itself is a Hive job defined in a HiveQL script file named script.q, with two parameters named INPUT_TABLE and OUTPUT. The code in script.q is shown in the following example.
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM ${INPUT_TABLE}
The script file is stored in the same folder as the workflow.xml hPDL file, along with a standard configuration file for Hive jobs named hive-default.xml.
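To make the control flow in the hPDL document easier to see, the following Python sketch parses an abbreviated copy of the workflow and lists its transitions. It is only an illustration and is not part of Oozie: the function name workflow_transitions is hypothetical, and the Hive configuration elements are omitted from the embedded XML for brevity.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the workflow.xml shown earlier (Hive configuration omitted).
WORKFLOW_XML = """
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <script>script.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
"""

# Namespace prefix used by ElementTree for elements of the workflow schema.
WF_NS = "{uri:oozie:workflow:0.2}"


def workflow_transitions(xml_text):
    """Return (source, outcome, target) tuples for the start node and each action.

    Hypothetical helper for illustration; it only handles start and action nodes.
    """
    root = ET.fromstring(xml_text)
    edges = []
    start = root.find(WF_NS + "start")
    if start is not None:
        edges.append(("start", "to", start.get("to")))
    for action in root.findall(WF_NS + "action"):
        for outcome in ("ok", "error"):
            node = action.find(WF_NS + outcome)
            if node is not None:
                edges.append((action.get("name"), outcome, node.get("to")))
    return edges


if __name__ == "__main__":
    for source, outcome, target in workflow_transitions(WORKFLOW_XML):
        print(f"{source} --{outcome}--> {target}")
```

For this workflow the sketch reports three transitions: the start node leads to hive-node, a successful hive-node leads to end, and a failed hive-node leads to the kill node named fail.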
A configuration file named job.properties is stored on the local file system of the computer on which the Oozie client tools are installed. This file, shown in the following example, contains the settings that will be used to execute the job.
nameNode=wasb://my_container@my_asv_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/workflowfiles/
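The workflow refers to settings such as ${jobTracker} and ${nameNode}, and these values come from job.properties. The following Python sketch illustrates that relationship by reading the properties and substituting them into a fragment of the workflow. It is a simplified illustration under stated assumptions, not Oozie's own expression handling: the helpers parse_properties and substitute are hypothetical, and function-style expressions such as ${wf:errorMessage(...)} are deliberately left untouched.

```python
import re


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def substitute(template, props):
    """Replace ${name} with props[name]; leave unknown names and
    function-style expressions exactly as they appear."""
    def repl(match):
        return props.get(match.group(1), match.group(0))
    return re.sub(r"\$\{([A-Za-z_][\w.]*)\}", repl, template)


# Same content as the job.properties file shown above.
JOB_PROPERTIES = """\
nameNode=wasb://my_container@my_asv_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/workflowfiles/
"""

if __name__ == "__main__":
    props = parse_properties(JOB_PROPERTIES)
    fragment = (
        "<job-tracker>${jobTracker}</job-tracker>\n"
        "<name-node>${nameNode}</name-node>"
    )
    print(substitute(fragment, props))
```

Names that are not defined in job.properties, such as the INPUT_TABLE and OUTPUT parameters of the Hive script, are not replaced by this sketch because they are supplied by the param elements of the action instead.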
To initiate the workflow, the following command is executed on the computer where the Oozie client tools are installed.
oozie job -oozie https://localhost:11000/oozie/ -config c:\scripts\job.properties -run
When Oozie starts the workflow, it returns a job ID in the format 0000001-123456789123456-oozie-hdp-W. You can check the status of a job by opening a Remote Desktop connection to the cluster and using a web browser to navigate to https://localhost:11000/oozie/v0/job/job-id?show=log, where job-id is the ID that Oozie returned.
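If you check jobs often, you might build the status URL from the job ID rather than typing it by hand. The following Python sketch shows one way to do this. The helper name job_url is hypothetical, and the ID pattern is an assumption inferred only from the sample ID above (seven digits, fifteen digits, "oozie", a system ID, and "W"), so adjust it if your cluster produces IDs in a different shape.

```python
import re

# Assumed ID shape, inferred from the sample 0000001-123456789123456-oozie-hdp-W.
JOB_ID_PATTERN = re.compile(r"^\d{7}-\d{15}-oozie-[A-Za-z0-9]+-W$")


def job_url(job_id, base_url="https://localhost:11000/oozie", show="log"):
    """Build the URL described in the text for a given workflow job ID.

    Raises ValueError if the ID does not match the assumed pattern.
    """
    if not JOB_ID_PATTERN.match(job_id):
        raise ValueError(f"unexpected job ID format: {job_id!r}")
    return f"{base_url.rstrip('/')}/v0/job/{job_id}?show={show}"


if __name__ == "__main__":
    print(job_url("0000001-123456789123456-oozie-hdp-W"))
```

The sketch only constructs the URL. Opening it still requires a session on the cluster, as described above, because the address refers to localhost.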
You can also initiate an Oozie job by using Windows PowerShell or the .NET SDK for HDInsight. For more details, see Initiating an Oozie workflow with PowerShell and Initiating an Oozie workflow from a .NET application.
Note
For more information about Oozie see Apache Oozie Workflow Scheduler for Hadoop. An example of using Oozie can be found in Scenario 3: ETL automation.