Workflow and job orchestration

From: Developing big data solutions on Microsoft Azure HDInsight

Many data processing solutions require the coordinated processing of multiple jobs, often with a conditional workflow. For example, consider a solution in which data files containing web server log entries are uploaded each day, and must be parsed to load the data they contain into a Hive table. A workflow to accomplish this might consist of the following steps:

  1. Insert data from the files located in the /uploads folder into the Hive table.
  2. Delete the source files, which are no longer required.

This workflow is relatively simple, but could become more complex when other required tasks are added. For example:

  1. If there are no files in the /uploads folder, go to step 5.
  2. Insert data from the files into the Hive table.
  3. Delete the source files, which are no longer required.
  4. Send an email message to an operator indicating success, and stop.
  5. Send an email message to an operator indicating failure.
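
The five steps above can be sketched as a small function. This is only an illustrative sketch in Python, not the approach used in the guide: the names `run_upload_workflow`, `WorkflowResult`, `list_files`, `load_into_hive`, `delete_files`, and `notify_operator` are hypothetical, and the four callables stand in for whatever mechanism you actually use to list the files, run the Hive job, remove the files, and send email.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkflowResult:
    """Outcome of one workflow run (hypothetical helper type)."""
    succeeded: bool
    files_processed: List[str] = field(default_factory=list)


def run_upload_workflow(
    list_files: Callable[[], List[str]],
    load_into_hive: Callable[[List[str]], None],
    delete_files: Callable[[List[str]], None],
    notify_operator: Callable[[str], None],
) -> WorkflowResult:
    """Run the conditional workflow; all four callables are assumptions
    supplied by the caller, not part of any HDInsight API."""
    # Step 1: if there are no files in the uploads folder, go to step 5.
    files = list_files()
    if not files:
        # Step 5: tell the operator that the run failed.
        notify_operator("failure: no files found in /uploads")
        return WorkflowResult(succeeded=False)

    # Step 2: insert the data from the files into the Hive table.
    load_into_hive(files)
    # Step 3: delete the source files, which are no longer required.
    delete_files(files)
    # Step 4: tell the operator that the run succeeded, and stop.
    notify_operator(f"success: loaded {len(files)} file(s)")
    return WorkflowResult(succeeded=True, files_processed=list(files))


if __name__ == "__main__":
    # Stub implementations so the sketch can be run without a cluster.
    result = run_upload_workflow(
        list_files=lambda: ["/uploads/log1.txt", "/uploads/log2.txt"],
        load_into_hive=lambda files: None,
        delete_files=lambda files: None,
        notify_operator=print,
    )
    print(result.succeeded)
```

Passing the steps in as callables keeps the branching logic separate from the cluster-specific operations. The sketch follows the listed steps literally, so it does not handle a load or delete step that fails partway through.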

You can implement these kinds of workflows in several ways. For example, you could:

  • Use the Oozie framework that is installed with HDInsight, and execute the workflow by using PowerShell or the Oozie client in the HDInsight .NET SDK. This is a good option when:

    • You are familiar with the syntax and usage of Oozie.
    • You want to execute workflows from within a program running on a client computer.
    • You are familiar with .NET and prepared to write programs that use the .NET Framework.

    Note

    For more information, see Use Oozie with HDInsight. If you are not familiar with Oozie, see Creating workflows with Oozie for an overview of how it can be used. The topic Scenario 3: ETL automation also demonstrates the use of an Oozie workflow.

  • Use SQL Server Integration Services (SSIS) or a similar integration framework. This is a good option when:

    • You have SQL Server installed and are experienced with writing SSIS workflows.
    • You want to take advantage of the powerful capabilities of SSIS workflows.

    Note

    The process for creating SSIS workflows is described in more detail in Scenario 4: BI integration.

  • Use the Cascading abstraction layer software. This is a good option when:

    • You want to execute complex data processing workflows written in any language that runs on the Java virtual machine.
    • You have complex multi-level workflows that you need to combine into a single task.
    • You want to control the execution of the map and reduce phases of jobs directly in code.

  • Create a custom application or script that executes the tasks as a workflow. This is a good option when:

    • You need a fairly simple workflow that can be expressed using your chosen programming or scripting language.
    • You want to run scripts on a schedule, perhaps driven by Windows Scheduled Tasks.
    • You are prepared to use a Remote Desktop connection to communicate with the cluster to administer the processes.

Third-party workflow frameworks such as Hamake and Azkaban are also available. They are a good option when you are already familiar with them, or when they offer a capability you need that is not available in the other tools. However, they are not currently supported on HDInsight.

More information

Oozie workflows can be executed using the Oozie time-based coordinator, or by using the classes in the HDInsight SDK. The topic Use Oozie with HDInsight on the Azure website describes how you can use Oozie, and the topic Use time-based Oozie Coordinator with HDInsight extends this to show time-based coordination of a workflow.

For information about automating an entire solution, see Building end-to-end solutions using HDInsight.
