Designing end-to-end solutions
From: Developing big data solutions on Microsoft Azure HDInsight
Automation enables you to avoid some or all of the manual operations required to perform your specific big data processing tasks. Unless you are simply experimenting with some data you will probably want to create a completely automated end-to-end solution. For example, you may want to make a solution repeatable without requiring manual interaction every time, perhaps incorporate a workflow, and even execute the entire solution automatically on a schedule. HDInsight supports a range of technologies and techniques to help you achieve this, several of which are used in the example scenario you’ve already seen in this guide.
You can think of an end-to-end big data solution as being a process that encompasses multiple discrete sub-processes. Throughout this guide you have seen how to automate these individual sub-processes using a range of tools such as Windows PowerShell, the .NET SDK for HDInsight, SQL Server Integration Services, Oozie, and command line tools.
A typical big data process might consist of the following sub-processes (a PowerShell outline that combines them follows the list):
- Data ingestion: Source data is loaded into Azure storage, ready for processing. For details of how you can automate individual tasks for data ingestion see Custom data upload clients in the section Collecting and loading data into HDInsight.
- Cluster provisioning: When the data is ready to be processed, a cluster is provisioned. For details of how you can automate cluster provisioning see Custom cluster management clients in the section Collecting and loading data into HDInsight.
- Job submission and management: One or more jobs are executed on the cluster to process the data and generate the required output. For details of how you can automate individual tasks for submitting and managing jobs see Building custom clients in the section Processing, querying, and transforming data using HDInsight.
- Data consumption: The job output is retrieved from HDInsight, either directly by a client application or through data transfer to a permanent data store. For details of how you can automate data consumption tasks see Building custom clients in the section Consuming and visualizing data from HDInsight.
- Cluster deletion: The cluster is deleted when it is no longer required to process data or service Hive queries. For details of how you can delete a cluster see Custom cluster management clients in the section Collecting and loading data into HDInsight.
- Data visualization: The retrieved results are visualized and analyzed, or used in a business application. For details of tools for visualizing and analyzing the results see the section Consuming and visualizing data from HDInsight.
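To show how these sub-processes fit together, the following Windows PowerShell outline strings them into a single script using the classic Azure PowerShell cmdlets discussed in the sections referenced above. It is only a minimal sketch, not a complete solution: the storage account, container, cluster name, file paths, credentials, and Hive query are placeholder values, and the data visualization stage is omitted because it typically takes place in a client tool rather than in the automation script.

```powershell
# Minimal sketch of the end-to-end sub-processes. All names and paths are placeholders.
$storageAccount = "mystorageaccount"      # existing storage account (placeholder)
$container      = "mycontainer"           # blob container used as the cluster's default store
$clusterName    = "myhdinsightcluster"    # placeholder cluster name
$location       = "North Europe"
$credential     = Get-Credential          # admin credentials for the new cluster

# 1. Data ingestion: upload the source data to Azure blob storage.
$storageKey = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$context    = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
Set-AzureStorageBlobContent -File "C:\Data\source.txt" -Container $container `
                            -Blob "data/source.txt" -Context $context

# 2. Cluster provisioning: create the cluster when the data is ready to be processed.
New-AzureHDInsightCluster -Name $clusterName -Location $location `
                          -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
                          -DefaultStorageAccountKey $storageKey `
                          -DefaultStorageContainerName $container `
                          -ClusterSizeInNodes 4 -Credential $credential

# 3. Job submission and management: run a Hive job and wait for it to complete.
$hiveJob = New-AzureHDInsightHiveJobDefinition -Query "SELECT * FROM mytable LIMIT 10;"
$job     = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

# 4. Data consumption: retrieve the job output, or transfer the results to a permanent store.
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput

# 5. Cluster deletion: remove the cluster once it is no longer required.
Remove-AzureHDInsightCluster -Name $clusterName
```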
However, before you begin to design an automated solution it is sensible to identify the dependencies and constraints in your specific data processing scenario, and to consider the requirements for each stage in the overall solution. For example, you must consider how to coordinate the automation of these operations as a whole, as well as how to schedule each discrete task.
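For example, if the entire solution is wrapped in a single top-level PowerShell script, one simple way to execute it automatically on a schedule is to register it with the Windows Task Scheduler, as sketched below. The task name, script path, and start time are placeholder values; depending on your environment, a scheduler such as SQL Server Agent or an Oozie coordinator may be more appropriate.

```powershell
# Sketch: register a scheduled task that runs the top-level automation script daily at 02:00.
# The task name and script path are placeholders.
schtasks /Create /TN "HDInsightEndToEnd" /SC DAILY /ST 02:00 `
         /TR "powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\RunEndToEnd.ps1"
```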
This section includes the following topics related to designing automated end-to-end solutions:
- Workflow dependencies and constraints
- Task design and context
- Coordinating solutions and tasks
- Scheduling solution and task execution
- Security
Considerations
Consider the following points when designing and implementing end-to-end solutions around HDInsight:
- Analyze the requirements for the solution before you start to implement automation. Consider factors such as how the data will be collected, the rate at which it arrives, the timeliness of the results, the need for quick access to aggregated results, and the consequent impact of the speed of processing each batch. All of these factors will influence the processes and technologies you choose, the batch size for each process, and the overall scheduling for the solution.
- Automating a solution can help to minimize errors for tasks that are repeated regularly. By setting permissions on the client-side applications that initiate jobs and access the data, you can also ensure that only authorized users can execute them. Automation is likely to be necessary for all types of solutions except those where you are just experimenting with data and processes.
- The individual tasks in your solutions will have specific dependencies and constraints that you must accommodate to achieve the best overall data processing workflow. Typically these dependencies are time-based and affect how you orchestrate and schedule the tasks and processes. Not only must they execute in the correct order, but you may also need to ensure that specific tasks have completed before the next one begins. See Workflow dependencies and constraints for more information.
- Consider if you need to automate the creation of storage accounts to hold the cluster data, and decide when this should occur. HDInsight can automatically create one or more linked storage accounts for the data as part of the cluster provisioning process. Alternatively, you can automate the creation of linked storage accounts before you create a cluster, and non-linked storage accounts before or after you create a cluster. For example, you might automate creating a new storage account, loading the data, creating a cluster that uses the new storage account, and then executing a job (see the sketch after this list). For more information about linked and non-linked storage accounts see Cluster and storage initialization in the section Collecting and loading data into HDInsight.
- Consider the end-to-end security of your solution. You must protect the data from unauthorized access and tampering when it is in storage and on the wire, and secure the cluster as a whole to prevent unauthorized access. See Security for more details.
- As with any complex multi-step solution, it is important to make monitoring and troubleshooting as easy as possible by maintaining detailed logs of the individual stages of the overall process. This typically requires comprehensive exception handling as well as planning how to log the information. See Monitoring and logging for more information.
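As an illustration of the storage account consideration above, the following sketch (again using the classic Azure PowerShell cmdlets, with placeholder names and values) creates a new storage account and container, and then provisions a cluster that uses the new account as its default store while also linking an existing account that already holds data.

```powershell
# Sketch: create a new storage account, then provision a cluster that uses it as the default
# store and also links an existing account. All names and values are placeholders.
$newAccount      = "mynewstorageaccount"
$existingAccount = "myexistingaccount"
$container       = "hdinsightdata"
$clusterName     = "myhdinsightcluster"
$location        = "North Europe"
$credential      = Get-Credential

# Create the new storage account and a container to act as the cluster's default store.
New-AzureStorageAccount -StorageAccountName $newAccount -Location $location
$newKey     = (Get-AzureStorageKey -StorageAccountName $newAccount).Primary
$newContext = New-AzureStorageContext -StorageAccountName $newAccount -StorageAccountKey $newKey
New-AzureStorageContainer -Name $container -Context $newContext

# Build a cluster configuration that links both storage accounts, then provision the cluster.
$existingKey = (Get-AzureStorageKey -StorageAccountName $existingAccount).Primary
New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "$newAccount.blob.core.windows.net" `
                                     -StorageAccountKey $newKey `
                                     -StorageContainerName $container |
    Add-AzureHDInsightStorage -StorageAccountName "$existingAccount.blob.core.windows.net" `
                              -StorageAccountKey $existingKey |
    New-AzureHDInsightCluster -Name $clusterName -Location $location -Credential $credential
```

Creating the storage accounts separately in this way means the data can outlive the cluster, so the cluster can be deleted after each processing run without losing the source data or the results.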