Building end-to-end solutions using HDInsight

From: Developing big data solutions on Microsoft Azure HDInsight

Many scenarios for using a big data solution such as HDInsight will focus on exploring data, perhaps from newly discovered sources, and then iteratively refining the queries and transformations used to find insights within that data. After you discover questions that provide useful and valid information from the data, and determine the tasks that are required to accomplish this, you will probably want to explore how you can automate and manage the entire solution.

Alternatively, you may already have a definite plan for using HDInsight—perhaps as an ETL automation mechanism, as a data warehouse, or for integration with an existing BI system. In all of these scenarios, automation can help you to more easily execute repeated processes in a predictable way, and with a reduced chance of operator error.

Figure 1 shows the typical stages, and some of the tasks, in a big data solution. You may decide to automate all of them or only selected parts.

Figure 1 - The typical tasks in an end-to-end big data solution

The automation and orchestration of these tasks must be planned carefully to create an overall solution that performs efficiently and can be easily integrated into business practices. The more complex your big data processing requirements, the more important it is to plan the coordination of all the “moving parts” in the solution to achieve the required results in as efficient and error-free a way as possible.

This section of the guide focuses on building end-to-end solutions that minimize the need for operator or administrator intervention, maximize the security of the process and the data, and provide sufficient information to be able to monitor solutions. This section is divided into two distinct topic areas:

  • Designing end-to-end solutions. This includes planning the solution to meet the requirements of dependencies, constraints, and consistency; protecting the application, the data, and the cluster; and implementing scheduling for the overall process and the individual tasks.
  • Monitoring and logging. This includes monitoring the cluster itself and the individual tasks, auditing operations, and accessing log files.
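
As a rough illustration of these two topic areas working together, the following Python sketch runs a set of tasks in dependency order and logs the start and end of each one. It is not an HDInsight API: the `run_pipeline` function and the task names used with it are hypothetical, and each task stands in for whatever real work (uploading data, running a query, exporting results) your solution performs. It assumes Python 3.9 or later for the standard `graphlib` module.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def run_pipeline(tasks, dependencies):
    """Run every task once, after the tasks it depends on.

    tasks:        dict mapping a task name to a zero-argument callable
                  (hypothetical stand-ins for real upload/query/export steps)
    dependencies: dict mapping a task name to the set of task names
                  that must finish before it starts
    Returns the list of completed task names, in execution order.
    Stops at the first failing task and re-raises its exception.
    """
    # Include every known task so that tasks with no dependencies still run.
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    completed = []
    for name in TopologicalSorter(graph).static_order():
        log.info("starting %s", name)
        try:
            tasks[name]()
        except Exception:
            # Record the failure for later auditing, then stop the run.
            log.exception("task %s failed; stopping pipeline", name)
            raise
        completed.append(name)
        log.info("finished %s", name)
    return completed
```

Calling `run_pipeline` with three placeholder tasks, where `run_query` depends on `upload_data` and `export_results` depends on `run_query`, executes them in that order. A production solution would more likely rely on a dedicated workflow or scheduling tool, but the same concerns apply: explicit dependencies, a defined response to failure, and a log that an operator can monitor.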

More information

For more information about HDInsight, see the Microsoft Azure HDInsight web page.

See Collecting and loading data into HDInsight for more details and considerations for provisioning a cluster and storage, and uploading data to a big data solution such as HDInsight.

See Processing, querying, and transforming data using HDInsight for more details and considerations for processing big data with HDInsight.

See Consuming and visualizing data from HDInsight for more details and considerations for consuming the output of big data processing jobs.

See Appendix A - Tools and technologies reference for information about the many tools, frameworks, utilities, and technologies you can adopt to help automate an end-to-end solution.