Designing big data solutions using HDInsight

From: Developing big data solutions on Microsoft Azure HDInsight

This section of the guide explores the common use cases for batch processing in Hadoop-based big data solutions. These include iterative exploration, data warehouse on demand, ETL automation, and BI integration. The guide focuses primarily on Microsoft Azure HDInsight, but the models described here can be easily adapted to many other big data frameworks.

Hadoop-based big data solutions open up new opportunities for converting data into information. They can also be used to extend existing information systems to provide additional insights through analytics and data visualization. Every organization is different, so there is no definitive list of the ways you can use these types of solutions as part of your own business processes.

However, there are four general use cases and corresponding models, described below, that are appropriate for the typical batch processing workloads on an HDInsight cluster. Understanding these use cases will help you to start making decisions on how best to integrate HDInsight with your organization, and with your existing BI systems and tools.

Note

By incorporating additional applications that run under the YARN resource manager, HDInsight can be used to perform real-time processing of streaming data. However, this topic is outside the scope of the guide.

Use case 1: Iterative exploration

Figure 1 - The iterative exploration model

This model is typically chosen for experimenting with data sources to discover whether they can provide useful information, and for handling data that you cannot process using existing systems. For example, you might collect feedback from customers through email, web pages, or external sources such as social media sites, and then analyze it to get a picture of user sentiment for your products. You might be able to combine this information with other data, such as demographic data that indicates population density and characteristics in each city where your products are sold. For more details, see Use case 1: Iterative exploration, which describes the use case and its batch processing model. For an example of using this model, see Scenario 1: Iterative exploration.
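The sentiment example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how HDInsight processes data: the word lists, the `score` and `sentiment_by_city` functions, and the sample feedback and population figures are all hypothetical. In practice you would refine the scoring logic over several iterations as you learn what the data contains.

```python
from collections import defaultdict

# Hypothetical word lists; a real exploration would refine these iteratively.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"broken", "slow", "terrible"}


def score(text):
    """Return (positive word count - negative word count) for one feedback item."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_city(feedback, demographics):
    """Average the sentiment score per city and attach that city's population."""
    totals = defaultdict(lambda: [0, 0])  # city -> [score sum, item count]
    for city, text in feedback:
        totals[city][0] += score(text)
        totals[city][1] += 1
    return {
        city: {"avg_sentiment": s / n, "population": demographics.get(city)}
        for city, (s, n) in totals.items()
    }


# Illustrative sample data only.
feedback = [
    ("Seattle", "Love the product, excellent battery"),
    ("Seattle", "Shipping was slow"),
    ("Boston", "Arrived broken and support was terrible"),
]
demographics = {"Seattle": 750000, "Boston": 650000}

result = sentiment_by_city(feedback, demographics)
```

Here `result` maps each city to its average sentiment score alongside the demographic value, which is the kind of combined view the model is intended to help you discover.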

Use case 2: Data warehouse on demand

Figure 2 - The data warehouse on demand model

Hadoop-based big data systems such as HDInsight allow you to store both the source data and the results of queries executed over this data. You can also store schemas (or, to be precise, metadata) for tables that are populated by the queries you execute. These tables can be indexed, although there is no formal mechanism for managing key-based relationships between them. Even so, you can create data repositories that are robust and reasonably low cost to maintain, which is especially useful if you need to store and manage huge volumes of data. For more details, see Use case 2: Data warehouse on demand, which describes the use case and its batch processing model. For an example of using this model, see Scenario 2: Data warehouse on demand.
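The distinction between the stored data and the table metadata can be illustrated with a minimal Python sketch. It assumes tab-delimited text as the stored data and a hypothetical `SALES_SCHEMA` list of column names and types as the metadata; `read_table` is an illustrative helper, not part of any HDInsight API. The point is only that the schema is kept separately and applied when the data is read.

```python
# Metadata for a hypothetical "sales" table: column names and types only.
# The data itself stays in its original delimited-text form.
SALES_SCHEMA = [("city", str), ("product", str), ("units", int)]


def read_table(lines, schema, delimiter="\t"):
    """Apply the schema to raw delimited text, yielding one dict per row."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Stand-in for files held in cluster storage.
raw_lines = [
    "Seattle\twidget\t3\n",
    "Boston\twidget\t5\n",
    "Seattle\tgadget\t2\n",
]

# A simple "query": total units per city.
units_by_city = {}
for row in read_table(raw_lines, SALES_SCHEMA):
    units_by_city[row["city"]] = units_by_city.get(row["city"], 0) + row["units"]
```

Because the metadata is separate from the data, the same files could be read with a different schema later without being rewritten.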

Use case 3: ETL automation

Figure 3 - The ETL automation model

Hadoop-based big data systems such as HDInsight can be used to extract and transform data before you load it into your existing databases or data visualization tools. Such solutions are well suited to categorizing and normalizing data, and to extracting summary results that remove duplication and redundancy. This is typically referred to as an Extract, Transform, and Load (ETL) process. For more details, see Use case 3: ETL automation, which describes the use case and its batch processing model. For an example of using this model, see Scenario 3: ETL automation.
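A minimal sketch of the extract, transform, and load steps follows. It is written in plain Python with an in-memory SQLite database standing in for an existing database; the `transform` and `load` functions, the `city_sales` table, and the sample records are assumptions made for illustration. On a cluster, the transform step would run as a batch job over far larger data.

```python
import sqlite3


def transform(rows):
    """Normalize city names, skip repeated order IDs, and total units per city."""
    seen_ids = set()
    totals = {}
    for order_id, city, units in rows:
        if order_id in seen_ids:
            continue  # duplicate record from the source feed
        seen_ids.add(order_id)
        city = city.strip().title()  # normalization: " seattle" -> "Seattle"
        totals[city] = totals.get(city, 0) + int(units)
    return sorted(totals.items())


def load(summary, connection):
    """Load the summary rows into a table in the target database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS city_sales (city TEXT PRIMARY KEY, units INTEGER)"
    )
    connection.executemany("INSERT OR REPLACE INTO city_sales VALUES (?, ?)", summary)
    connection.commit()


# Extract: illustrative raw records as (order_id, city, units).
extracted = [
    (1, " seattle", "3"),
    (2, "BOSTON", "5"),
    (1, " seattle", "3"),  # repeated order ID
    (3, "Seattle ", "2"),
]

summary = transform(extracted)
db = sqlite3.connect(":memory:")  # stand-in for an existing database
load(summary, db)
loaded = db.execute("SELECT city, units FROM city_sales ORDER BY city").fetchall()
```

Only the small summary result reaches the database, which reflects the purpose of this model: the bulk of the processing happens before the load step.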

Use case 4: BI integration

Figure 4 - The BI integration model

Enterprise-level data warehouses have some special characteristics that differentiate them from online transaction processing (OLTP) database systems, and so there are additional considerations for integrating them with batch processing big data systems such as HDInsight. For example, you can integrate at different levels, depending on the way that you intend to use the data obtained from your big data solution. For more details, see Use case 4: BI integration, which describes the use case and its batch processing model. For an example of using this model, see Scenario 4: BI integration.
