Designing big data solutions using HDInsight
This section of the guide explores the common use cases for batch processing in Hadoop-based big data solutions. These include iterative exploration, data warehouse on demand, ETL automation, and BI integration. The guide focuses primarily on Microsoft Azure HDInsight, but the models described here can be easily adapted to many other big data frameworks.
Hadoop-based big data solutions open up new opportunities for converting data into information. They can also be used to extend existing information systems to provide additional insights through analytics and data visualization. Every organization is different, so there is no definitive list of the ways you can use this type of solution as part of your own business processes.
However, there are four general use cases and corresponding models, described below, that are appropriate for the typical batch processing workloads on an HDInsight cluster. Understanding these use cases will help you to start making decisions on how best to integrate HDInsight with your organization, and with your existing BI systems and tools.
By incorporating additional applications that run under the YARN resource manager, HDInsight can be used to perform real-time processing of streaming data. However, this topic is outside the scope of the guide.
Use case 1: Iterative exploration
Figure 1 - The iterative exploration model
This model is typically chosen for experimenting with data sources to discover whether they can provide useful information, and for handling data that you cannot process using existing systems. For example, you might collect feedback from customers through email, web pages, or external sources such as social media sites, and then analyze it to get a picture of user sentiment toward your products. You might be able to combine this information with other data, such as demographic data that indicates population density and characteristics in each city where your products are sold. For more details, see Use case 1: Iterative exploration, which describes the use case and its batch processing model. For an example of using this model, see Scenario 1: Iterative exploration.
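The feedback example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how HDInsight processes data: the word lists, the sample records, the population figures, and the function names are all assumptions made for this sketch, and a real exploration would refine each of them over several iterations.

```python
from collections import defaultdict

# Hypothetical word lists; an iterative exploration would adjust these repeatedly.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"broken", "slow", "terrible"}

# Hypothetical customer feedback collected from email, web pages, or social media.
feedback = [
    {"city": "Seattle", "text": "Love the new model, great battery"},
    {"city": "Seattle", "text": "Delivery was slow"},
    {"city": "Austin", "text": "Arrived broken and support was terrible"},
]

# Illustrative demographic data only; these are not real statistics.
population = {"Seattle": 750_000, "Austin": 960_000}


def score(text):
    """Return (positive word count - negative word count) for one comment."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_city(records):
    """Sum the sentiment scores of all comments for each city."""
    totals = defaultdict(int)
    for record in records:
        totals[record["city"]] += score(record["text"])
    return dict(totals)


def combine(sentiment, demographics):
    """Join the sentiment totals with demographic data, one row per city."""
    return [
        {"city": city, "sentiment": value, "population": demographics.get(city)}
        for city, value in sorted(sentiment.items())
    ]


if __name__ == "__main__":
    for row in combine(sentiment_by_city(feedback), population):
        print(row)
```

Each pass through an exploration like this typically changes one thing (the word lists, the grouping key, or the data it is joined with) and then re-runs the query to see whether the result is more useful.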
Use case 2: Data warehouse on demand
Figure 2 - The data warehouse on demand model
Hadoop-based big data systems such as HDInsight allow you to store both the source data and the results of queries executed over this data. You can also store schemas (or, to be precise, metadata) for tables that are populated by the queries you execute. These tables can be indexed, although there is no formal mechanism for managing key-based relationships between them. Even so, you can create data repositories that are robust and reasonably low cost to maintain, which is especially useful if you need to store and manage huge volumes of data. For more details, see Use case 2: Data warehouse on demand, which describes the use case and its batch processing model. For an example of using this model, see Scenario 2: Data warehouse on demand.
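The separation between stored data and table metadata can be illustrated with a small Python sketch. It does not use any HDInsight or Hive API; the record layout, the table names, and the helper functions are assumptions chosen for the example. The point is only that the source data stays unchanged, the schema is applied when the data is read, and the query result is kept as a second table with its own metadata.

```python
# Hypothetical raw source data, kept unchanged in storage (tab-delimited text).
raw_lines = [
    "2014-01-01\tSeattle\t12.50",
    "2014-01-01\tAustin\t7.25",
    "2014-01-02\tSeattle\t30.00",
]

# Table metadata stored separately from the data; names and layout are assumptions.
sales_schema = {
    "table": "sales",
    "delimiter": "\t",
    "columns": [("sale_date", str), ("city", str), ("amount", float)],
}


def read_table(lines, schema):
    """Apply the schema to the raw lines at read time and return rows as dicts."""
    rows = []
    for line in lines:
        fields = line.split(schema["delimiter"])
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema["columns"], fields)}
        )
    return rows


def total_by_city(rows):
    """A simple query: total sales amount for each city."""
    totals = {}
    for row in rows:
        totals[row["city"]] = totals.get(row["city"], 0.0) + row["amount"]
    return totals


# The query result is stored as a new table described by its own metadata.
city_totals = total_by_city(read_table(raw_lines, sales_schema))
city_totals_schema = {
    "table": "sales_by_city",
    "columns": [("city", str), ("total", float)],
}
```

Note that nothing in this sketch links the two tables by key; as the text says, any relationship between them exists only in the queries you write.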
Use case 3: ETL automation
Figure 3 - The ETL automation model
Hadoop-based big data systems such as HDInsight can be used to extract and transform data before you load it into your existing databases or data visualization tools. Such solutions are well suited to categorizing and normalizing data, and to extracting summary results that remove duplication and redundancy. This is typically referred to as an Extract, Transform, and Load (ETL) process. For more details, see Use case 3: ETL automation, which describes the use case and its batch processing model. For an example of using this model, see Scenario 3: ETL automation.
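The transform stage described above (normalization, categorization, removal of duplicates, and summarization) can be sketched in plain Python. Everything here is hypothetical: the sample records, the country alias table, the size threshold of 100, and the function names are assumptions for illustration, and on a cluster the same steps would run as batch jobs over far larger files.

```python
# Hypothetical extracted records with inconsistent formatting and one duplicate.
raw_records = [
    {"customer": " Alice ", "country": "usa", "amount": "120.0"},
    {"customer": "alice", "country": "USA", "amount": "120.0"},
    {"customer": "Bob", "country": "U.S.A.", "amount": "35.5"},
    {"customer": "Chen", "country": "de", "amount": "560"},
]

# Assumed mapping from free-form country text to a standard code.
COUNTRY_ALIASES = {"usa": "US", "u.s.a.": "US", "de": "DE"}


def normalize(record):
    """Trim whitespace, standardize case and country codes, parse the amount."""
    return {
        "customer": record["customer"].strip().title(),
        "country": COUNTRY_ALIASES.get(record["country"].strip().lower(), "UNKNOWN"),
        "amount": float(record["amount"]),
    }


def categorize(row):
    """Add an order-size category; the threshold of 100 is an arbitrary assumption."""
    return {**row, "size": "large" if row["amount"] >= 100 else "small"}


def transform(records):
    """Normalize and categorize each record, dropping exact duplicates."""
    seen = set()
    rows = []
    for record in records:
        row = categorize(normalize(record))
        key = (row["customer"], row["country"], row["amount"])
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


def summarize(rows):
    """Produce the summary result to load: total amount per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    clean_rows = transform(raw_records)
    print(clean_rows)             # rows ready to load into a database
    print(summarize(clean_rows))  # summary ready for a visualization tool
```

Duplicates are detected only after normalization, because the two Alice records differ in their raw form and match only once whitespace and case have been standardized.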
Use case 4: BI integration
Figure 4 - The BI integration model
Enterprise-level data warehouses have some special characteristics that differentiate them from online transaction processing (OLTP) database systems, and so there are additional considerations for integrating with batch processing big data systems such as HDInsight. For example, you can integrate at different levels, depending on the way that you intend to use the data obtained from your big data solution. For more details, see Use case 4: BI integration, which describes the use case and its batch processing model. For an example of using this model, see Scenario 4: BI integration.