Planning a big data solution
From: Developing big data solutions on Microsoft Azure HDInsight
Big data solutions such as Microsoft Azure HDInsight can help you discover vital information that may otherwise have remained hidden in your data—or even been lost forever. This information can help you to evaluate your organization’s historical performance, discover new opportunities, identify operational efficiencies, increase customer satisfaction, and even predict likely outcomes for the future. It’s not surprising that big data is generating so much interest and excitement in what some may see as the rather boring world of data management.
This section of the guide focuses on the practicalities of planning your big data solutions. This means you need to think about what you hope to achieve from them, even if your aim is just to explore the kinds of information you might be able to extract from available data, and how your solution will fit into your existing business infrastructure. It may be that you just want to use it alongside your existing business intelligence (BI) systems, or you may want to deeply integrate it with these systems. The important point is that, irrespective of how you choose to use it, the end result is the same: some kind of analysis of the source data and meaningful visualization of the results.
Many organizations already use data to improve decision making through existing BI solutions that analyze data generated by business activities and applications, and create reports based on this analysis. Rather than seeking to replace traditional BI solutions, big data provides a way to extend the value of your investment in BI by enabling you to incorporate a much wider variety of data sources that complement and integrate with existing data warehouses, analytical data models, and business reporting solutions.
The topics in this section provide an overview of the typical stages of planning, designing, implementing, and using a Hadoop-based big data batch processing mechanism such as HDInsight. For each stage you’ll find more details of the common concerns you must address, and pointers to help you make the appropriate choices.
An overview of the big data process
Designing and implementing a big data batch processing solution typically involves a common collection of stages, irrespective of the type of source data and the ultimate aims for obtaining information from that data. You may not carry out every one of these stages, or execute them in a specific order, but you should consider all of these aspects as you design and implement your solutions.
The stages are:
- Decide if big data is the appropriate solution. There are some tasks and scenarios for which big data batch-processing solutions based on Hadoop are ideally suited, while other scenarios may be better accomplished using a more traditional data management mechanism such as a relational database. For more details, see Is big data the right solution?
- Determine the analytical goals and source data. Before you start any data analysis project, it is useful to be clear about what you hope to achieve from it. You may have a specific question that you need to answer in order to make a critical business decision, in which case you must identify data that may help you determine the answer, where it can be obtained, and whether there are any costs associated with procuring it. Alternatively, you may already have some data that you want to explore to try to discern useful trends and patterns. Either way, understanding your goals will help you design and implement a solution that best supports those goals. For more details, see Determining analytical goals and Identifying source data.
- Design the architecture. While every data analysis scenario is different, and your requirements will vary, there are some basic use cases and models that are best suited to specific scenarios. For example, your requirements may involve a data analysis process followed by data cleansing and validation, perhaps as a workflow of tasks, before transferring the results to another system. This may form the basis for a mechanism that, for example, changes the behavior of an application based on user preferences and patterns of behavior collected as they use the application. For more details of the core use cases and models, see Designing big data solutions using HDInsight.
- Specify the infrastructure and cluster configuration. This involves choosing the appropriate big data software, or subscribing to an online service such as HDInsight. You will also need to determine the appropriate cluster size and storage requirements, consider whether you will need to delete and recreate the cluster as part of your management process, and ensure that your chosen solution will meet SLAs and business operational requirements. For more details, see Specifying the infrastructure.
- Obtain the data and submit it to the cluster. During this stage you decide how you will collect the data you have identified as the source, and how you will load it into your big data solution for processing. Often you will store the data in its raw format to avoid losing any useful contextual information it contains, though you may choose to do some pre-processing before storing it to remove duplication or to simplify it in some other way. For more details, see Collecting and loading data into HDInsight.
- Process the data. After you have started to collect and store the data, the next stage is to develop the processing solutions you will use to extract the information you need. You can usually use Hive and Pig queries, or other processing tools, for even quite complex data extraction. In a few rare circumstances you may need to create custom map/reduce components to perform more complex queries against the data. For more details, see Processing, querying, and transforming data using HDInsight.
- Evaluate the results. Probably the most important step of all is to ensure that you are getting the results you expected, and that these results make sense. Complex queries can be hard to write, and difficult to get right the first time. It’s easy to make assumptions or miss edge cases that can skew the results quite considerably. Of course, it may be that you don’t know what the expected result actually is (after all, the whole point of big data is to discover hidden information from the data) but you should make every effort to validate the results before making business decisions based on them. In many cases, a business user who is familiar enough with the business context can perform the role of a data steward and review the results to verify that they are meaningful, accurate, and useful.
- Tune the solution. At this stage, if the solution you have created is working correctly and the results are valuable, you should decide whether you will repeat it in the future, perhaps with new data you collect over time. If so, you should tune the solution by reviewing the log files it creates, the processing techniques you use, and the implementation of the queries to ensure that they are executing in the most efficient way. It’s possible to fine-tune big data solutions to improve performance, reduce network load, and minimize the processing time by adjusting some parameters of the query and the execution platform, or by compressing the data that is transferred over the network.
- Visualize and analyze the results. Once you are satisfied that the solution is working correctly and efficiently, you can plan and implement the analysis and visualization approach you require. This may be loading the data directly into an application such as Microsoft Excel, or exporting it into a database or enterprise BI system for further analysis, reporting, charting, and more. For more details, see Consuming and visualizing data from HDInsight.
- Automate and manage the solution. At this point it will be clear if the solution should become part of your organization’s business management infrastructure, complementing the other sources of information that you use to plan and monitor business performance and strategy. If this is the case, you should consider how you might automate and manage some or all of the solution to provide predictable behavior, and perhaps so that it is executed on a schedule. For more details, see Building end-to-end solutions using HDInsight.
Note that, in many ways, data analysis is an iterative process, and you should take this approach when building a big data batch processing solution. In particular, given the large volumes of data and correspondingly long processing times typically involved in big data analysis, it can be useful to start by implementing a proof-of-concept iteration in which a small subset of the source data is used to validate the processing steps and results before proceeding with a full analysis. This enables you to test your big data processing design on a small cluster, or even on a single-node on-premises cluster, before scaling out to accommodate production-level data volumes.
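One way to obtain a small subset for a proof-of-concept iteration is to sample the source records deterministically, so that every run selects the same records and results can be compared between iterations. The following Python sketch is illustrative only and is not part of HDInsight. It assumes the source is a set of tab-delimited text lines whose first field is a key such as a user ID. The names `in_sample` and `sample_lines` are hypothetical.

```python
import hashlib


def in_sample(record_key, percent):
    """Return True if the key falls into a repeatable sample of roughly `percent` percent."""
    # Hash the key so the decision depends only on the key, not on record order.
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def sample_lines(lines, percent, key_of=lambda line: line.split("\t", 1)[0]):
    """Yield only the lines whose key hashes into the sample.

    Assumption: each line is tab-delimited and its first field is the key.
    All records that share a key are kept or dropped together.
    """
    for line in lines:
        if in_sample(key_of(line), percent):
            yield line
```

Sampling by key rather than by line position keeps all records for a given key together, which matters if the processing steps group or join on that key. Whether this suits your data depends on how evenly the keys are distributed.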
Note
It’s easy to run queries that extract data, but it’s vitally important that you make every effort to validate the results before using them as the basis for business decisions. If possible, you should try to cross-reference the results with other sources of similar information.
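As a minimal sketch of such a cross-reference, the following Python function compares aggregated results against figures from another source and reports any values that are missing or differ by more than a chosen tolerance. The function name `cross_check`, the five percent default tolerance, and the dictionary layout (key to numeric total) are assumptions made for illustration. An appropriate tolerance depends on how closely the two sources are expected to agree.

```python
import math


def cross_check(results, reference, rel_tol=0.05):
    """Return the keys whose values are missing or disagree beyond rel_tol.

    `results` holds totals produced by the big data query and `reference`
    holds totals from another source, both keyed the same way (assumed).
    """
    discrepancies = {}
    for key, expected in reference.items():
        actual = results.get(key)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            discrepancies[key] = (actual, expected)
    return discrepancies


# Hypothetical figures: sales totals by region from the query and from a BI report.
query_totals = {"north": 1000.0, "south": 480.0}
report_totals = {"north": 1010.0, "south": 600.0, "east": 50.0}
print(cross_check(query_totals, report_totals))
```

A discrepancy does not necessarily mean the query is wrong, because the reference source may be incomplete or defined differently. It does indicate a result that a data steward should review before it informs a decision.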