Collecting and loading data into HDInsight
This section of the guide explores how you can load data into your Hadoop-based big data solutions. It describes several typical data ingestion techniques that are generally applicable to any big data solution, including handling streaming data and automating the ingestion process. While the focus is primarily on Microsoft Azure HDInsight, many of the techniques described here are equally relevant to solutions built on other Hadoop frameworks and platforms.
Figure 1 shows an overview of the techniques and technologies related to this section of the guide.
Figure 1 - Overview of data ingestion techniques and technologies for HDInsight
For more details of the tools shown in Figure 1, see the tables in Appendix A - Tools and technologies reference.
The following topics in this section discuss the considerations for collecting and loading data into your big data solutions:
- Data types and data sources
- Cluster and storage initialization
- Performance and reliability
- Pre-processing and serializing the data
- Choosing tools and technologies
- Building custom clients
Security is also a fundamental concern in all computing scenarios, and big data processing is no exception. Security considerations apply during all stages of a big data process, and include securing data while in transit over the network, securing data in storage, and authenticating and authorizing users who have access to the tools and utilities you use as part of your process. For more details of how you can maximize security of your HDInsight solutions, see the topic Security in the section Building end-to-end solutions using HDInsight.
Data types and data sources
The data sources for a big data solution are likely to be extremely variable. Typical examples of data sources are web clickstreams, social media, server logs, devices and sensors, and geo-location data. Some data may be persisted in a repository such as a database or a NoSQL store (including cloud-based storage), while other data may be accessible only as a stream of events.
There are specific tools designed to handle different types of data and different data sources. For example, streaming data may need to be captured and persisted so that it can be processed in batches. Data may also need to be staged prior to loading it so that it can be pre-processed to convert it into a form suitable for processing in a big data cluster.
However, you can collect and load almost any type of data from almost anywhere, even if you decide not to stage and/or pre-process the data. For example, you can use a custom Input Formatter to load data that is not exposed in a suitable format for the built-in Hadoop Input Formatters.
The blog page Analyzing Azure Table Storage data with HDInsight demonstrates how you can use a custom Input Formatter to collect and load data from Azure table storage.
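Custom Input Formatters are written in Java, but in many cases the same goal can be achieved more simply by pre-processing the data into a line-oriented, delimited form that the built-in Hadoop Input Formatters already understand. The following sketch (not part of any HDInsight tooling, and using a hypothetical fixed-width record layout) illustrates that kind of conversion: each fixed-width record is turned into a tab-delimited line that TextInputFormat, and tools such as Hive, can consume directly.

```python
# A minimal sketch of a pre-processing step: convert fixed-width records
# into tab-delimited lines before uploading them to cluster storage.
# The field widths below (device id, reading, date) are an assumed layout,
# not a real format from the guide.

FIELD_WIDTHS = [10, 6, 8]  # hypothetical: device id, reading, date

def to_tab_delimited(record, widths=FIELD_WIDTHS):
    """Split one fixed-width record into fields and join them with tabs."""
    fields, pos = [], 0
    for width in widths:
        fields.append(record[pos:pos + width].strip())
        pos += width
    return "\t".join(fields)

def convert_file(in_path, out_path):
    """Convert a staged source file line by line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(to_tab_delimited(line.rstrip("\n")) + "\n")

# Example: to_tab_delimited("sensor-001  23.520140301")
# produces "sensor-001\t23.5\t20140301"
```

The converted file can then be uploaded to Azure blob storage as-is, avoiding the need for any custom Java code in the cluster.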
For information about handling streaming data and pre-processing data, see the topic Pre-processing and serializing the data in this section of the guide. For information about choosing a tool specific to data sources such as relational databases and server log files, see the topic Choosing tools and technologies in this section of the guide.
When planning how you will obtain the source data for your big data solution, consider the following:
- You may need to load data from a range of different data sources such as websites, RSS feeds, clickstreams, custom applications and APIs, relational databases, and more. It’s vital to ensure that you can submit this data efficiently and accurately to cluster storage, including performing any preprocessing that may be required to capture the data and convert it into a suitable form.
- In some cases, such as when the data source is an internal business application or database, extracting the data into a file in a form that can be consumed by your solution is relatively straightforward. In the case of external data obtained from sources such as governments and commercial data providers, the data is often available for download in a suitable format. However, in other cases you may need to extract data through a web service or other API, perhaps by making a REST call or using code.
- You may need to stage data before submitting it to a big data cluster for processing. For example, you may want to persist streaming data so that it can be processed in batches, or collect data from more than one data source and combine the datasets before loading them into the cluster. Staging is also useful when combining data from multiple sources that have different formats and velocity (rate of arrival).
- Dedicated tools are available for handling specific types of data such as relational or server log data. See Choosing tools and technologies for more information.
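The staging pattern described above can be sketched very simply. The following example (a minimal illustration using only the Python standard library, with an assumed batch size and file-naming scheme) buffers incoming streaming events in memory and flushes them to a timestamped batch file once the buffer is full, so the batches can later be uploaded to cluster storage in bulk.

```python
# A minimal sketch of staging streaming events into batch files.
# The batch size and "batch-<timestamp>.txt" naming are assumptions
# made for illustration, not conventions from HDInsight.
import os
import time

class StagingBuffer:
    def __init__(self, staging_dir, batch_size=1000):
        self.staging_dir = staging_dir
        self.batch_size = batch_size
        self.events = []
        os.makedirs(staging_dir, exist_ok=True)

    def add(self, event):
        """Collect one event; flush automatically when the batch is full."""
        self.events.append(event)
        if len(self.events) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write buffered events to a timestamped batch file and clear the buffer."""
        if not self.events:
            return None
        path = os.path.join(self.staging_dir,
                            "batch-%d.txt" % int(time.time() * 1000))
        with open(path, "w") as f:
            f.write("\n".join(self.events) + "\n")
        self.events = []
        return path
```

In a real solution a separate process would move the flushed batch files into Azure blob storage, for example with AzCopy or a tool such as Flume, so that the cluster can process them as a batch.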
For more information about HDInsight, see the Microsoft Azure HDInsight web page.
For a guide to uploading data to HDInsight, and some of the tools available to help, see Upload data to HDInsight on the HDInsight website.
For more details of how HDInsight uses Azure blob storage, see Use Microsoft Azure Blob storage with HDInsight on the HDInsight website.