Choosing tools and technologies

From: Developing big data solutions on Microsoft Azure HDInsight

Choosing an appropriate tool or technology for uploading data can make the process more efficient, scalable, secure, and reliable. You can use scripts, tools, or custom utilities to perform the upload, perhaps driven by a scheduler such as Windows Scheduled Tasks. In addition, if you intend to automate the upload process, or the entire end-to-end solution, ensure that the tools you choose can be instantiated and controlled as part of this process.
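
For example, the following is a minimal sketch of how an upload script might be scheduled with the Windows Task Scheduler cmdlets. The script path, task name, and schedule shown here are hypothetical placeholders, and the ScheduledTasks module (Windows 8 or Windows Server 2012 and later) is assumed to be available.

    # Assumes an existing upload script at C:\Scripts\UploadData.ps1 (a hypothetical path).
    $action  = New-ScheduledTaskAction -Execute "powershell.exe" `
               -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\Scripts\UploadData.ps1"
    $trigger = New-ScheduledTaskTrigger -Daily -At "2:00am"
    Register-ScheduledTask -TaskName "UploadDataToHDInsight" -Action $action -Trigger $trigger `
               -Description "Nightly upload of source data to the cluster storage account"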

Several tools are specific to the types of data they handle. For example, there are tools designed to handle the upload of relational data and server log files. The following sections explore some of these scenarios. A list of popular tools is included in Appendix A - Tools and technologies reference. For information on creating your own tools, see Custom data upload clients.

Interactive data ingestion

An initial exploration of data with a big data solution often involves experimenting with a relatively small volume of data. In this case there is no requirement for a complex, automated data ingestion solution, and the tasks of obtaining the source data, preparing it for processing, and uploading it to cluster storage can be performed interactively.

You can upload the source data interactively using:

  • A UI-based tool such as CloudBerry Explorer, Microsoft Azure Storage Explorer, or Server Explorer in Visual Studio. For a useful list of third-party tools for uploading data to HDInsight interactively, see Upload data for Hadoop jobs in HDInsight on the Azure website.
  • PowerShell commands that take advantage of the PowerShell cmdlets for Azure (see the sketch after this list). This capability is useful if you are just experimenting or working on a proof of concept.
  • The hadoop dfs -copyFromLocal [source] [destination] command at the Hadoop command line using a remote desktop connection.
  • A command line tool such as AzCopy if you need to upload large files.
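
For example, the following is a minimal sketch of an interactive upload using the Azure PowerShell storage cmdlets. The storage account name, key, container name, and file path are placeholders for your own values; the uploaded blob then becomes visible to the cluster through its wasb:// path.

    # Placeholder values - replace with your own storage account details.
    $storageAccountName = "mystorageaccount"
    $storageAccountKey  = "<storage account key>"
    $containerName      = "mycontainer"

    # Create a storage context and upload a local file to the container used by the cluster.
    $context = New-AzureStorageContext -StorageAccountName $storageAccountName `
                                       -StorageAccountKey $storageAccountKey
    Set-AzureStorageBlobContent -File "C:\Data\sourcedata.csv" -Container $containerName `
                                -Blob "data/sourcedata.csv" -Context $context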

Consider how you will handle very large volumes of data. While small volumes of data can be copied into storage interactively, you will need to choose or build a more robust mechanism capable of handling large files when you move beyond the experimentation stage.

Handling streaming data

Solutions that perform processing of streaming data, such as that arriving from device sensors or web clickstreams, must be able either to process the data completely on an event-by-event basis in real time, or to capture the events in a persistent or semi-persistent store so that they can be processed in batches. For small batches (sometimes referred to as micro-batch processing), the events might be captured in memory before being uploaded as small batches at short intervals to provide near real-time results. For larger batches, the data is likely to be stored in a persistent mechanism such as a database, disk file, or cloud storage.
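
As an illustration of the micro-batch approach, the following sketch buffers events in memory and flushes them to blob storage at a short interval. The event source (Receive-SensorEvent), the interval, and the blob naming convention are all hypothetical, and the storage context and container are assumed to have been created as shown earlier.

    # Hypothetical micro-batch capture loop: accumulate events in memory and upload
    # them as a small file at a fixed interval (60 seconds here, for illustration only).
    $batchIntervalSeconds = 60
    $buffer = New-Object System.Collections.Generic.List[string]

    while ($true) {
        $batchStart = Get-Date
        while (((Get-Date) - $batchStart).TotalSeconds -lt $batchIntervalSeconds) {
            $evt = Receive-SensorEvent   # placeholder for your own event source
            if ($evt) { $buffer.Add($evt) }
        }

        if ($buffer.Count -gt 0) {
            # Write the batch to a temporary file and upload it with a time-based blob name.
            $fileName  = "events-{0:yyyyMMdd-HHmmss}.txt" -f (Get-Date)
            $localPath = Join-Path $env:TEMP $fileName
            $buffer | Out-File $localPath
            Set-AzureStorageBlobContent -File $localPath -Container $containerName `
                                        -Blob "streaming/$fileName" -Context $context
            $buffer.Clear()
        }
    }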

However, some solutions may combine these two approaches by capturing events that are both redirected to a real-time visualization solution and stored for submission in batches to the big data solution for semi-real-time or historical analysis. This approach is also useful if you need to quickly detect specific events in the data stream, but store the rest for batch analysis. For example, financial organizations may use real-time data processing to detect fraud or non-standard trading events, but also maintain the data to identify patterns and predict future trends.

The choice of tools and technologies depends on the platform and the type of processing you need to accomplish. Typical tools for capturing or processing streaming data include:

  • Microsoft StreamInsight. This is a complex event processing (CEP) engine with a framework API for building applications that consume and process event streams. It can be run on-premises or in a virtual machine. For more information about developing StreamInsight applications, see Microsoft StreamInsight on MSDN.
  • Apache Storm. This is an open-source framework that can run on a Hadoop cluster to capture streaming data. It uses other Hadoop-related technologies such as Zookeeper to manage the data ingestion process. See the section “Overview of Storm” in the topic Data processing tools and techniques and Apache Storm on the Hortonworks website for more information.
  • Other open-source frameworks such as Kafka and Samza. These frameworks provide capabilities to capture streaming data and process it in real time, including persisting the data or messages as files for batch processing when required.
  • A custom event or stream capture solution that feeds the data into the cluster data store in real time or in batches. The interval should be based on the frequency at which related query jobs will be instantiated. You could use the Reactive Extensions (Rx) library to implement a real-time stream capture utility.

For more information see Appendix A - Tools and technologies reference.

Loading relational data

Source data for analysis is often obtained from a relational database. It’s quite common to use HDInsight to query data extracted from a relational database. For example, you may use it to search for hidden information in data exported from your corporate database or data warehouse as part of a sandboxed experiment, without consuming resources on the database server or risking corruption of the existing data.

You can use Sqoop to extract the data you require from a table, view, or query in the source database and save the results as a file in your cluster storage. In HDInsight this approach makes it easy to transfer data from a relational database when your infrastructure supports direct connectivity from the HDInsight cluster to the database server, such as when your database is hosted in Azure SQL Database or a virtual machine in Azure.
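
As a sketch, a Sqoop import could be submitted to the cluster using the HDInsight PowerShell cmdlets, as shown below. The connection string, table name, cluster name, and output path are placeholders, and the exact cmdlet names and parameters may vary with the version of the Azure PowerShell module you have installed.

    # Placeholder Sqoop import command for an Azure SQL Database source.
    $sqoopCommand = "import " +
        "--connect 'jdbc:sqlserver://myserver.database.windows.net;database=mydb;user=myuser@myserver;password=<password>' " +
        "--table FactSales --target-dir /data/factsales -m 1"

    # Define the Sqoop job, submit it to the cluster, and wait for it to complete.
    $sqoopJob = New-AzureHDInsightSqoopJobDefinition -Command $sqoopCommand
    $job = Start-AzureHDInsightJob -Cluster "mycluster" -JobDefinition $sqoopJob
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600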

Some business intelligence (BI) and data warehouse implementations include interfaces that support connectivity to a big data cluster. For example, the Microsoft Analytics Platform System (APS) includes PolyBase, which exposes a SQL-based interface for accessing data stored in Hadoop and HDInsight.

Loading web server log files

Analysis of web server log data is a common use case for big data solutions, and requires log files to be uploaded to the cluster storage. Flume is an open-source, agent-based framework for copying log files to a central location, and it is often used to load web server logs into HDFS for processing in Hadoop.

Note

Flume was not included in HDInsight when this guide was written, but can be downloaded from the Flume project website. See the blog post Using Apache Flume with HDInsight.

As an alternative to using Flume, you can use SQL Server Integration Services (SSIS) to implement an automated batch upload solution. For more details about using SSIS, see Scenario 4: BI integration and Appendix A - Tools and technologies reference.
