What is Microsoft HDInsight?
From: Developing big data solutions on Microsoft Azure HDInsight
Microsoft Azure HDInsight provides a pay-as-you-go solution for Hadoop-based big data batch processing that is cost-effective because you do not need to commit to installing and configuring on-premises infrastructure. You can instantiate and configure a Hadoop cluster in HDInsight when required, and remove it when it is not required. HDInsight uses a cluster of Azure virtual machines running the Hortonworks Data Platform (HDP), and it integrates with Azure blob storage.
Note
This guide is based on the version 3.0 (March 2014) release of HDInsight on Azure, but also includes some of the preview features that are available in later versions. Earlier and later releases of HDInsight may differ from the version described in this guide. For more information, see What's new in the Hadoop cluster versions provided by HDInsight? To sign up for the Azure service, go to the HDInsight service home page.
Data storage
Big data solutions typically store data as a series of files located within a folder structure on disk. However, in HDInsight these files are stored in Azure blob storage. HDInsight supports the standard Hadoop file system commands and processes by using a fully HDFS-compliant layer over Azure blob storage. As far as Hadoop is concerned, storage operates in exactly the same way as when using a physical HDFS implementation. The advantages are that you can access storage using standard Azure blob storage techniques as well as through the HDFS layer, and the data can be persisted when the cluster is decommissioned.
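As a minimal sketch of the two access routes described above, the following Python snippet builds both addresses for the same file: the `wasb://` URI that Hadoop tools use through the HDFS layer, and the HTTPS URL used by standard blob storage clients. The account, container, and path names are hypothetical placeholders, and the helper function `blob_addresses` is an illustration only, not part of any HDInsight SDK.

```python
def blob_addresses(account: str, container: str, path: str) -> dict:
    """Return the HDFS-layer URI and the blob storage URL for one file.

    The account, container, and path values are supplied by the caller;
    the ones used below are hypothetical examples.
    """
    path = path.lstrip("/")
    host = f"{account}.blob.core.windows.net"
    return {
        # Address used by Hadoop tools through the HDFS-compliant layer
        "hdfs": f"wasb://{container}@{host}/{path}",
        # Address used by standard Azure blob storage clients
        "blob": f"https://{host}/{container}/{path}",
    }


addresses = blob_addresses("mystorageaccount", "mycontainer", "/data/logs/2014-03.txt")
print(addresses["hdfs"])
print(addresses["blob"])
```

Both addresses refer to the same stored blob, which is why the data remains available after the cluster that processed it has been removed.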
HDInsight also offers the option to create a cluster that hosts the HBase open source data management system. HBase is a NoSQL wide-column data store implemented as a distributed system that provides data processing and storage over multiple nodes in a Hadoop cluster. It provides a random, real-time, read/write data store designed to host tables that can contain billions of rows and millions of columns.
Note
For more information about how HDInsight uses blob storage, and the optional use of HBase, see “Data storage” in the topic Specifying the infrastructure.
Data processing
HDInsight supports many of the Hadoop query, transformation, and analysis tools, and you can install some additional tools and utilities on an HDInsight cluster if required. Examples of the tools and utilities commonly used with Hadoop-based solutions such as HDInsight are:
- Hive, which allows you to overlay a schema onto the data when you need to run a query, and use a SQL-like language called HiveQL for these queries. For example, you can use the CREATE TABLE command to build a table by splitting the text strings in the data using delimiters or at specific character locations, and then execute SELECT statements to extract the required data.
- Pig, which allows you to create schemas and execute queries by writing scripts in a high level language called Pig Latin. Pig Latin is a procedural language that processes relations by performing multiple interrelated data transformations that are explicitly encoded as data flow sequences.
- Map/reduce using components written in Java, and executed directly by the Hadoop framework. As an alternative you can use the Hadoop streaming interface to execute map and reduce components written in other languages such as C# and F#.
- Mahout, a machine learning library that allows you to perform data mining queries that examine data files to extract specific types of information. For example, it supports recommendation mining (finding users’ preferences from their behavior), clustering (grouping documents with similar topic content), and classification (assigning new documents to a category based on existing categorization).
- Storm is a distributed real-time computation system for processing fast, large streams of data. It allows you to build trees and directed acyclic graphs (DAGs) that asynchronously process data items using a user-defined number of parallel tasks. It can be used for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
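The idea of overlaying a schema at query time, described for Hive in the list above, can be illustrated with a short Python analogy. This is not HiveQL and does not run on a cluster; it only shows the principle of splitting raw text on a delimiter into named columns when the data is read, then selecting the required values. The column names, sample lines, and the `overlay_schema` helper are all hypothetical.

```python
def overlay_schema(lines, columns, delimiter=","):
    """Apply column names to delimited text lines as they are read."""
    for line in lines:
        values = line.rstrip("\n").split(delimiter)
        yield dict(zip(columns, values))


# Raw text as it might sit in storage, with no schema attached (sample data)
raw_lines = [
    "2014-03-01,error,disk full",
    "2014-03-01,info,job started",
    "2014-03-02,error,timeout",
]

# Roughly what CREATE TABLE with a field delimiter achieves: naming the columns
rows = overlay_schema(raw_lines, ["log_date", "level", "message"])

# Roughly what a SELECT with a WHERE clause achieves: extracting required data
errors = [row["message"] for row in rows if row["level"] == "error"]
print(errors)
```

The stored text is never modified; the schema exists only in the query, so a different schema can be applied to the same files later.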
Note
At the time of writing, Mahout and Storm were not supported on HDInsight. For more information about the query and analysis tools in HDInsight, see Processing, querying, and transforming data using HDInsight.
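The map and reduce components used with the Hadoop streaming interface follow a simple contract: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer aggregates each run of identical keys. The following word count is a minimal sketch of that contract in Python, assuming plain text input. The `run_locally` function is a hypothetical stand-in that simulates the sort step on a single machine; in a real streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each run of identical keys in sorted input."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate the streaming pipeline: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["big data on Azure", "Big clusters on demand"])
for line in result:
    print(line)
```

Because the reducer relies only on its input being sorted by key, the same logic works whether the sort is done locally, as here, or by the Hadoop framework across many nodes.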
Data access and workflow
The tools and utilities installed in HDInsight, and available in the HDInsight and Azure SDKs, can help you to build a wide range of solutions. They include:
- An ODBC driver that can be used to connect any ODBC-enabled consumer (such as a database, or visualization tools such as Excel) with the data in Hive tables.
- A LINQ to Hive implementation that allows LINQ queries to be executed over the data in HDInsight.
- HCatalog, which is used in conjunction with queries, such as those that use Hive and Pig, to abstract the physical paths to storage and make it easier to manage data and queries as a solution evolves.
- Sqoop, which can be used to import and export relational data to and from HDInsight.
- Oozie, which provides a mechanism for automating workflows and operations. It supports sequential and parallel workflow processes, and is extremely flexible.
Note
More information about these and other tools and utilities is available in subsequent sections of this guide, and in Appendix A - Tools and technologies reference.
Administration, automation, and monitoring
HDInsight contains a dashboard that provides rudimentary monitoring for clusters, a Hive editor where you can test your Hive queries, and some administration capabilities. It’s also possible to open a remote desktop connection to a cluster. However, the majority of administration, management, deployment, and query execution tasks are typically carried out by using the tools and utilities installed with HDInsight, the Azure and HDInsight PowerShell cmdlets, the classes in the HDInsight SDKs, and custom or third-party utilities.
The tools and utilities provided with HDInsight, or available for download, allow you to carry out two distinct sets of tasks:
- Cluster management. This includes tasks such as creating and deleting clusters, and obtaining runtime monitoring information.
- Job execution. This includes uploading data and jobs, executing jobs, and downloading or accessing the results.
Cluster management makes use of Apache ZooKeeper (which is used internally to manage some aspects of HDInsight) and some features of the Ambari cluster monitoring framework.
The PowerShell cmdlets for Azure can be used to access blob storage to upload data to an HDInsight cluster, as well as performing administrative tasks related to managing your subscription and services. The PowerShell cmdlets for HDInsight allow full access to and management of almost all features of HDInsight.
SDKs are available for use in creating applications that perform management and job submission for HDInsight. The SDKs contain APIs that include classes for accessing storage, using HCatalog, automating tasks with Oozie, and accessing monitoring information through Ambari. The .NET SDK also contains a map/reduce implementation that uses the streaming interface to allow you to write queries in .NET languages.
In addition, there is a cross-platform command-line interface available that allows you to access HDInsight from different client platforms, and a management pack for Microsoft System Center.
Note
For more information about administration tools and techniques for HDInsight, see Building end-to-end solutions using HDInsight and Appendix A - Tools and technologies reference.
More information
For an overview and description of HDInsight, see Microsoft Big Data.
To sign up for the Azure HDInsight service, go to the Azure HDInsight Service page.
For more information about using HDInsight, a good place to start is the TechNet library. You can see a list of articles related to HDInsight by searching the library using this URL: https://social.technet.microsoft.com/Search/en-US?query=hadoop.
The official support forum for HDInsight is at https://social.msdn.microsoft.com/Forums/en-US/hdinsight/threads.