What is HDInsight?

The volume, variety, and velocity of data generated today have created a need for systems that work effectively and efficiently with semi-structured and unstructured data. Traditional relational database management systems (RDBMS) were used in attempts to process, store, and analyze “big data”, but it was the world of open-source software (OSS) that made the breakthrough. OSS big data technologies run on commodity hardware in a distributed manner, combined with software that scales data and analytics beyond the limits imposed by a single server.

OSS is freely available for both organizations and individuals to use. In the past, the lack of governance and support for OSS made it difficult for some enterprises to adopt it. With the advent of the cloud, many cloud providers host these services and provide managed support to organizations that use OSS technologies. This proposition is compelling because organizations can reap the benefits of OSS without incurring the cost of managing and supporting it themselves. OSS is common in the big data space, where many technologies exist not only to process and store data, but also to perform analytics. OSS analytics enables a multicloud, open application strategy that is not tied to a single cloud vendor. It provides portability whether you need to move solutions from on-premises to the cloud, or between different cloud vendors.

One of the core OSS analytical technologies used in big data solutions is Hadoop. It typically stores data in the Hadoop Distributed File System (HDFS) across a cluster of commodity computers and processes it with a programming model named MapReduce. This programming model enables the distributed processing of large datasets in a linear dataflow. For improved performance, Apache Spark builds on the architectural capabilities of Hadoop but replaces the MapReduce paradigm with the Resilient Distributed Dataset (RDD), an abstraction that lets data be processed in memory, which is often much faster.
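To make the linear map, shuffle, and reduce dataflow concrete, here is a minimal single-process sketch in plain Python. It is an illustration only: it does not use Hadoop or any Hadoop API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for this example. In a real cluster, each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence (hypothetical helper, not a Hadoop API)."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Each phase reads only the output of the previous one, which is the linear dataflow described above. A distributed engine applies the same idea, but partitions the input lines and the grouped keys across the machines in the cluster.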

It is worth noting that OSS analytics has gone beyond the traditional application of big data solutions with Hadoop and Spark. OSS analytics now incorporates a wide range of software including the following:

  • Kafka and Flink for streaming scenarios
  • Presto and Kylin as SQL abstraction layers
  • AI layers added with H2O.ai and Dataiku

Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. At Microsoft, OSS analytics is implemented within Azure HDInsight. You can use open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, and Apache Kafka. You also get the enterprise-level security, monitoring capabilities, and high-availability options that you would expect from a service hosted in Azure. Azure HDInsight is also extensible and customizable, so it can handle a range of customer scenarios.