Streaming using HDInsight

This article is a solution idea. If you'd like us to expand the content with more information, such as potential use cases, alternative services, implementation considerations, or pricing guidance, let us know by providing GitHub feedback.

This article outlines a solution for ingesting and processing millions of streaming events per second. Core components include Azure HDInsight, Apache Kafka, Apache Storm, and Apache Spark.

Apache®, Apache Kafka, Apache Storm, Apache Spark, Apache HBase, and the flame logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Architecture

Architecture diagram that shows how streaming data is ingested and processed in an Azure environment and then presented to users.

Dataflow

  • Kafka ingests streaming data.
  • Storm and Spark process the data.
  • Apache HBase, which is a NoSQL database, stores results.
  • Users consume the data in apps.
  • The data is visualized in Power BI.
  • HDInsight stores data in Azure Data Lake Storage for secure and scalable processing in the cloud.
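
The stages above can be illustrated with a small, self-contained simulation. This is only a toy sketch in plain Python: it does not call Kafka, Storm, Spark, or HBase, and the event fields, the five-second window, and the row-key layout are assumptions made for illustration.

```python
from collections import defaultdict, deque

# Hypothetical sensor events standing in for messages on a Kafka topic.
EVENTS = [
    {"sensor": "s1", "ts": 0, "value": 10.0},
    {"sensor": "s2", "ts": 1, "value": 4.0},
    {"sensor": "s1", "ts": 2, "value": 14.0},
    {"sensor": "s1", "ts": 6, "value": 20.0},
    {"sensor": "s2", "ts": 7, "value": 6.0},
]

WINDOW_SECONDS = 5  # assumed tumbling-window size


def ingest(events):
    """Ingestion stage: buffer events in arrival order, as a topic would."""
    return deque(events)


def process(buffer):
    """Processing stage: average each sensor's values per tumbling window."""
    sums = defaultdict(lambda: [0.0, 0])  # (sensor, window) -> [total, count]
    while buffer:
        event = buffer.popleft()
        window_start = event["ts"] // WINDOW_SECONDS * WINDOW_SECONDS
        acc = sums[(event["sensor"], window_start)]
        acc[0] += event["value"]
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def store(results):
    """Storage stage: write each result under a row key, as a key-value table would."""
    return {f"{sensor}#{window:010d}": avg for (sensor, window), avg in results.items()}


table = store(process(ingest(EVENTS)))
for row_key in sorted(table):
    print(row_key, table[row_key])
```

Zero-padding the window start in the row key keeps rows for one sensor sorted by time, which is one common way to support range reads in a key-ordered store. A real deployment might choose a different key design.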

Components

Scenario details

Many Apache components are a good fit for systems that stream a large volume of data:

  • Kafka is a widely used high-performance event-streaming platform.
  • Storm is a computation system that quickly processes large volumes of data in real time.
  • Spark is a data processing framework that uses in-memory data sharing.
  • HBase is a schemaless database that provides random access and strong consistency for large amounts of data.
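
As one possible way to connect two of these components, the following sketch builds the options that Spark Structured Streaming's Kafka source accepts and shows how a streaming read might look. The broker host names and the topic name are hypothetical, and `read_events` assumes that a `SparkSession` and the `spark-sql-kafka` connector package are available on the cluster.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build options for Spark Structured Streaming's Kafka source."""
    if not bootstrap_servers:
        raise ValueError("at least one broker is required")
    return {
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_events(spark, options):
    """Return a streaming DataFrame with message keys and values cast to strings.

    `spark` is an existing SparkSession; the Kafka connector must be on the classpath.
    """
    stream = spark.readStream.format("kafka").options(**options).load()
    return stream.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )


# Hypothetical broker host names and topic, for illustration only.
options = kafka_source_options(["broker-0:9092", "broker-1:9092"], "sensor-events")
print(options)
```

Keeping the option building separate from the read makes the configuration easy to check without a running cluster.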

These components offer the added advantage of being open source. By using HDInsight, you can run these Apache components in an Azure environment.

HDInsight is an enterprise-scale analytics service in the cloud. This managed-cluster platform simplifies the process of running big data frameworks that use Apache components:

  • You can use HDInsight to create optimized clusters for Spark, Kafka, and HBase.
  • An HDInsight Spark cluster can use a Spark HBase connector to query an HDInsight HBase cluster.
  • HDInsight also offers other benefits, including scalability, security, centralized monitoring, global availability, and extensibility.
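
To make the connector point more concrete, the following sketch builds the JSON catalog that the open-source Spark HBase connector (SHC) uses to map an HBase table to a DataFrame schema. The table name, column families, and column names are hypothetical, and `read_hbase` assumes that SHC is on the Spark classpath and that the Spark cluster can reach the HBase cluster.

```python
import json

# Data source name used by the Spark HBase connector (SHC).
SHC_FORMAT = "org.apache.spark.sql.execution.datasources.hbase"


def build_catalog(table, columns, namespace="default"):
    """Describe an HBase table as an SHC catalog string.

    `columns` maps a DataFrame column name to (column family, qualifier, type).
    """
    mapping = {"rowkey": {"cf": "rowkey", "col": "key", "type": "string"}}
    for name, (family, qualifier, col_type) in columns.items():
        mapping[name] = {"cf": family, "col": qualifier, "type": col_type}
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": "key",
        "columns": mapping,
    })


def read_hbase(spark, catalog):
    """Load the HBase table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.options(catalog=catalog).format(SHC_FORMAT).load()


# Hypothetical table and columns, for illustration only.
catalog = build_catalog(
    "SensorAverages",
    {
        "sensor": ("Reading", "Sensor", "string"),
        "average": ("Reading", "Average", "double"),
    },
)
print(catalog)
```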

Potential use cases

Companies can use this solution to retrieve or ingest data from multiple sources and make real-time business decisions. Scenarios include:

  • Analyzing data from Internet of Things (IoT) sensors for quality detection, fault analysis, and maintenance event prediction.
  • Integrating weather feed or sensor data into business processes.
  • Analyzing real-time stock market data.
  • Analyzing current market conditions.
  • Analyzing trends in real-time sales.

The solution applies to the following industries:

  • Agriculture
  • Retail
  • Finance
  • Insurance

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal authors:

Next steps