This article is a solution idea.
This article outlines a solution for ingesting and processing millions of streaming events per second. Core components include Azure HDInsight, Apache Kafka, Apache Storm, and Apache Spark.
Apache®, Apache Kafka, Apache Storm, Apache Spark, Apache HBase, and the flame logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
- Kafka ingests streaming data.
- Storm and Spark process the data.
- Apache HBase, which is a NoSQL database, stores results.
- Users consume the data in apps.
- The data is visualized in Power BI.
- HDInsight stores data in Azure Data Lake Storage for secure and scalable processing in the cloud.
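The dataflow above can be illustrated with a small in-memory sketch. This is not HDInsight code: a `deque` stands in for a Kafka topic, a tumbling-window count stands in for the Storm or Spark processing step, and a dictionary keyed by a composite row key stands in for an HBase table. Every name here (`Event`, `ingest`, `process_window`, `store`, the `sensor_id` field, and the 10-second window) is an assumption made for illustration.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sensor_id: str    # hypothetical event key
    timestamp: float  # seconds since an arbitrary epoch


WINDOW_SECONDS = 10  # assumed tumbling-window size


def ingest(topic, events):
    """Append events to the in-memory 'topic' (stand-in for Kafka ingestion)."""
    topic.extend(events)


def process_window(topic):
    """Drain the topic and count events per (window start, sensor)."""
    counts = Counter()
    while topic:
        event = topic.popleft()
        window_start = int(event.timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, event.sensor_id)] += 1
    return counts


def store(table, counts):
    """Write results under a composite row key (stand-in for an HBase put)."""
    for (window_start, sensor_id), count in counts.items():
        row_key = f"{sensor_id}#{window_start:010d}"
        table[row_key] = table.get(row_key, 0) + count


topic = deque()
table = {}
ingest(topic, [Event("s1", 1.0), Event("s1", 4.5), Event("s2", 9.9), Event("s1", 12.0)])
store(table, process_window(topic))
```

Zero-padding the window start in the row key keeps rows for one sensor sorted by time, which is one common way to support range scans in a key-ordered store. In a real deployment, each stand-in would be replaced by the corresponding HDInsight cluster.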
Many Apache components are a good fit for systems that stream a large volume of data:
- Kafka is a widely used high-performance event-streaming platform.
- Storm is a computation system that quickly processes large volumes of data in real time.
- Spark is a data processing framework that uses in-memory data sharing.
- HBase is a schemaless database that provides random access and strong consistency for large amounts of data.
These components offer the added advantage of being open source. By using HDInsight, you can run these Apache components in an Azure environment.
HDInsight is an enterprise-scale analytics service in the cloud. This managed-cluster platform simplifies the process of running big data frameworks that use Apache components:
- You can use HDInsight to create optimized clusters for Spark, Kafka, and HBase.
- An HDInsight Spark cluster can use a Spark HBase connector to query an HDInsight HBase cluster.
- HDInsight also offers other benefits, including scalability, security, centralized monitoring, global availability, and extensibility.
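As a rough illustration of the Spark HBase connector mentioned above: the open-source connector describes how HBase columns map to DataFrame columns through a JSON catalog. The helper below builds such a catalog in plain Python. The table name `SensorReadings`, the column family `Metrics`, and the column names are hypothetical, and the catalog layout should be checked against the documentation for the connector version in use.

```python
import json


def build_catalog(table, row_key_column, columns, namespace="default"):
    """Return a JSON catalog string mapping DataFrame columns to HBase columns.

    `columns` maps a DataFrame column name to a tuple of
    (column family, column qualifier, type name).
    """
    # The row key is declared as a column in the special "rowkey" family.
    mapping = {
        row_key_column: {"cf": "rowkey", "col": "key", "type": "string"},
    }
    for name, (family, qualifier, col_type) in columns.items():
        mapping[name] = {"cf": family, "col": qualifier, "type": col_type}
    return json.dumps(
        {
            "table": {"namespace": namespace, "name": table},
            "rowkey": "key",
            "columns": mapping,
        },
        indent=2,
    )


# Hypothetical table and columns, for illustration only.
catalog = build_catalog(
    table="SensorReadings",
    row_key_column="rowkey",
    columns={
        "sensorId": ("Metrics", "SensorId", "string"),
        "eventCount": ("Metrics", "EventCount", "int"),
    },
)
```

On a Spark cluster, a string like `catalog` is typically supplied to the connector's data source as a read option. That step is omitted here because it needs a running Spark and HBase environment.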
Potential use cases
Companies can use this solution to retrieve or ingest data from multiple sources and make real-time business decisions. Scenarios include:
- Analyzing data from Internet of Things (IoT) sensors for quality detection, fault analysis, and maintenance event prediction.
- Integrating weather feed or sensor data into business processes.
- Analyzing real-time stock market data.
- Analyzing current market conditions.
- Analyzing trends in real-time sales.
The solution applies to the following industries:
This article is maintained by Microsoft. It was originally written by the following contributors.
- What is Azure HDInsight?
- Streaming at scale in HDInsight
- Quickstart: Create Apache Hadoop cluster in Azure HDInsight using Azure portal
- Quickstart: Create Apache Spark cluster in Azure HDInsight using Azure portal
- Tutorial: Use Apache HBase in Azure HDInsight
- Introduction to Azure Data Lake Storage Gen2
- Overview of enterprise security in Azure HDInsight
- Extend your on-premises big data investments with HDInsight
- Extract, transform, and load (ETL) using HDInsight
- Optimize marketing with machine learning
- Loan charge-off prediction with Azure HDInsight Spark clusters
- Interactive querying with HDInsight
- Azure Kubernetes in event stream processing
- Instant IoT data streaming with AKS