Use HDInsight and Delta Lake to manage event data

Azure HDInsight
Microsoft Entra ID
Azure Load Balancer
Azure ExpressRoute
Azure Virtual Network

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more information, such as potential use cases, alternative services, implementation considerations, or pricing guidance, let us know by providing GitHub feedback.

This article describes a solution that you can use to ingest and process millions of streaming events per second and then write the events to a Delta Lake table. This solution uses Apache Spark and Apache Kafka in Azure HDInsight.

Apache®, Apache Kafka, and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries or regions. No endorsement by The Apache Software Foundation is implied by the use of these marks.

The Delta Lake Project is a registered trademark of The Linux Foundation in the United States and/or other countries.

Architecture

Diagram that shows the architecture for ingesting and processing streaming data.

Download a Visio file of this architecture.

The Jupyter Notebook logo is a trademark of its respective owner. No endorsement is implied by the use of this mark.

Dataflow

The following dataflow corresponds to the preceding architecture.

  1. Real-time event data, such as IoT event data, is ingested to Apache Kafka via an Apache Kafka producer.

  2. Apache Spark Structured Streaming processes the data in near real-time.

  3. Apache Spark provides sinks for writing transformed and calculated analytics. Processed data is stored in an Azure Data Lake Storage account in Delta Lake table format (see the sketch after this list).

  4. Processed data can also be continuously published back to Apache Kafka for downstream consumers.

  5. The data in the Azure Data Lake Storage account can provide insights for:

    • Near real-time dashboards in Power BI.
    • Azure Machine Learning, where the collected data can be used for machine learning models.
    • Jupyter notebooks that use PySpark or Scala to consume Delta Lake tables.
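
The following PySpark example is a minimal sketch of dataflow steps 1 through 3: it subscribes to a Kafka topic, parses the incoming events, and continuously appends them to a Delta Lake table. The broker address, topic name, event schema, and storage paths are hypothetical placeholders, and it assumes the Kafka and Delta Lake connector packages are available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Hypothetical schema for the incoming IoT event payload.
event_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("eventTime", TimestampType()),
])

# Steps 1-2: subscribe to the Kafka topic and parse each JSON message.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "wn0-kafka:9092")  # placeholder broker
    .option("subscribe", "iot-events")                    # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

# Step 3: continuously append the parsed stream to a Delta Lake table
# in an Azure Data Lake Storage account (placeholder paths).
query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/checkpoints/iot")
    .start("abfss://container@account.dfs.core.windows.net/delta/iot_events")
)

query.awaitTermination()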

Components

  • HDInsight provides open-source components for enterprise analytics. You can run these Apache components in an Azure environment that has enterprise-grade security. HDInsight also offers benefits that include scalability, centralized monitoring, global availability, and extensibility.

  • Apache Kafka in HDInsight is a managed open-source distributed platform that you can use to build real-time streaming data pipelines and applications. Apache Kafka provides high performance and durability so that you can group records into topics, partitions, and consumer groups and multiplex event streams from producers to consumers.

  • Apache Spark in HDInsight is a managed Microsoft implementation of Apache Spark in the cloud and is one of several Spark offerings in Azure.

  • Apache Spark Structured Streaming is a scalable, fault-tolerant stream processing engine that provides exactly-once semantics. It's built on the Spark SQL engine. Structured Streaming queries run in near real-time with low latency. Apache Spark Structured Streaming provides several connectors for data sources and data sinks, and you can join multiple streams from various source types.

  • Apache Spark Structured Streaming with Apache Kafka is used to run batch and streaming queries and store the results in a storage layer, a database, or Apache Kafka.

  • A Delta Lake storage layer provides reliability for data lakes by adding a transactional storage layer on top of data that's stored in cloud storage, such as Azure Storage. This storage layer extends Apache Parquet data files with file-based transaction logs. You can store data in Delta Lake table format to take advantage of benefits like atomicity, consistency, isolation, and durability (ACID) transactions, schema evolution, and table history with time travel (see the sketch after this list).

  • A Power BI Delta Lake table connector is used to read Delta Lake table data from Power BI.

  • Azure Machine Learning is an Azure service where you can use the data that you collect to build, train, and deploy machine learning models.
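
As a minimal sketch of consuming a Delta Lake table from a Jupyter notebook with PySpark, the following example reads the current table state and then uses Delta Lake's versionAsOf reader option to read an earlier version of the table. The table path and column names are hypothetical placeholders, and the Delta Lake package is assumed to be installed on the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-consumer").getOrCreate()

table_path = "abfss://container@account.dfs.core.windows.net/delta/iot_events"  # placeholder path

# Read the current state of the Delta Lake table.
current = spark.read.format("delta").load(table_path)
current.groupBy("deviceId").avg("temperature").show()

# Time travel: read the table as it existed at an earlier version
# by using the Delta Lake versionAsOf reader option.
snapshot = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
snapshot.show()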

Scenario details

Event streaming is a continuous, unbounded sequence of immutable events that flows from an event publisher to subscribers. In some business use cases, you must store these events in raw format and then clean, transform, and aggregate them for various analytics needs. Use event streaming to perform near real-time processing and analysis of events and generate immediate insights.
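
To make the publisher side concrete, here's a minimal Python sketch of a Kafka producer that publishes immutable JSON events to a topic. It assumes the kafka-python package; the broker address, topic name, and event fields are hypothetical placeholders.

import json
import time

from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers=["wn0-kafka:9092"],  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a few immutable JSON events; subscribers consume them in order
# within each partition, and the producer never mutates what it has sent.
for i in range(10):
    event = {
        "deviceId": f"device-{i % 3}",   # hypothetical event fields
        "temperature": 20.0 + i,
        "eventTime": time.time(),
    }
    producer.send("iot-events", value=event)  # placeholder topic

producer.flush()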

Potential use cases

This solution lets your business process immutable event streams in near real-time with exactly-once, fault-tolerant semantics. This approach uses Apache Kafka as an input source for Spark Structured Streaming and uses Delta Lake as a storage layer.

Business scenarios include:

  • Account sign-in fraud detection
  • Analysis of current market conditions
  • Analysis of real-time stock market data
  • Credit card fraud detection
  • Digital image and video processing
  • Drug research and discovery
  • Middleware for enterprise big data solutions
  • Short-sale risk calculation
  • Smart manufacturing and industrial IoT (IIoT)

This solution applies to the following industries:

  • Agriculture
  • Consumer packaged goods (CPG)
  • Cyber security
  • Finance
  • Healthcare
  • Insurance
  • Logistics
  • Manufacturing
  • Retail

Contributors

This article is maintained by Microsoft.

Next steps