Use HDInsight and Delta Lake to manage event data

Azure HDInsight
Microsoft Entra ID
Azure Load Balancer
Azure ExpressRoute
Azure Virtual Network

Solution ideas

This article is a solution idea. If you'd like us to expand the content with more information, such as potential use cases, alternative services, implementation considerations, or pricing guidance, let us know by providing GitHub feedback.

This article describes a solution that you can use to ingest and process millions of streaming events per second and then write the events to a Delta Lake table. This solution uses Apache Spark and Apache Kafka in Azure HDInsight.

Apache®, Apache Kafka, and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries or regions. No endorsement by The Apache Software Foundation is implied by the use of these marks.

The Delta Lake Project is a registered trademark of The Linux Foundation in the United States and/or other countries.

Architecture

Diagram that shows the architecture for ingesting and processing streaming data.

Download a Visio file of this architecture.

The Jupyter Notebook logo is a trademark of its respective owner. No endorsement is implied by the use of this mark.

Dataflow

The following dataflow corresponds to the preceding architecture.

  1. Real-time event data, such as IoT event data, is ingested to Apache Kafka via an Apache Kafka producer.

  2. Apache Spark Structured Streaming processes the data in near real-time.

  3. Apache Spark provides sinks for writing transformed and calculated analytics. Processed data is stored in an Azure Data Lake Storage account in Delta Lake table format (see the sketch after this list).

  4. Processed data can also be continuously published back to Apache Kafka for downstream consumers.

  5. The data in the Azure Data Lake Storage account can provide insights for:

    • Near real-time dashboards in Power BI.
    • Azure Machine Learning, where the collected data can be used for machine learning models.
    • Jupyter notebooks that use PySpark or Scala to consume Delta Lake tables.
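
The following PySpark example is a minimal sketch of dataflow steps 1 through 3: it subscribes to a Kafka topic, parses the incoming events, and continuously appends them to a Delta Lake table. The broker address, topic name, event schema, and storage paths are hypothetical placeholders, and it assumes the Kafka and Delta Lake connector packages are available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Hypothetical schema for the incoming IoT event payload.
event_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("eventTime", TimestampType()),
])

# Steps 1-2: subscribe to the Kafka topic and parse each JSON message.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "wn0-kafka:9092")  # placeholder broker
    .option("subscribe", "iot-events")                    # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

# Step 3: continuously append the parsed stream to a Delta Lake table
# in an Azure Data Lake Storage account (placeholder paths).
query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/checkpoints/iot")
    .start("abfss://container@account.dfs.core.windows.net/delta/iot_events")
)

query.awaitTermination()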

Components

  • HDInsight provides open-source components for enterprise analytics. You can run these Apache components in an Azure environment that has enterprise-grade security. HDInsight also offers benefits that include scalability, centralized monitoring, global availability, and extensibility.

  • Apache Kafka in HDInsight is a managed open-source distributed platform that you can use to build real-time streaming data pipelines and applications. Apache Kafka provides high performance and durability so that you can group records into topics, partitions, and consumer groups and multiplex event streams from producers to consumers.

  • Apache Spark in HDInsight is a managed Microsoft implementation of Apache Spark in the cloud and is one of several Spark offerings in Azure.

  • Apache Spark Structured Streaming is a scalable, fault-tolerant stream processing engine that provides exactly-once semantics. It's built on the Spark SQL engine. Structured Streaming queries run in near real-time with low latency. Apache Spark Structured Streaming provides several connectors for data sources and data sinks, and you can join multiple streams from various source types.

  • Apache Spark Structured Streaming with Apache Kafka is used to run batch and streaming queries and store the results in a storage layer, a database, or Apache Kafka.

  • A Delta Lake storage layer provides reliability for data lakes by adding a transactional storage layer on top of data that's stored in cloud storage, such as Azure Storage. This storage layer extends Apache Parquet data files with file-based transaction logs. You can store data in Delta Lake table format to take advantage of benefits like atomicity, consistency, isolation, and durability (ACID) transactions, schema evolution, and table history with time travel (see the sketch after this list).

  • A Power BI Delta Lake table connector is used to read Delta Lake table data from Power BI.

  • Azure Machine Learning is an Azure service where you can use the data that you collect to build, train, and deploy machine learning models.
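
As a minimal sketch of consuming a Delta Lake table from a Jupyter notebook with PySpark, the following example reads the current table state and then uses Delta Lake's versionAsOf reader option to read an earlier version of the table. The table path and column names are hypothetical placeholders, and the Delta Lake package is assumed to be installed on the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-consumer").getOrCreate()

table_path = "abfss://container@account.dfs.core.windows.net/delta/iot_events"  # placeholder path

# Read the current state of the Delta Lake table.
current = spark.read.format("delta").load(table_path)
current.groupBy("deviceId").avg("temperature").show()

# Time travel: read the table as it existed at an earlier version
# by using the Delta Lake versionAsOf reader option.
snapshot = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
snapshot.show()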

Scenario details

Event streaming is a continuous, unbounded sequence of immutable events that flows from an event publisher to subscribers. In some business use cases, you must store these events in raw format and then clean, transform, and aggregate them for various analytics needs. Use event streaming to perform near real-time processing and analysis of events and generate immediate insights.
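
To make the publisher side concrete, here's a minimal Python sketch of a Kafka producer that publishes immutable JSON events to a topic. It assumes the kafka-python package; the broker address, topic name, and event fields are hypothetical placeholders.

import json
import time

from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers=["wn0-kafka:9092"],  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a few immutable JSON events; subscribers consume them in order
# within each partition, and the producer never mutates what it has sent.
for i in range(10):
    event = {
        "deviceId": f"device-{i % 3}",   # hypothetical event fields
        "temperature": 20.0 + i,
        "eventTime": time.time(),
    }
    producer.send("iot-events", value=event)  # placeholder topic

producer.flush()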

Potential use cases

This solution lets your business process immutable event streams in near real-time with exactly-once, fault-tolerant semantics. This approach uses Apache Kafka as an input source for Spark Structured Streaming and uses Delta Lake as a storage layer.

Business scenarios include:

  • Account sign-in fraud detection
  • Analysis of current market conditions
  • Analysis of real-time stock market data
  • Credit card fraud detection
  • Digital image and video processing
  • Drug research and discovery
  • Middleware for enterprise big data solutions
  • Short-sale risk calculation
  • Smart manufacturing and industrial IoT (IIoT)

This solution applies to the following industries:

  • Agriculture
  • Consumer packaged goods (CPG)
  • Cyber security
  • Finance
  • Healthcare
  • Insurance
  • Logistics
  • Manufacturing
  • Retail

Contributors

This article is maintained by Microsoft.

Next steps