Stream processing with fully managed open-source data engines

Azure Event Hubs
Azure Kubernetes Service (AKS)
Azure Cosmos DB
Azure Database for PostgreSQL
Azure Cache for Redis

This article presents an example of a streaming solution that uses fully managed Azure data services.

Architecture

Architecture diagram showing how streaming data flows through a system. Kafka, Kubernetes, Cassandra, PostgreSQL, and Redis components make up the system.

Download a Visio file of this architecture.

Workflow

  1. The Event Hubs for Apache Kafka feature streams events from Kafka producers.

  2. Apache Spark consumes events. AKS provides a managed environment for the Apache Spark jobs.

  3. An application that uses Azure Cosmos DB for Apache Cassandra writes events to Cassandra. This database serves as a storage platform for events. AKS hosts the microservices that write to Cassandra.

  4. The change feed feature of Azure Cosmos DB processes events in real time.

  5. Scheduled applications run batch-oriented processing on events stored in Cassandra.

  6. Stores of reference data enrich event information. Batch-oriented applications write the enriched event information to PostgreSQL. Typical reference data stores include:

  7. A batch-oriented application processes Cassandra data. That application stores the processed data in Azure Database for PostgreSQL. This relational data store provides data to downstream applications that require enriched information.

  8. Reporting applications and tools analyze the PostgreSQL database data. For example, Power BI connects to the database by using the Azure Database for PostgreSQL connector. This reporting service then displays rich visuals of the data.

  9. Azure Cache for Redis provides an in-memory cache. In this solution, the cache contains data on critical events. An application stores data to the cache and retrieves data from the cache.

  10. Websites and other applications use the cached data to improve response times. Sometimes data isn't available in the cache. In those cases, these applications use the cache-aside pattern or a similar strategy to retrieve data from Cassandra in Azure Cosmos DB.

Components

  • Event Hubs is a fully managed streaming platform that can process millions of events per second. Event Hubs provides an endpoint for Apache Kafka, a widely used open-source stream-processing platform. When organizations use the endpoint feature, they don't need to build and maintain Kafka clusters for stream processing. Instead, they can benefit from the fully managed Kafka implementation that Event Hubs offers.

  • Azure Cosmos DB is a fully managed NoSQL and relational database that offers multi-master replication. Azure Cosmos DB supports open-source APIs for many databases, languages, and platforms. Examples include:

    Through the Azure Cosmos DB for Apache Cassandra, you can access Azure Cosmos DB data by using Apache Cassandra tools, languages, and drivers. Apache Cassandra is an open-source NoSQL database that's well suited for heavy write-intensive workloads.

  • Azure Kubernetes Service (AKS) is a highly available, secure, and fully managed Kubernetes service. Kubernetes is a rapidly evolving open-source platform for managing containerized workloads. AKS hosts open-source big data processing engines such as Apache Spark. By using AKS, you can run large-scale stream processing jobs in a managed environment.

  • Azure Database for PostgreSQL is a fully managed relational database service. It provides high availability, elastic scaling, patching, and other management capabilities for PostgreSQL. PostgreSQL is a widely adopted open-source relational database management system.

  • Azure Cache for Redis provides an in-memory data store based on the Redis software. Redis is a popular open-source in-memory data store. Session stores, content caches, and other storage components use Redis to improve performance and scalability. Azure Cache for Redis provides open-source Redis capabilities as a fully managed offering.

Alternatives

You can replace the open-source-compatible products and services in this solution with others. For details on open-source services available in Azure, see Open source on Azure.

Scenario details

Fully managed Azure data services that run open-source engines make up this streaming solution:

  • Azure Event Hubs offers a Kafka implementation for stream ingestion.
  • Azure Cosmos DB supports event storage in Cassandra.
  • AKS hosts Kubernetes microservices for stream processing.
  • Azure Database for PostgreSQL manages relational data storage in PostgreSQL.
  • Azure Cache for Redis manages Redis in-memory data stores.

Open-source technologies offer many benefits. For instance, organizations can use open-source technologies to:

  • Migrate existing workloads.
  • Tap into the broad open-source community.
  • Limit vendor lock-in.

By making open-source technologies accessible, Azure tools and services help organizations take advantage of these benefits and develop the solutions of their choice.

This solution uses fully managed platform as a service (PaaS) services. As a result, Microsoft handles patching, service-level agreement (SLA) maintenance, and other management tasks. Another benefit is the native integration with the Azure security infrastructure.

Potential use cases

This solution applies to various scenarios:

  • Using Azure PaaS services to build modern streaming solutions that use open-source technologies
  • Migrating open-source stream processing solutions to Azure

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Design and implement each service with best practices in mind. For guidelines on each service, see the Microsoft documentation site. Also review the information in the following sections:

Performance

  • Implement connection pooling for Azure Database for PostgreSQL. You can use a connection pooling library within the application. Or you can use a connection pooler such as PgBouncer or Pgpool. Establishing a connection with PostgreSQL is an expensive operation. With connection pooling, you can avoid degrading application performance. PgBouncer is built-in in Azure Database for PostgreSQL Flexible Server.

  • Configure Azure Cosmos DB for Apache Cassandra for best performance by using an appropriate partitioning strategy. Decide whether to use a single field primary key, a compound primary key, or a composite partition key when partitioning tables.

Scalability

  • Take your streaming requirements into account when choosing an Event Hubs tier:

    • For mid-range throughput requirements of less than 120 MBps, consider the Premium tier. This tier scales elastically to meet streaming requirements.
    • For high-end streaming workloads with an ingress of gigabytes of data, consider the Dedicated tier. This tier is a single-tenant offering with a guaranteed capacity. You can scale dedicated clusters up and down.
  • Consider autoscale-provisioned throughput for Azure Cosmos DB if your workloads are unpredictable and spiky. You can configure Azure Cosmos DB to use manually provisioned throughput or autoscale-provisioned throughput. With autoscale, Azure automatically and instantly scales the request units per second according to your usage.

Security

Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

  • Use Azure Private Link to make Azure services part of your virtual network. When you use Private Link, traffic between the services and your network flows over the Azure backbone without traversing the public internet. The Azure services in this solution support Private Link for selected SKUs.

  • Check your organization's security policies. With Azure Cosmos DB for Apache Cassandra, keys provide access to resources like key spaces and tables. The Azure Cosmos DB instance stores those keys. Your security policies might require you to propagate those keys to a key management service such as Azure Key Vault. Also make sure to rotate keys according to your organization's policies.

Resiliency

Consider using Availability zones to protect business-critical applications from datacenter failures. This solution's services support availability zones for selected SKUs in availability zone–enabled regions. For up-to-date information, review the list of services that support availability zones.

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.

To estimate the cost of this solution, use the Azure pricing calculator. Also keep these points in mind:

  • Event Hubs is available in Basic, Standard, Premium, and Dedicated tiers. The Premium or Dedicated tier is best for large-scale streaming workloads. You can scale throughput, so consider starting small and then scaling up as demand increases.

  • Azure Cosmos DB offers two models:

    • A provisioned throughput model that's ideal for demanding workloads. This model is available in two capacity management options: standard and autoscale.
    • A serverless model that's well suited for running small, spiky workloads.
  • An AKS cluster consists of a set of nodes, or virtual machines (VMs), that run in Azure. The cost of the compute, storage, and networking components make up a cluster's primary costs.

  • Azure Database for PostgreSQL is available in Single Server and Flexible Server tiers. Different tiers cater to different scenarios, such as predicable, burstable, and high-performance workloads. The costs mainly depend on the choice of compute nodes and storage capacity. For new workloads, consider choosing the Flexible Server tier since it has a wider range of supported capabilities over the Single Server tier. Also note that Single Server is on the path to deprecation.

  • Azure Cache for Redis is available in multiple tiers. These tiers accommodate caches that range from 250 megabytes to several terabytes. Besides size, other requirements also affect the choice of tier:

    • Clustering
    • Persistence
    • Active geo-replication

Deploy this scenario

Keep these points in mind when you deploy this solution:

Contributors

This article is being updated and maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Next steps

To learn about related solutions, see the following information: