Apache Kafka migration to Azure

Azure HDInsight
Azure Cosmos DB
Azure Data Lake Storage
Azure Stream Analytics

Apache Kafka is a highly scalable, fault-tolerant distributed messaging system that implements a publish-subscribe architecture. It's used as an ingestion layer in real-time streaming scenarios, such as Internet of Things (IoT) and real-time log monitoring systems. It's also increasingly used as the immutable, append-only data store in Kappa architectures.

Apache®, Apache Spark®, Apache Hadoop®, Apache HBase, Apache Storm®, Apache Sqoop®, Apache Kafka®, and the flame logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Migration approach

This article presents several strategies for migrating Kafka to Azure.

Here's a decision flowchart to help you choose a strategy.

Diagram that shows a decision chart for determining a strategy for migrating Kafka to Azure.

Migrate Kafka to Azure IaaS

For one way to migrate Kafka to Azure IaaS, see Kafka on Ubuntu virtual machines.

Migrate Kafka to Event Hubs for Kafka

Event Hubs provides an endpoint that's compatible with the Apache Kafka producer and consumer APIs. Most Apache Kafka client applications can use this endpoint, so you can use it as an alternative to running a Kafka cluster on Azure. The endpoint supports clients that use API versions 1.0 and later. For more information about this feature, see Event Hubs for Apache Kafka overview.

To learn how to migrate your Apache Kafka applications to use Event Hubs, see Migrate to Event Hubs for Apache Kafka ecosystems.
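For example, a Kafka client can often be pointed at the Event Hubs endpoint through configuration alone. The following Java-client properties are a minimal sketch; the namespace name and connection string are placeholders that you replace with your own values.

```properties
# Event Hubs Kafka endpoint: port 9093, TLS, and SASL PLAIN with the
# namespace connection string supplied as the password.
bootstrap.servers=mynamespace.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<your-event-hubs-connection-string>";
```

Because only connection settings change, existing producer and consumer code that uses API versions 1.0 and later can typically run against Event Hubs without modification.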

Features of Kafka and Event Hubs

| Similarities between Kafka and Event Hubs | Differences between Kafka and Event Hubs |
|---|---|
| Both use partitions. | Platform as a service versus software |
| Partitions are independent. | Partitioning |
| Both use a client-side cursor concept. | APIs |
| Both can scale to very high workloads. | Runtime |
| They're nearly identical conceptually. | Protocols |
| Neither uses the HTTP protocol for receiving. | Durability |
| | Security |
| | Throttling |
Partitioning differences
| Kafka | Event Hubs |
|---|---|
| Partition count manages scale. | Throughput units manage scale. |
| You must load balance partitions across machines. | Load balancing is automatic. |
| You must manually reshard by using split and merge. | Repartitioning isn't required. |
Durability differences
| Kafka | Event Hubs |
|---|---|
| Volatile by default | Always durable |
| Replicated after an acknowledgment (ACK) is received | Replicated before an ACK is sent |
| Depends on disk and quorum | Provided by storage |
Security differences
| Kafka | Event Hubs |
|---|---|
| Secure Sockets Layer (SSL) and Simple Authentication and Security Layer (SASL) | Shared Access Signature (SAS) and SASL PLAIN (RFC 4616) |
| File-like access control lists | Policy |
| Optional transport encryption | Mandatory Transport Layer Security (TLS) |
| User based | Token based (unlimited) |
Other differences
| Kafka | Event Hubs |
|---|---|
| Doesn't throttle | Supports throttling |
| Uses a proprietary protocol | Uses AMQP 1.0 protocol |
| Doesn't use HTTP for send | Uses HTTP send and batch send |

Migrate Kafka on HDInsight

You can migrate Kafka to Kafka on HDInsight. For more information, see What is Apache Kafka in HDInsight?.

Use AKS with Kafka on HDInsight

For more information, see Use AKS with Apache Kafka on HDInsight.

Use Kafka on AKS with the Strimzi Operator

For more information, see Deploy a Kafka cluster on AKS by using Strimzi.

Kafka data migration

You can use Kafka's MirrorMaker tool to replicate topics from one cluster to another. This technique can help you migrate data after a Kafka cluster is provisioned. For more information, see Use MirrorMaker to replicate Apache Kafka topics with Kafka on HDInsight.
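As a sketch, a minimal MirrorMaker 2 configuration that replicates every topic from a source cluster to a target cluster might look like the following. The cluster aliases and bootstrap addresses are illustrative placeholders.

```properties
# connect-mirror-maker.properties (MirrorMaker 2) - illustrative values
clusters = source, target
source.bootstrap.servers = source-kafka.internal:9092
target.bootstrap.servers = target-kafka.example.net:9092

# Replicate all topics from the source cluster to the target cluster.
source->target.enabled = true
source->target.topics = .*

# Don't replicate in the reverse direction.
target->source.enabled = false
```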

The following migration approach uses mirroring:

  1. Move producers first. Migrating the producers prevents new messages from being produced to the source Kafka cluster.

  2. After consumers read all remaining messages from the source Kafka cluster, you can migrate the consumers.

The implementation includes the following steps:

  1. Change the Kafka connection address of the producer client to point to the new Kafka instance.

  2. Restart the producer business services and send new messages to the new Kafka instance.

  3. Wait for the data in the source Kafka to be consumed.

  4. Change the Kafka connection address of the consumer client to point to the new Kafka instance.

  5. Restart the consumer business services to consume messages from the new Kafka instance.

  6. Verify that consumers succeed in getting data from the new Kafka instance.
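The cutover logic in the preceding steps can be sketched in a few lines. The following Python snippet is illustrative only: the cluster addresses and the `client_config` helper are hypothetical, and a real client would pass the returned settings to its Kafka producer or consumer constructor.

```python
# Hypothetical sketch of the producers-first cutover. Producers point at the
# new cluster immediately; consumers stay on the source cluster until its
# backlog is drained, then follow. Addresses are placeholders.

SOURCE_BOOTSTRAP = "source-kafka.internal:9092"
TARGET_BOOTSTRAP = "new-kafka.example.net:9092"

def client_config(phase: str, role: str) -> dict:
    """Return the Kafka connection settings for a client during migration.

    phase: "draining" while the source cluster still holds unread messages,
           "cutover" after the backlog is consumed.
    role:  "producer" or "consumer".
    """
    if role == "producer":
        # Steps 1-2: producers restart against the new cluster right away.
        bootstrap = TARGET_BOOTSTRAP
    elif phase == "draining":
        # Step 3: consumers keep reading the source until it's empty.
        bootstrap = SOURCE_BOOTSTRAP
    else:
        # Steps 4-5: consumers restart against the new cluster.
        bootstrap = TARGET_BOOTSTRAP
    return {"bootstrap.servers": bootstrap}

# During the drain, producers and consumers point at different clusters.
print(client_config("draining", "producer"))
print(client_config("draining", "consumer"))
# After cutover, everything points at the new cluster.
print(client_config("cutover", "consumer"))
```

Step 6 then amounts to confirming that consumers configured with the new bootstrap address receive the newly produced messages.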

Monitor the Kafka cluster

You can use Azure Monitor logs to analyze logs that Apache Kafka on HDInsight generates. For more information, see Analyze logs for Apache Kafka on HDInsight.

Apache Kafka Streams API

The Kafka Streams API makes it possible to process data in near real-time and to join and aggregate data. For more information, see Introducing Kafka Streams: Stream Processing Made Simple - Confluent.

The Microsoft and Confluent partnership

Confluent provides a cloud-native service for Apache Kafka. Microsoft and Confluent have a strategic alliance. For more information, see the following resources:

Contributors

Microsoft maintains this article. It was originally written by the following contributors.

Principal authors:

Other contributors:


Next steps

Azure product introductions

Azure product reference

Other