Apache Kafka is a highly scalable and fault-tolerant distributed messaging system that implements a publish-subscribe architecture. It's used as an ingestion layer in real-time streaming scenarios, such as IoT and real-time log monitoring systems. It's also increasingly used as the immutable append-only data store in Kappa architectures.
Apache®, Apache Spark®, Apache Hadoop®, Apache HBase, Apache Storm®, Apache Sqoop®, Apache Kafka®, and the flame logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
This article presents various strategies for migrating Kafka to Azure:
- Migrate Kafka to Azure infrastructure as a service (IaaS)
- Migrate Kafka to Azure Event Hubs for Kafka
- Migrate Kafka on Azure HDInsight
- Use AKS with Kafka on HDInsight
Here's a decision flowchart to help you choose a strategy:
Migrate Kafka to Azure infrastructure as a service (IaaS)
For one way to migrate Kafka to Azure IaaS, see Kafka on Ubuntu VMs.
Migrate Kafka to Azure Event Hubs for Kafka
Event Hubs provides an endpoint that's compatible with the Apache Kafka producer and consumer APIs. This endpoint can be used by most Apache Kafka client applications, so it's an alternative to running a Kafka cluster on Azure. The endpoint supports clients that use versions 1.0 and later of the APIs. For more information about this feature, see Azure Event Hubs for Apache Kafka overview.
To learn how to migrate your Apache Kafka applications to use Azure Event Hubs, see Migrate to Azure Event Hubs for Apache Kafka Ecosystems.
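As a rough illustration of what pointing an existing client at the Event Hubs Kafka endpoint can involve, the following sketch builds a client configuration dictionary. It assumes librdkafka-style property names (as used by clients such as confluent-kafka), port 9093 on the namespace host, and SASL/PLAIN authentication with a namespace connection string. The function name and the namespace value are hypothetical, and the connection string is deliberately left elided. Verify the settings against the Event Hubs documentation before relying on them.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build Kafka client settings that target an Event Hubs Kafka endpoint.

    Hypothetical helper for illustration; it is not part of any SDK.
    """
    return {
        # Assumption: the Kafka endpoint is exposed on port 9093 of the namespace host.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        # Assumption: the endpoint expects TLS with SASL/PLAIN authentication.
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # Assumption: the user name is the literal string "$ConnectionString",
        # and the password is the namespace connection string.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


if __name__ == "__main__":
    config = event_hubs_kafka_config(
        "my-namespace",  # hypothetical namespace name
        "Endpoint=sb://my-namespace.servicebus.windows.net/;...",  # elided on purpose
    )
    # Print every setting except the secret.
    for key, value in config.items():
        print(key, "=", "<redacted>" if key == "sasl.password" else value)
```

Because only connection settings change, producer and consumer code that uses the standard Kafka APIs can often stay as it is.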
Kafka and Event Hubs feature differences
| How are Kafka and Event Hubs similar? | How are Kafka and Event Hubs different? |
|---|---|
| Both use partitions. | There are differences in these areas: |
| Partitions are independent. | • PaaS vs. software |
| Both use a client-side cursor concept. | |
| Both can scale to very high workloads. | |
| Conceptually they are nearly the same. | |
| Neither uses the HTTP protocol for receiving. | |

The following table lists specific differences:

| Kafka | Event Hubs |
|---|---|
| Scale is managed by partition count. | Scale is managed by throughput units. |
| You must load-balance partitions across machines. | Load balancing is automatic. |
| You must manually re-shard by using split and merge. | Repartitioning isn't required. |
| Volatile by default | |
| Replicated after ACK | Replicated before ACK |
| Depends on disk and quorum | Provided by storage |
| SSL and SASL | SAS and SASL/PLAIN (RFC 4616) |
| Optional transport encryption | |
| | Token based (unlimited) |
| Kafka doesn't throttle. | Event Hubs supports throttling. |
| Kafka uses a proprietary protocol. | Event Hubs uses AMQP 1.0 protocol. |
| Kafka doesn't use HTTP for send. | Event Hubs uses HTTP Send and Batch Send. |
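To make the point that Event Hubs scale is managed by throughput units more concrete, here's a minimal sizing sketch. The per-unit quotas in the constants (1 MB/s or 1,000 events/s of ingress, and 2 MB/s or 4,096 events/s of egress) are assumptions based on the standard tier. They aren't stated in this article, so check them against the current Event Hubs quotas. The function name is hypothetical.

```python
import math

# Assumed standard-tier quotas per throughput unit; verify before use.
INGRESS_MB_PER_TU = 1.0
INGRESS_EVENTS_PER_TU = 1000
EGRESS_MB_PER_TU = 2.0
EGRESS_EVENTS_PER_TU = 4096


def required_throughput_units(
    ingress_mb_s: float,
    ingress_events_s: float,
    egress_mb_s: float,
    egress_events_s: float,
) -> int:
    """Estimate the throughput units needed to cover the tightest quota."""
    needed = max(
        ingress_mb_s / INGRESS_MB_PER_TU,
        ingress_events_s / INGRESS_EVENTS_PER_TU,
        egress_mb_s / EGRESS_MB_PER_TU,
        egress_events_s / EGRESS_EVENTS_PER_TU,
    )
    # A namespace needs at least one unit, and units are whole numbers.
    return max(1, math.ceil(needed))


if __name__ == "__main__":
    # 3.5 MB/s in, 2,500 events/s in, 7 MB/s out, 5,000 events/s out.
    print(required_throughput_units(3.5, 2500, 7.0, 5000))  # prints 4
```

In contrast, sizing a self-managed Kafka cluster for the same load means choosing a partition count and distributing those partitions across brokers.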
Migrate Kafka on Azure HDInsight
You can migrate Kafka to Kafka on Azure HDInsight. For more information, see What is Apache Kafka in Azure HDInsight?.
Use AKS with Kafka on HDInsight
Kafka data migration
You can use Kafka's MirrorMaker tool to replicate topics from one cluster to another. This technique can help you migrate data after a Kafka cluster is provisioned. For more information, see Use MirrorMaker to replicate Apache Kafka topics with Kafka on HDInsight.
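As a sketch of what a mirroring setup can look like, the following code renders the two properties files that the legacy MirrorMaker tool reads and assembles a command line for it. The broker host names, the group name, and the helper functions are hypothetical. The command-line options shown (`--consumer.config`, `--producer.config`, `--whitelist`, `--num.streams`) are those of the original MirrorMaker script and differ in newer Kafka releases, so treat this as an illustration under those assumptions and not as a reference.

```python
def render_properties(settings: dict) -> str:
    """Render a dict as the key=value lines of a Java .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


def mirror_maker_command(topic_pattern: str, num_streams: int = 4) -> list:
    """Assemble an argument list for the legacy MirrorMaker script (assumed flags)."""
    return [
        "kafka-mirror-maker.sh",
        "--consumer.config", "consumer.properties",
        "--producer.config", "producer.properties",
        "--whitelist", topic_pattern,
        "--num.streams", str(num_streams),
    ]


# The consumer side reads from the source cluster (hypothetical hosts).
consumer_properties = render_properties({
    "bootstrap.servers": "source-broker-1:9092,source-broker-2:9092",
    "group.id": "mirror-maker-migration",
    # Start from the oldest retained message so existing data is copied too.
    "auto.offset.reset": "earliest",
})

# The producer side writes to the newly provisioned target cluster.
producer_properties = render_properties({
    "bootstrap.servers": "target-broker-1:9092,target-broker-2:9092",
})

if __name__ == "__main__":
    print(consumer_properties)
    print(producer_properties)
    print(" ".join(mirror_maker_command("orders")))
```

Writing `consumer_properties` and `producer_properties` to the file names used in the command, and then running the command on a host that can reach both clusters, would start the replication.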
Here's a migration approach that uses mirroring:
- Move producers first and then move consumers. When you migrate the producers, you stop new messages from being produced to the source Kafka cluster.
- After the consumers read all remaining messages from the source Kafka cluster, you can migrate the consumers.
Here are the implementation steps:
- Change the Kafka connection address of the producer client to point to the new Kafka instance.
- Restart the producer business services and send new messages to the new Kafka instance.
- Wait for the data in the source Kafka to be consumed.
- Change the Kafka connection address of the consumer client to point to the new Kafka instance.
- Restart the consumer business services to consume messages from the new Kafka instance.
- Verify that consumers succeed in getting data from the new Kafka instance.
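The step that waits for the source Kafka data to be consumed can be checked by comparing, for each partition, the log end offset with the consumer group's committed offset. The sketch below assumes you've already collected those two sets of offsets, for example from the `kafka-consumer-groups.sh --describe` output. The function names and sample numbers are hypothetical.

```python
def remaining_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Return the number of unread messages per partition.

    A partition without a committed offset is treated as unread from offset 0,
    which can overestimate lag if retention has already deleted old messages.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def source_is_drained(end_offsets: dict, committed_offsets: dict) -> bool:
    """True when the consumer group has read every message on every partition."""
    lag = remaining_lag(end_offsets, committed_offsets)
    return all(count == 0 for count in lag.values())


if __name__ == "__main__":
    end_offsets = {0: 120, 1: 98}  # log end offset per partition
    committed = {0: 120, 1: 95}    # consumer group's committed offsets
    print(remaining_lag(end_offsets, committed))      # {0: 0, 1: 3}
    print(source_is_drained(end_offsets, committed))  # False
```

Only when the check reports that the source is drained, and producers are already writing to the new instance, is it safe to repoint and restart the consumers.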
Monitor the Kafka cluster
You can use Azure Monitor logs to analyze logs that are generated by Apache Kafka on HDInsight. For more information, see Analyze logs for Apache Kafka on HDInsight.
Apache Kafka Streams API
The Kafka Streams API makes it possible to process data in near real time and to join and aggregate data. The API has many more features that are worth knowing about. For more information, see Introducing Kafka Streams: Stream Processing Made Simple - Confluent.
The Microsoft and Confluent partnership
Confluent provides a cloud-native service for Apache Kafka. Microsoft and Confluent have a strategic alliance. For more information, see:
- Confluent and Microsoft Announce Strategic Alliance
- Introducing seamless integration between Microsoft Azure and Confluent Cloud
This article is maintained by Microsoft. It was originally written by the following contributors.
- Namrata Maheshwary | Senior Cloud Solution Architect
- Raja N | Director, Customer Success
- Hideo Takagi | Cloud Solution Architect
- Ram Yerrabotu | Senior Cloud Solution Architect
- Ram Baskaran | Senior Cloud Solution Architect
- Jason Bouska | Senior Software Engineer
- Eugene Chung | Senior Cloud Solution Architect
- Pawan Hosatti | Senior Cloud Solution Architect - Engineering
- Daman Kaur | Cloud Solution Architect
- Danny Liu | Senior Cloud Solution Architect - Engineering
- Jose Mendez | Senior Cloud Solution Architect
- Ben Sadeghi | Senior Specialist
- Sunil Sattiraju | Senior Cloud Solution Architect
- Amanjeet Singh | Principal Program Manager
- Nagaraj Seeplapudur Venkatesan | Senior Cloud Solution Architect - Engineering
Azure product introductions
- Introduction to Azure Data Lake Storage Gen2
- What is Apache Spark in Azure HDInsight?
- What is Apache Hadoop in Azure HDInsight?
- What is Apache HBase in Azure HDInsight?
- What is Apache Kafka in Azure HDInsight?
- Overview of enterprise security in Azure HDInsight
Azure product reference
- Microsoft Entra documentation
- Azure Cosmos DB documentation
- Azure Data Factory documentation
- Azure Databricks documentation
- Azure Event Hubs documentation
- Azure Functions documentation
- Azure HDInsight documentation
- Microsoft Purview data governance documentation
- Azure Stream Analytics documentation
- Azure Synapse Analytics
- Enterprise Security Package for Azure HDInsight
- Develop Java MapReduce programs for Apache Hadoop on HDInsight
- Use Apache Sqoop with Hadoop in HDInsight
- Overview of Apache Spark Streaming
- Structured Streaming tutorial
- Use Azure Event Hubs from Apache Kafka applications