Configure storage and scalability for Apache Kafka on HDInsight

Learn how to configure the number of managed disks used by Apache Kafka on HDInsight.

Kafka on HDInsight uses the local disks of the virtual machines in the HDInsight cluster. Because Kafka is very I/O heavy, Azure Managed Disks are used to provide high throughput and more storage per node. If traditional virtual hard drives (VHDs) were used for Kafka, each node would be limited to 1 TB. With managed disks, you can attach multiple disks to reach up to 16 TB for each node in the cluster.

The following diagram provides a comparison between Kafka on HDInsight before managed disks, and Kafka on HDInsight with managed disks:

Diagram: Kafka with managed disks architecture.

Configure managed disks: Azure portal

  1. Follow the steps in Create an HDInsight cluster to learn the common steps for creating a cluster in the portal, but don't complete the portal creation process.

  2. From the Configuration & Pricing section, use the Number of Nodes field to configure the number of disks.

    Note

    The type of managed disk can be either Standard (HDD) or Premium (SSD). Premium disks are used with DS and GS series VMs. All other VM types use standard.

    Screenshot: cluster size section with the disks per worker node highlighted.

Configure managed disks: Resource Manager template

To control the number of disks used by the worker nodes in a Kafka cluster, use the following section of the template:

"dataDisksGroups": [
    {
        "disksPerNode": "[variables('disksPerWorkerNode')]"
    }
],
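
For context, the following is a minimal sketch of where this section might sit in a template. It assumes the `disksPerWorkerNode` variable is declared in the template's `variables` section and that `dataDisksGroups` belongs to the `workernode` role under `computeProfile`. The values `2` and `4` are illustrative only. The fragment is trimmed (required properties such as `apiVersion`, `name`, and `location` are omitted), so it isn't deployable as written.

```json
{
    "variables": {
        "disksPerWorkerNode": 2
    },
    "resources": [
        {
            "type": "Microsoft.HDInsight/clusters",
            "properties": {
                "computeProfile": {
                    "roles": [
                        {
                            "name": "workernode",
                            "targetInstanceCount": 4,
                            "dataDisksGroups": [
                                {
                                    "disksPerNode": "[variables('disksPerWorkerNode')]"
                                }
                            ]
                        }
                    ]
                }
            }
        }
    ]
}
```

Keeping the disk count in a variable means a single value controls every place the template refers to it. Under these illustrative values, each of the four worker nodes would get two managed disks.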

Next steps

For more information on working with Apache Kafka on HDInsight, see the following documents: