Create Spark cluster in HDInsight on AKS (Preview)

Note

We will retire Azure HDInsight on AKS on January 31, 2025. Before January 31, 2025, you will need to migrate your workloads to Microsoft Fabric or an equivalent Azure product to avoid abrupt termination of your workloads. The remaining clusters on your subscription will be stopped and removed from the host.

Only basic support will be available until the retirement date.

Important

This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include more legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, please submit a request on AskHDInsight with the details and follow us for more updates on Azure HDInsight Community.

Once the subscription prerequisites and resource prerequisites steps are complete and you have a cluster pool deployed, use the Azure portal to create an Apache Spark cluster in the cluster pool. You can then create a Jupyter Notebook and use it to run Spark SQL queries against Apache Hive tables.
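For example, after the cluster is created, a Jupyter Notebook cell can run Spark SQL directly. A minimal sketch, assuming a notebook where the `spark` session is preconfigured by the cluster and a hypothetical Hive table named `hivesampletable`:

```python
# In a Spark cluster notebook, a SparkSession named `spark` is typically
# preconfigured. `hivesampletable` is a hypothetical table name.
spark.sql("SHOW TABLES").show()

# Run a Spark SQL query against a Hive table and inspect the result.
df = spark.sql("SELECT * FROM hivesampletable LIMIT 10")
df.show()
```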

  1. In the Azure portal, type cluster pools, and select cluster pools to go to the cluster pools page. On the cluster pools page, select the cluster pool to which you want to add a new Spark cluster.

  2. On the specific cluster pool page, click + New cluster.

    Screenshot showing how to create a new Spark cluster.

    This step opens the cluster create page.

    Screenshot showing create cluster basic page.

    | Property | Description |
    |---|---|
    | Subscription | The Azure subscription that was registered for use with HDInsight on AKS in the Prerequisites section will be prepopulated. |
    | Resource Group | The same resource group as the cluster pool will be prepopulated. |
    | Region | The same region as the cluster pool and virtual network will be prepopulated. |
    | Cluster pool | The name of the cluster pool will be prepopulated. |
    | HDInsight Pool version | The cluster pool version will be prepopulated from the pool creation selection. |
    | HDInsight on AKS version | Specify the HDInsight on AKS version. |
    | Cluster type | From the drop-down list, select Spark. |
    | Cluster Version | Select the image version to use. |
    | Cluster name | Enter the name of the new cluster. |
    | User-assigned managed identity | Select the user-assigned managed identity that works as a connection string with the storage. |
    | Storage Account | Select the precreated storage account to be used as the primary storage for the cluster. |
    | Container name | Select the container name (unique) if precreated, or create a new container. |
    | Hive Catalog (optional) | Select the precreated Hive metastore (Azure SQL DB). |
    | SQL Database for Hive | From the drop-down list, select the SQL database in which to add hive-metastore tables. |
    | SQL admin username | Enter the SQL admin username. |
    | Key vault | From the drop-down list, select the key vault that contains a secret with the password for the SQL admin username. |
    | SQL password secret name | Enter the secret name from the key vault where the SQL DB password is stored. |

    Note

    • Currently, HDInsight supports only MS SQL Server databases.
    • Due to a Hive limitation, the "-" (hyphen) character is not supported in the metastore database name.
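    Before you create the cluster, you can optionally verify the Key Vault secret and the metastore database name from Python. A minimal sketch using the azure-identity and azure-keyvault-secrets packages; the vault URL, secret name, and database name below are placeholder assumptions:

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # Placeholder values; replace with your own vault URL and secret name.
    vault_url = "https://<your-key-vault>.vault.azure.net"
    secret_name = "sql-admin-password"

    # Fetch the secret to confirm it exists and is readable.
    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    secret = client.get_secret(secret_name)
    print(f"Found secret '{secret.name}' (value not printed)")

    # Hive limitation: the metastore database name must not contain a hyphen.
    metastore_db = "hivemetastoredb"  # hypothetical name
    assert "-" not in metastore_db, "Hyphens aren't supported in the metastore database name"
    ```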
  3. Select Next: Configuration + pricing to continue.

    Screenshot showing pricing tab 1.

    Screenshot showing pricing tab 2.

    Screenshot showing ssh tab.

    | Property | Description |
    |---|---|
    | Node size | Select the node size to use for the Spark nodes. |
    | Number of worker nodes | Select the number of nodes for the Spark cluster. Of those, three nodes are reserved for coordinator and system services; the remaining nodes are dedicated to Spark workers, one worker per node. For example, a five-node cluster has two workers (see the sketch after this table). |
    | Autoscale | Click the toggle button to enable Autoscale. |
    | Autoscale Type | Select either load-based or schedule-based autoscale. |
    | Graceful decommission timeout | Specify the graceful decommission timeout. |
    | Number of default worker nodes | Select the number of nodes for autoscale. |
    | Time Zone | Select the time zone. |
    | Autoscale Rules | Select the day, start time, end time, and number of worker nodes. |
    | Enable SSH | If enabled, allows you to define the prefix and number of SSH nodes. |
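    The worker count follows directly from the reservation rule: three nodes are held back for the coordinator and system services, and every remaining node runs one worker. An illustrative sketch (the helper function is hypothetical, not part of any SDK):

    ```python
    RESERVED_NODES = 3  # coordinator and system services

    def spark_worker_count(total_nodes: int) -> int:
        """Number of Spark workers in a cluster of the given total size."""
        if total_nodes <= RESERVED_NODES:
            raise ValueError(f"The cluster needs more than {RESERVED_NODES} nodes")
        return total_nodes - RESERVED_NODES

    print(spark_worker_count(5))  # 2, matching the five-node example above
    ```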
  4. Click Next: Integrations to enable and select Log Analytics for logging.

    Azure Prometheus for monitoring and metrics can be enabled after cluster creation.

    Screenshot showing integration tab.
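    Once Log Analytics is enabled and the cluster is emitting logs, you can query them from Python as well as from the portal. A minimal sketch using the azure-monitor-query package; the workspace ID is a placeholder and the table name in the query is hypothetical:

    ```python
    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient

    client = LogsQueryClient(DefaultAzureCredential())

    # Placeholder workspace ID; the KQL table name below is hypothetical.
    workspace_id = "<log-analytics-workspace-id>"
    query = "HDInsightSparkLogs | take 10"

    response = client.query_workspace(workspace_id, query, timespan=timedelta(hours=1))
    for table in response.tables:
        for row in table.rows:
            print(row)
    ```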

  5. Click Next: Tags to continue to the next page.

    Screenshot showing tags tab.

  6. On the Tags page, enter any tags you wish to add to your resource.

    | Property | Description |
    |---|---|
    | Name | Optional. Enter a name such as HDInsight on AKS Private Preview to easily identify all the resources associated with your resources. |
    | Value | Leave this blank. |
    | Resource | Select All resources selected. |
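    The same tags can also be applied or updated after deployment. A minimal sketch using the azure-mgmt-resource package; the subscription ID, resource group, and tag values are placeholder assumptions:

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient
    from azure.mgmt.resource.resources.models import Tags, TagsPatchResource

    # Placeholder values; replace with your subscription and resource group.
    subscription_id = "<subscription-id>"
    scope = f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"

    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

    # Merge a tag into the existing tags at the given scope.
    patch = TagsPatchResource(
        operation="Merge",
        properties=Tags(tags={"Name": "HDInsight on AKS Private Preview"}),
    )
    client.tags.begin_update_at_scope(scope, patch).result()
    ```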
  7. Click Next: Review + create.

  8. On the Review + create page, look for the Validation succeeded message at the top of the page and then click Create.

  9. The Deployment is in progress page is displayed while the cluster is being created. It takes 5-10 minutes to create the cluster. Once the cluster is created, the Your deployment is complete message is displayed. If you navigate away from the page, you can check your Notifications for the status.
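    You can also check the deployment status outside the portal. A minimal sketch using the azure-mgmt-resource package; the subscription ID and resource group name are placeholder assumptions:

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    # Placeholder values; replace with your subscription and resource group.
    subscription_id = "<subscription-id>"
    resource_group = "<cluster-pool-resource-group>"

    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

    # List recent deployments in the resource group and print their states.
    for deployment in client.deployments.list_by_resource_group(resource_group):
        print(deployment.name, deployment.properties.provisioning_state)
    ```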

  10. Go to the cluster Overview page, where you can see the endpoint links.

    Screenshot showing cluster overview page.