Networking Issue on Azure HDInsight Spark Cluster with ESP

Christoph Kiefer 141 Reputation points
2020-08-24T09:42:23.157+00:00

Dear All

We are encountering a networking / DNS issue on our Azure HDInsight Spark cluster. The cluster is joined to our AAD (i.e., it's a cluster with ESP enabled).

The cluster is created automatically by a PowerShell runbook and an ARM template file. This is the last line of the runbook, to give you an idea:
New-AzureRmResourceGroupDeployment -Name ${clusterName}${deployTime} -ResourceGroupName $ResourceGroupName -TemplateUri $templateUri -TemplateParameterObject $parameters

This process works fine and provisions the cluster into our VNet. The VNet has a custom DNS setup.

Description of the issue:

First, run the command hostname -f on the primary headnode of the cluster. This returns something like hn0-prdupc.domain.onmicrosoft.com

Second, run the command cat /etc/hosts on the primary headnode:

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
10.250.0.23     hn0-prdupc.domain.onmicrosoft.com    hn0-prdupc.axa2i4dkt35e1ksvyarqcgbjjb.ax.internal.cloudapp.net     headnodehost    hn0-prdupc.domain.onmicrosoft.com.      hn0-prdupc      headnodehost.   # SlaveNodeManager
10.250.0.17     wn0-prdupc.domain.onmicrosoft.com wn0-prdupc wn0-prdupc.domain.onmicrosoft.com. wn0-prdupc.axa2i4dkt35e1ksvyarqcgbjjb.ax.internal.cloudapp.net
...

Third, run the command nslookup hn0-prdupc.domain.onmicrosoft.com

nslookup hn0-prdupc.domain.onmicrosoft.com
Server:         10.90.80.4
Address:        10.90.80.4#53

Name:   hn0-prdupc.domain.onmicrosoft.com
Address: 10.250.0.22

For some reason, the DNS lookup returns the wrong IP address (maybe one left over from a previous provisioning run?).
The address returned by nslookup (10.250.0.22) differs from the address in the /etc/hosts file (10.250.0.23) and from the output of ifconfig.
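
To make the comparison above repeatable, here is a minimal Python sketch that parses the text of /etc/hosts and the text printed by nslookup and reports whether they agree. The function names (parse_hosts, parse_nslookup, check) are made up for illustration, and the sketch assumes the nslookup output has the "Name: / Address:" layout shown above. It works on captured text on purpose: a plain socket lookup on the headnode itself would usually consult /etc/hosts first and could hide the stale DNS record.

```python
def parse_hosts(text):
    """Map every name in /etc/hosts-style text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Normalise: ignore a trailing dot and letter case.
            mapping.setdefault(name.rstrip(".").lower(), ip)
    return mapping


def parse_nslookup(output):
    """Return the answer addresses, i.e. 'Address:' lines after a 'Name:' line."""
    addresses = []
    seen_name = False
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Name:"):
            seen_name = True
        elif seen_name and line.startswith("Address:"):
            addresses.append(line.split(":", 1)[1].strip())
    return addresses


def check(fqdn, hosts_text, nslookup_output):
    """Compare the /etc/hosts entry for fqdn with the DNS answer."""
    hosts_ip = parse_hosts(hosts_text).get(fqdn.rstrip(".").lower())
    dns_ips = parse_nslookup(nslookup_output)
    return {
        "fqdn": fqdn,
        "hosts_ip": hosts_ip,
        "dns_ips": dns_ips,
        "consistent": hosts_ip is not None and hosts_ip in dns_ips,
    }
```

Fed with the two outputs above, check("hn0-prdupc.domain.onmicrosoft.com", hosts_text, nslookup_output) reports hosts_ip 10.250.0.23, dns_ips ["10.250.0.22"], and consistent False.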

I am not an expert, but how is that supposed to work when clusters are created? How / When are DNS entries supposed to be updated in that whole provisioning process? Where / How shall we start to tackle that issue?

Any help / pointers / references to resolve this issue are highly appreciated.

Christoph

Azure HDInsight

2 answers

  1. PRADEEPCHEEKATLA-MSFT 77,751 Reputation points Microsoft Employee
    2020-08-31T04:23:01.807+00:00

    Hello @Christoph Kiefer ,

    There is a DNS server in AAD-DS, and every domain-joined VM has an entry there. It is possible that a previous deployment of a cluster with the same <clustername> left DNS entries behind that were not cleaned up properly from AAD-DS DNS. So, I would suggest relying on /etc/hosts in this case. You can also look at the NIC resource created for the head node in your resource group to determine the right IP address.

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Accept Answer" and Upvote on the post that helps you, this can be beneficial to other community members.


  2. Christoph Kiefer 141 Reputation points
    2020-08-31T06:40:38.887+00:00

    Hi @PRADEEPCHEEKATLA-MSFT

    Thx for the reply.
    Unfortunately, that's not an acceptable answer. First, there's still an issue with other VMs in the same VNet that need to connect to this cluster's headnode, for instance to send curl calls to the Ambari REST API. This call fails because of the wrong DNS entries:

     curl -v -u : --negotiate -k "http://hn0-prdupc.domain.onmicrosoft.com:8080/api/v1/clusters/prdupcchbisp1/hosts"  
    

    This call by IP address fails as well, I suspect because Kerberos depends on correct DNS entries:

    curl -v -u : --negotiate -k "http://10.250.0.23:8080/api/v1/clusters/prdupcchbisp1/hosts"  
    

    This attempt to connect to the Spark cluster through Livy from RStudio (using Kerberos) also fails:

    sc <- sparklyr::spark_connect(master = "hn0-prdupc.domain.onmicrosoft.com:8998"  
                                  , method = "livy"  
                                  , config = sparklyr::livy_config(negotiate = TRUE)  
                                  , version = "2.3.0")  
    

    Second, it's just wrong imho not to update DNS entries properly when a cluster is re-created with the same cluster name (which I think is a natural thing to do).

    Side note: it's questionable why Microsoft does not offer cluster suspension and only offers cluster deletion (in contrast to Databricks clusters), but that's a different story.

    So why are DNS entries not cleaned up properly from our own custom DNS? How can we tackle that? Should that work normally without customers doing anything?

    Thx a lot
    Christoph