Explicitly use Unity Catalog instead of the Hive metastore

George Alexiou 20 Reputation points Microsoft Employee
2023-03-22T09:34:22.52+00:00

I have a customer who is using Databricks with a Unity Catalog metastore. All outbound internet traffic is blocked by a firewall. When a job runs, Databricks tries to connect to consolidated-westeuropec2-prod-metastore-3.mysql.database.azure.com, and the connection fails because of the firewall. After the host was temporarily whitelisted for troubleshooting purposes, the job ran successfully.

However, the customer would ideally like to block all connections to the internet and is wondering whether this connection to the Hive metastore is mandatory, given that they are using Unity Catalog. Is there a way to prevent this connection and explicitly use the Unity Catalog metastore?

Thank you in advance for your help 🙂

Azure Databricks

Accepted answer
  1. BhargavaGunnam-MSFT 26,136 Reputation points Microsoft Employee
    2023-03-22T22:53:00.5+00:00

    Hello @George Alexiou ,

    Welcome to the MS Q&A platform.

    The connection to the Hive metastore is still required, even when Unity Catalog is in use. The Unity Catalog metastore stores the metadata of the tables it manages, but Databricks still connects to the workspace's Hive metastore as well. The MySQL database behind that endpoint is also used to store other metadata in addition to the Hive metastore.

    If the customer wants to block all connections to the internet, they can consider setting up a private endpoint for the Hive metastore. A private endpoint is a network interface that uses a private IP address from a VNet. With a private endpoint in place, the customer can ensure that all traffic to the metastore stays within their VNet and does not go over the internet.

    The following document describes how to set up a private endpoint for the Hive metastore:

    https://docs.databricks.com/data/metastores/external-hive-metastore.html#step-3-create-a-private-endpoint-for-the-hive-metastore

    The other option is to use Secure Cluster Connectivity (No Public IP / NPIP). With this option, Databricks cluster nodes are not assigned public IP addresses and the customer's virtual network has no open inbound ports, so the clusters are not directly exposed to the internet. Outbound traffic from the clusters is then controlled through the customer-managed virtual network and its firewall rules.

    https://learn.microsoft.com/en-us/azure/databricks/security/network/secure-cluster-connectivity
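Whichever option is chosen, it can help to confirm from a notebook on the cluster whether the metastore endpoint is reachable before running the job. The following is a minimal sketch, not an official Databricks check: the hostname is the one from the question, port 3306 is assumed because it is the standard MySQL port, and `can_connect` is a hypothetical helper name. A successful TCP connection only shows that the network path is open; it does not prove that the metastore itself is working.

```python
import socket

# Hostname taken from the question; 3306 is the standard MySQL port (an assumption here).
METASTORE_HOST = "consolidated-westeuropec2-prod-metastore-3.mysql.database.azure.com"
METASTORE_PORT = 3306


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    reachable = can_connect(METASTORE_HOST, METASTORE_PORT)
    print(f"{METASTORE_HOST}:{METASTORE_PORT} reachable: {reachable}")
```

If the check reports that the host is not reachable while the firewall rule or private endpoint is supposed to be in place, the DNS resolution and firewall configuration for that hostname are the first things to review.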

    I hope this helps. Please let us know if you have any further questions.

    If this answers your question, please consider selecting Accept answer and up-voting it, as this helps other community members find answers to similar questions.

    1 person found this answer helpful.