External Apache Hive metastore (legacy)

This article describes how to set up Azure Databricks clusters to connect to existing external Apache Hive metastores. It provides information about recommended metastore setup and cluster configuration requirements, followed by instructions for configuring clusters to connect to an external metastore. For Hive library versions included in Databricks Runtime, see the relevant Databricks Runtime version release notes.

Important

  • While SQL Server works as the underlying metastore database for Hive 2.0 and above, the examples throughout this article use Azure SQL Database.
  • For information about Hive metastore compatibility with HDInsight, see Use external metadata stores in Azure HDInsight.
  • If you use Azure Database for MySQL as an external metastore, you must change the value of the lower_case_table_names property from 1 (the default) to 2 in the server-side database configuration. For details, see Identifier Case Sensitivity.

Note

Using external metastores is a legacy data governance model. Databricks recommends that you upgrade to Unity Catalog. Unity Catalog simplifies security and governance of your data by providing a central place to administer and audit data access across multiple workspaces in your account. See What is Unity Catalog?.

Hive metastore setup

The metastore client running inside a cluster connects to your underlying metastore database directly using JDBC.

To test network connectivity from a cluster to the metastore, you can run the following command inside a notebook:

%sh
nc -vz <DNS name> <port>

where

  • <DNS name> is the server name of Azure SQL Database.
  • <port> is the port of the database.
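For example, with a hypothetical server named my-metastore-server and the default Azure SQL Database port 1433:

%sh
nc -vz my-metastore-server.database.windows.net 1433

If the cluster can reach the database, nc reports a successful connection; otherwise the command times out or reports an error.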

Cluster configurations

You must set two sets of configuration options to connect a cluster to an external metastore:

  • Spark options configure Spark with the Hive metastore version and the JARs for the metastore client.
  • Hive options configure the metastore client to connect to the external metastore.

Spark configuration options

Set spark.sql.hive.metastore.version to the version of your Hive metastore and spark.sql.hive.metastore.jars as follows:

  • Hive 0.13: do not set spark.sql.hive.metastore.jars.

    Note

    Hive 1.2.0 and 1.2.1 are not the built-in metastore on Databricks Runtime 7.0 and above. If you want to use Hive 1.2.0 or 1.2.1 with Databricks Runtime 7.0 and above, follow the procedure described in Download the metastore jars and point to them.

  • Hive 2.3.7 (Databricks Runtime 7.0 - 9.x) or Hive 2.3.9 (Databricks Runtime 10.0 and above): set spark.sql.hive.metastore.jars to builtin.

  • For all other Hive versions, Azure Databricks recommends that you download the metastore JARs and set the configuration spark.sql.hive.metastore.jars to point to the downloaded JARs using the procedure described in Download the metastore jars and point to them.
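For example, on Databricks Runtime 10.0 and above connecting to a Hive 2.3.9 metastore, the built-in client suffices (values shown for illustration):

spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars builtin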

Download the metastore jars and point to them

  1. Create a cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version set to match the version of your metastore.

  2. When the cluster is running, search the driver log and find a line like the following:

    17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
    

    The directory <path> is the location of downloaded JARs in the driver node of the cluster.

    Alternatively, you can run the following code in a Scala notebook to print the location of the downloaded JARs:

    import com.typesafe.config.ConfigFactory
    // The metastore JARs are downloaded to the JVM temporary directory on the driver.
    val path = ConfigFactory.load().getString("java.io.tmpdir")
    
    println(s"\nHive JARs are downloaded to the path: $path \n")
    
  3. Run %sh cp -r <path> /dbfs/hive_metastore_jar (replacing <path> with the directory from step 2) to copy this directory to a directory in DBFS root called hive_metastore_jar through the DBFS client in the driver node.

  4. Create an init script that copies /dbfs/hive_metastore_jar to the local filesystem of the node, making sure that the init script sleeps a few seconds before it accesses the DBFS client so that the client is ready. A sketch of such a script appears after this list.

  5. Set spark.sql.hive.metastore.jars to use this directory. If your init script copies /dbfs/hive_metastore_jar to /databricks/hive_metastore_jars/, set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/*. The location must include the trailing /*.

  6. Restart the cluster.
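As a sketch, an init script for step 4 might look like the following (the sleep duration and the target directory /databricks/hive_metastore_jars/ are assumptions to adapt to your setup):

#!/bin/bash
# Give the DBFS client on the node a few seconds to become ready.
sleep 10
# Copy the metastore JARs from DBFS root to the local filesystem of the node.
mkdir -p /databricks/hive_metastore_jars
cp -r /dbfs/hive_metastore_jar/* /databricks/hive_metastore_jars/

With this layout, set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/* as described in step 5.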

Hive configuration options

This section describes options specific to Hive.

To connect to an external metastore using local mode, set the following Hive configuration options:

# JDBC connect string for a JDBC metastore
javax.jdo.option.ConnectionURL <mssql-connection-string>

# Username to use against metastore database
javax.jdo.option.ConnectionUserName <mssql-username>

# Password to use against metastore database
javax.jdo.option.ConnectionPassword <mssql-password>

# Driver class name for a JDBC metastore
javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver

where

  • <mssql-connection-string> is the JDBC connection string (which you can get in the Azure portal). You do not need to include the username and password in the connection string, because these are set by javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword.
  • <mssql-username> and <mssql-password> specify the username and password of your Azure SQL Database account that has read/write access to the database.
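For example, a connection string might look like the following (server and database names are hypothetical):

jdbc:sqlserver://my-metastore-server.database.windows.net:1433;database=hive_metastore;encrypt=true;loginTimeout=30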

Note

For production environments, we recommend that you set hive.metastore.schema.verification to true. This prevents the Hive metastore client from implicitly modifying the metastore database schema when the metastore client version does not match the metastore database version. When enabling this setting for metastore client versions lower than Hive 1.2.0, make sure that the metastore client has write permission to the metastore database (to prevent the issue described in HIVE-9749).

  • For Hive metastore 1.2.0 and higher, set hive.metastore.schema.verification.record.version to true to enable hive.metastore.schema.verification.
  • For Hive metastore 2.1.1 and higher, set hive.metastore.schema.verification.record.version to true as it is set to false by default.
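For example, to enable verification on a Hive 2.3.x metastore client, set both options (add the spark.hadoop prefix if you set them as cluster Spark configuration):

hive.metastore.schema.verification true
hive.metastore.schema.verification.record.version true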

Set up an external metastore using the UI

To set up an external metastore using the Azure Databricks UI:

  1. Click the Clusters button on the sidebar.

  2. Click Create Cluster.

  3. Enter the following Spark configuration options (a filled-in example appears after these steps):

    # Hive-specific configuration options.
    # spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
    # JDBC connect string for a JDBC metastore
    spark.hadoop.javax.jdo.option.ConnectionURL <mssql-connection-string>
    
    # Username to use against metastore database
    spark.hadoop.javax.jdo.option.ConnectionUserName <mssql-username>
    
    # Password to use against metastore database
    spark.hadoop.javax.jdo.option.ConnectionPassword <mssql-password>
    
    # Driver class name for a JDBC metastore
    spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
    
    # Spark specific configuration options
    spark.sql.hive.metastore.version <hive-version>
    # Skip this one if <hive-version> is 0.13.x.
    spark.sql.hive.metastore.jars <hive-jar-source>
    
  4. Continue your cluster configuration, following the instructions in Compute configuration reference.

  5. Click Create Cluster to create the cluster.
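For reference, a filled-in configuration for a hypothetical Hive 2.3.9 metastore might look like the following (the connection string, username, and password are placeholders to replace with your own values):

spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://my-metastore-server.database.windows.net:1433;database=hive_metastore
spark.hadoop.javax.jdo.option.ConnectionUserName hive_admin
spark.hadoop.javax.jdo.option.ConnectionPassword <mssql-password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars builtin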

Set up an external metastore using an init script

Init scripts let you connect to an existing Hive metastore without manually setting required configurations.

  1. Create the base directory you want to store the init script in if it does not exist. The following example uses dbfs:/databricks/scripts.
  2. Run the following snippet in a notebook. The snippet creates the init script /databricks/scripts/external-metastore.sh in Databricks File System (DBFS). Alternatively, you can use the DBFS REST API’s put operation to create the init script. Whenever a cluster configured with this init script starts, the script writes the required configuration options to a configuration file named 00-custom-spark.conf, in a JSON-like format, under /databricks/driver/conf/ inside every node of the cluster. Azure Databricks provides default Spark configurations in the /databricks/driver/conf/spark-branch.conf file. Configuration files in the /databricks/driver/conf directory apply in reverse alphabetical order. If you want to change the name of the 00-custom-spark.conf file, make sure that it continues to apply before the spark-branch.conf file.

Scala

dbutils.fs.put(
    "/databricks/scripts/external-metastore.sh",
    """#!/bin/sh
      |# Loads environment variables to determine the correct JDBC driver to use.
      |source /etc/environment
      |# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
      |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
      |[driver] {
      |    # Hive specific configuration options.
      |    # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
      |    # JDBC connect string for a JDBC metastore
      |    "spark.hadoop.javax.jdo.option.ConnectionURL" = "<mssql-connection-string>"
      |
      |    # Username to use against metastore database
      |    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mssql-username>"
      |
      |    # Password to use against metastore database
      |    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mssql-password>"
      |
      |    # Driver class name for a JDBC metastore
      |    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
      |
      |    # Spark specific configuration options
      |    "spark.sql.hive.metastore.version" = "<hive-version>"
      |    # Skip this one if <hive-version> is 0.13.x.
      |    "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
      |}
      |EOF
      |""".stripMargin,
    overwrite = true
)

Python

contents = """#!/bin/sh
# Loads environment variables to determine the correct JDBC driver to use.
source /etc/environment
# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
    # Hive specific configuration options.
    # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
    # JDBC connect string for a JDBC metastore
    "spark.hadoop.javax.jdo.option.ConnectionURL" = "<mssql-connection-string>"

    # Username to use against metastore database
    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mssql-username>"

    # Password to use against metastore database
    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mssql-password>"

    # Driver class name for a JDBC metastore
    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

    # Spark specific configuration options
    "spark.sql.hive.metastore.version" = "<hive-version>"
    # Skip this one if <hive-version> is 0.13.x.
    "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
}
EOF
"""

dbutils.fs.put(
    file = "/databricks/scripts/external-metastore.sh",
    contents = contents,
    overwrite = True
)
  3. Configure your cluster with the init script.
  4. Restart the cluster.

Troubleshooting

Clusters do not start (due to incorrect init script settings)

If an init script for setting up the external metastore causes cluster creation to fail, configure the init script to log, and debug the init script using the logs.

Error in SQL statement: InvocationTargetException

  • Error message pattern in the full exception stack trace:

    Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = [...]
    

    External metastore JDBC connection information is misconfigured. Verify the configured hostname, port, username, password, and JDBC driver class name. Also, make sure that the username has the right privilege to access the metastore database.

  • Error message pattern in the full exception stack trace:

    Required table missing : "`DBS`" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. [...]
    

    External metastore database not properly initialized. Verify that you created the metastore database and put the correct database name in the JDBC connection string. Then, start a new cluster with the following two Spark configuration options:

    datanucleus.schema.autoCreateTables true
    datanucleus.fixedDatastore false
    

    In this way, the Hive client library tries to create and initialize tables in the metastore database automatically when it attempts to access them and finds them absent.

Error in SQL statement: AnalysisException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetastoreClient

Error message pattern in the full exception stack trace:

The specified datastore driver (driver name) was not found in the CLASSPATH

The cluster is configured to use an incorrect JDBC driver.

Setting datanucleus.autoCreateSchema to true doesn’t work as expected

By default, Databricks also sets datanucleus.fixedDatastore to true, which prevents accidental structural changes to the metastore database. Therefore, the Hive client library cannot create metastore tables even if you set datanucleus.autoCreateSchema to true. In general, this strategy is safer for production environments because it prevents the metastore database from being accidentally upgraded.

If you do want to use datanucleus.autoCreateSchema to help initialize the metastore database, make sure you set datanucleus.fixedDatastore to false. Also, consider flipping both flags back after initializing the metastore database to provide better protection for your production environment.
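A sketch of the two phases as Spark configuration options (values shown for illustration):

# Phase 1: initialize the metastore database, then run the cluster with:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false

# Phase 2: once the schema exists, flip both flags to lock it down:
datanucleus.autoCreateSchema false
datanucleus.fixedDatastore true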