Migrate default Hive metastore DB to external metastore DB

This article shows how to migrate metadata from the default Hive metastore DB to an external SQL database in HDInsight.

Why migrate to external metastore DB

  • The default metastore DB is limited to the Basic SKU and can't handle production-scale workloads.

  • An external metastore DB enables customers to horizontally scale Hive compute resources by adding new HDInsight clusters that share the same metastore DB.

  • For HDInsight 3.6 to 4.0 migration, it's mandatory to migrate metadata to an external metastore DB before upgrading the Hive schema version. See migrating workloads from HDInsight 3.6 to HDInsight 4.0.

The default metastore DB has limited compute capacity; hence, we recommend keeping utilization from other jobs on the cluster low while migrating metadata.

Source and target clusters must use the same HDInsight version and the same Storage Accounts. If upgrading HDInsight versions from 3.6 to 4.0, complete the steps in this article first. Then, follow the official upgrade steps here.

Prerequisites

Run the following script action to make Hive table locations portable to other clusters. See Script action to a running cluster.

The action is similar to replacing symlinks with their full paths.

Property         Value
Bash script URI  https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-adl-expand-location-v01.sh
Node types       Head
Parameters       ""
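
If you'd rather submit the script action from a shell than from the Azure portal, the following is a minimal sketch using the Azure CLI; the resource group, cluster name, and action name are placeholders for your own values.

# Submit the location-expansion script action to the head nodes.
# RESOURCE_GROUP and CLUSTER_NAME are placeholders; replace with your own.
az hdinsight script-action execute \
  --resource-group RESOURCE_GROUP \
  --cluster-name CLUSTER_NAME \
  --name hive-expand-locations \
  --script-uri "https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-adl-expand-location-v01.sh" \
  --roles headnode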

Migrate with Export/Import using sqlpackage

Only HDInsight clusters created after 2020-10-15 support Export/Import of the default Hive metastore DB by using sqlpackage.

  1. Install sqlpackage to the cluster.

  2. Export the default metastore DB to BACPAC file by executing the following command.

    wget "https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive_metastore_tool.py"
    SQLPACKAGE_FILE='/home/sshuser/sqlpackage/sqlpackage'  # replace with sqlpackage location
    TARGET_FILE='hive.bacpac'
    sudo python hive_metastore_tool.py --sqlpackagefile "$SQLPACKAGE_FILE" --targetfile "$TARGET_FILE"
    
  3. Save the BACPAC file.

    hdfs dfs -mkdir -p /bacpacs
    hdfs dfs -put $TARGET_FILE /bacpacs/
    
  4. Import the BACPAC file to a new database with steps listed here. A command-line sketch is shown after this list.

  5. The new database is ready to be configured as external metastore DB on a new HDInsight cluster.
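
The linked steps for step 4 use the Azure portal. As an alternative, here's a minimal command-line sketch of the import, assuming the sqlpackage binary installed in step 1 and that the logical SQL server name, database name, and credentials below are placeholders you replace with your own.

# Fetch the BACPAC saved in step 3, then import it into a new Azure SQL database.
# SQLPACKAGE_FILE reuses the variable from step 2; server, database, and
# credentials below are placeholders.
hdfs dfs -get /bacpacs/hive.bacpac .
"$SQLPACKAGE_FILE" /Action:Import \
  /SourceFile:hive.bacpac \
  /TargetServerName:yourserver.database.windows.net \
  /TargetDatabaseName:hivemetastoredb \
  /TargetUser:sqladmin \
  /TargetPassword:yourpassword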

Migrate using Hive script

Clusters created before 2020-10-15 don't support export/import of the default metastore DB.

For such clusters, follow the guide Copy Hive tables across Storage Accounts, using a second cluster with an external Hive metastore DB. The second cluster can use the same storage account but must use a new default filesystem.

Option to "shallow" copy

Storage consumption doubles when tables are "deep" copied using the guide, and you need to manually clean the data in the source storage container. Instead, you can "shallow" copy the tables if they're nontransactional. All Hive tables in HDInsight 3.6 are nontransactional by default, but only external tables are nontransactional in HDInsight 4.0. Transactional tables must be deep copied. Follow these steps to shallow copy nontransactional tables:

  1. Execute script hive-ddls.sh on the source cluster's primary headnode to generate the DDL for every Hive table.
  2. The DDL is written to a local Hive script named /tmp/hdi_hive_ddls.hql. Execute this script on the target cluster that uses an external Hive metastore DB; see the sketch after this list.
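
As a sketch of step 2, assuming /tmp/hdi_hive_ddls.hql has been copied to a headnode of the target cluster, the script can be replayed through Beeline over the standard HDInsight HiveServer2 HTTP endpoint:

# Replay the generated DDL on the target cluster.
# headnodehost resolves to the active headnode from inside the cluster.
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
  -f /tmp/hdi_hive_ddls.hql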

Verify that all Hive tables are imported

The following command uses a SQL query on the metastore DB to print all Hive tables and their data locations. Compare outputs between new and old clusters to verify that no tables are missing in the new metastore DB.

SCRIPT_FNAME='hive_metastore_tool.py'
SCRIPT="/tmp/$SCRIPT_FNAME"
wget -O "$SCRIPT" "https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/$SCRIPT_FNAME"
OUTPUT_FILE='/tmp/hivetables.csv'
QUERY="SELECT DBS.NAME, TBLS.TBL_NAME, SDS.LOCATION FROM SDS, TBLS, DBS WHERE TBLS.SD_ID = SDS.SD_ID AND TBLS.DB_ID = DBS.DB_ID ORDER BY DBS.NAME, TBLS.TBL_NAME ASC;"
sudo python "$SCRIPT" --query "$QUERY" > $OUTPUT_FILE

Further reading