Use Apache Sqoop to import and export data between Apache Hadoop on HDInsight and Azure SQL Database

2024-09-06

Learn how to use Apache Sqoop to import and export between an Apache Hadoop cluster in Azure HDInsight and Azure SQL Database or Microsoft SQL Server. The steps in this document use the sqoop command directly from the headnode of the Hadoop cluster. You use SSH to connect to the head node and run the commands in this document. This article is a continuation of Use Apache Sqoop with Hadoop in HDInsight.

Prerequisites

Completion of Set up test environment from Use Apache Sqoop with Hadoop in HDInsight.
An SSH client. For more information, see Connect to HDInsight (Apache Hadoop) using SSH.
Familiarity with Sqoop. For more information, see Sqoop User Guide.

Set up

Use ssh command to connect to your cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:
```
ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
```

For ease of use, set variables. Replace PASSWORD, MYSQLSERVER, and MYDATABASE with the relevant values, and then enter the commands below:

export PASSWORD='PASSWORD'
export SQL_SERVER="MYSQLSERVER"
export DATABASE="MYDATABASE"


export SERVER_CONNECT="jdbc:sqlserver://$SQL_SERVER.database.windows.net:1433;user=sqluser;password=$PASSWORD"
export SERVER_DB_CONNECT="jdbc:sqlserver://$SQL_SERVER.database.windows.net:1433;user=sqluser;password=$PASSWORD;database=$DATABASE"

Sqoop export

From Hive to SQL.

To verify that Sqoop can see your database, enter the command below in your open SSH connection. This command returns a list of databases.
```
sqoop list-databases --connect $SERVER_CONNECT
```
Enter the following command to see a list of tables for the specified database:
```
sqoop list-tables --connect $SERVER_DB_CONNECT
```
To export data from the Hive hivesampletable table to the mobiledata table in your database, enter the command below in your open SSH connection:
```
sqoop export --connect $SERVER_DB_CONNECT \
-table mobiledata \
--hcatalog-table hivesampletable
```

To verify that data was exported, use the following queries from your SSH connection to view the exported data:

sqoop eval --connect $SERVER_DB_CONNECT \
--query "SELECT COUNT(*) from dbo.mobiledata WITH (NOLOCK)"


sqoop eval --connect $SERVER_DB_CONNECT \
--query "SELECT TOP(10) * from dbo.mobiledata WITH (NOLOCK)"

Sqoop import

From SQL to Azure storage.

Enter the command below in your open SSH connection to import data from the mobiledata table in SQL, to the wasbs:///tutorials/usesqoop/importeddata directory on HDInsight. The fields in the data are separated by a tab character, and the lines are terminated by a new-line character.
```
sqoop import --connect $SERVER_DB_CONNECT \
--table mobiledata \
--target-dir 'wasb:///tutorials/usesqoop/importeddata' \
--fields-terminated-by '\t' \
--lines-terminated-by '\n' -m 1
```

Alternatively, you can also specify a Hive table:

sqoop import --connect $SERVER_DB_CONNECT \
--table mobiledata \
--target-dir 'wasb:///tutorials/usesqoop/importeddata2' \
--fields-terminated-by '\t' \
--lines-terminated-by '\n' \
--create-hive-table \
--hive-table mobiledata_imported2 \
--hive-import -m 1

Once the import has completed, enter the following command in your open SSH connection to list out the data in the new directory:
```
hadoop fs -tail /tutorials/usesqoop/importeddata/part-m-00000
```

Use beeline to verify that the table has been created in Hive.

Connect

beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'

Execute each query below one at a time and review the output:

show tables;
describe mobiledata_imported2;
SELECT COUNT(*) FROM mobiledata_imported2;
SELECT * FROM mobiledata_imported2 LIMIT 10;

Exit beeline with !exit.

Limitations

Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to SQL doesn't support bulk inserts.
Batching - With Linux-based HDInsight, When using the -batch switch when performing inserts, Sqoop makes multiple inserts instead of batching the insert operations.

Important considerations

Both HDInsight and SQL Server must be on the same Azure Virtual Network.

For an example, see How to connect HDInsight to your on-premises network document.

For more information on using HDInsight with an Azure Virtual Network, see how to extend HDInsight with Azure Virtual Network document. For more information on Azure Virtual Network, see the Virtual Network Overview document.
SQL Server must be configured to allow SQL authentication. For more information, see the Choose an Authentication Mode document.
You may have to configure SQL Server to accept remote connections.

Next steps

Now you've learned how to use Sqoop. To learn more, see:

Use Apache Oozie with HDInsight: Use Sqoop action in an Oozie workflow.
Analyze flight delay data using HDInsight: Use Interactive Query to analyze flight delay data, and then use Sqoop to export data to a database in Azure.
Upload data to HDInsight: Find other methods for uploading data to HDInsight/Azure Blob storage.

Share via