Use curl to load data into HDFS on SQL Server Big Data Clusters
Applies to: SQL Server 2019 (15.x)
This article explains how to use curl to load data into HDFS on SQL Server 2019 Big Data Clusters.
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
Prerequisites
Obtain the service external IP
WebHDFS is started when deployment is completed, and its access goes through Knox. The Knox endpoint is exposed through a Kubernetes service called gateway-svc-external. To create the necessary WebHDFS URL to upload/download files, you need the gateway-svc-external service external IP address and the name of your big data cluster. You can get the gateway-svc-external service external IP address by running the following command:
kubectl get service gateway-svc-external -n <big data cluster name> -o json | jq -r .status.loadBalancer.ingress[0].ip
Note
The <big data cluster name>
here is the name of the cluster that you specified in the deployment configuration file. The default name is mssql-cluster
.
Construct the URL to access WebHDFS
Now, you can construct the URL to access the WebHDFS as follows:
https://<gateway-svc-external service external IP address>:30443/gateway/default/webhdfs/v1/
For example:
https://13.66.190.205:30443/gateway/default/webhdfs/v1/
Authentication with Active Directory
For deployments with Active Directory, use the authentication parameter with curl
with Negotiate authentication.
To use curl
with Active Directory authentication, run this command:
kinit <username>
The command generates a Kerberos token for curl
to use. The commands demonstrated in the next sections specify the --anyauth
parameter for curl
. For URLs that require Negotiate authentication, curl
automatically detects and uses the generated Kerberos token instead of username and password to authenticate to the URLs.
List a file
To list file under hdfs:///product_review_data, use the following curl command:
curl -i -k --anyauth -u root:<AZDATA_PASSWORD> -X GET 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/?op=liststatus'
Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME
and AZDATA_PASSWORD
. Endpoints on clusters that are upgraded to CU 5 continue to use root
as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.
For endpoints that do not use root, use the following curl command:
curl -i -k --anyauth -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X GET 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/?op=liststatus'
Put a local file into HDFS
To put a new file test.csv from local directory to product_review_data directory, use the following curl command (the Content-Type parameter is required):
curl -i -L -k --anyauth -u root:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/test.csv?op=create' -H 'Content-Type: application/octet-stream' -T 'test.csv'
Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME
and AZDATA_PASSWORD
. Endpoints on clusters that are upgraded to CU 5 continue to use root
as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.
For endpoints that do not use root, use the following curl command:
curl -i -L -k --anyauth -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/test.csv?op=create' -H 'Content-Type: application/octet-stream' -T 'test.csv'
Create a directory
To create a directory test under hdfs:///
, use the following command:
curl -i -L -k --anyauth -u root:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/test?op=MKDIRS'
Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME
and AZDATA_PASSWORD
. Endpoints on clusters that are upgraded to CU 5 continue to use root
as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.
For endpoints that do not use root, use the following curl command:
curl -i -L -k --anyauth -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/test?op=MKDIRS'
Next steps
For more information, see Introducing SQL Server Big Data Clusters.