Applies to:
SQL Server 2019 (15.x)
This article provides guidance about how to use command-line tools to run Spark jobs on SQL Server Big Data Clusters.
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
The examples use the following command-line tools:
- azdata
- the curl application, to perform REST API calls to Livy
This article provides examples of how to use command-line patterns to submit Spark applications to SQL Server Big Data Clusters.
The Azure Data CLI azdata bdc spark commands surface all capabilities of SQL Server Big Data Clusters Spark on the command line. This article focuses on job submission. But azdata bdc spark also supports interactive modes for Python, Scala, SQL, and R through the azdata bdc spark session command.
If you need direct integration with a REST API, use standard Livy calls to submit jobs. This article uses the curl command-line tool in the Livy examples to run the REST API calls. For a detailed example that shows how to interact with the Spark Livy endpoint by using Python code, see Use Spark from the Livy endpoint on GitHub.
This extract, transform, and load (ETL) application follows a common data engineering pattern. It loads tabular data from an Apache Hadoop Distributed File System (HDFS) landing-zone path and then writes it, in an optimized table format, to an HDFS processed-zone path.
Download the sample application's dataset. Then create the application by using PySpark, Spark Scala, or Spark SQL.
The following sections provide sample exercises for each solution. You'll run the application by using azdata or curl.
This example uses the following PySpark application. It's saved as a Python file named parquet_etl_sample.py on the local machine.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read clickstream_data from storage pool HDFS into a Spark data frame. Applies column renames.
df = spark.read.option("inferSchema", "true").csv('/securelake/landing/criteo/test.txt', sep='\t',
header=False).toDF("feat1","feat2","feat3","feat4","feat5","feat6","feat7","feat8",
"feat9","feat10","feat11","feat12","feat13","catfeat1","catfeat2","catfeat3","catfeat4",
"catfeat5","catfeat6","catfeat7","catfeat8","catfeat9","catfeat10","catfeat11","catfeat12",
"catfeat13","catfeat14","catfeat15","catfeat16","catfeat17","catfeat18","catfeat19",
"catfeat20","catfeat21","catfeat22","catfeat23","catfeat24","catfeat25","catfeat26")
# Print the data frame inferred schema
df.printSchema()
tot_rows = df.count()
print("Number of rows:", tot_rows)
# Drop the managed table if it already exists
spark.sql("DROP TABLE IF EXISTS dl_clickstream")
# Write the data frame to an HDFS managed table by using the Parquet format
df.write.format("parquet").mode("overwrite").saveAsTable("dl_clickstream")
print("Sample ETL pipeline completed")
Store the application in HDFS so the cluster can access it for execution. As a best practice, standardize and govern application locations within the cluster to streamline administration.
In this example use case, all ETL pipeline applications are stored on the hdfs:/apps/ETL-Pipelines path. The sample application is stored at hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py.
Run the following command to upload parquet_etl_sample.py from the local development or staging machine to the HDFS cluster.
azdata bdc hdfs cp --from-path parquet_etl_sample.py --to-path "hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py"
Use the following command to submit the application to SQL Server Big Data Clusters Spark for execution.
The azdata command runs the application by using commonly specified parameters. For complete parameter options for azdata bdc spark batch create, see azdata bdc spark.
This application requires the spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation configuration parameter, so the command uses the --config option. This setup shows how to pass configurations into the Spark session.
You can use the --config option to specify multiple configuration parameters. You could also specify them inside the application session by setting the configuration on the SparkSession object, as shown in the sketch after the submission examples below.
azdata bdc spark batch create -f hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py \
--config '{"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation":"true"}' \
-n MyETLPipelinePySpark --executor-count 2 --executor-cores 2 --executor-memory 1664m
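If you submit through the Livy REST endpoint instead of azdata, the equivalent request is a POST to /batches. The following Python sketch is an illustration rather than the article's own example; <LIVY_ENDPOINT>, <USER>, and <PASSWORD> are placeholders, just as in the curl commands later in this article, and the request body mirrors the azdata parameters above.
import json
import requests
# Placeholders, matching the curl examples in this article.
livy_endpoint = "<LIVY_ENDPOINT>"
auth = ("<USER>", "<PASSWORD>")
# Batch definition that mirrors the azdata submission above.
batch = {
    "file": "/apps/ETL-Pipelines/parquet_etl_sample.py",
    "name": "MyETLPipelinePySpark",
    "numExecutors": 2,
    "executorCores": 2,
    "executorMemory": "1664m",
    "conf": {"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation": "true"}
}
response = requests.post(
    livy_endpoint + "/batches",
    auth=auth,
    verify=False,  # Equivalent to the -k flag in the curl examples.
    headers={"Content-Type": "application/json"},
    data=json.dumps(batch)
)
print(response.json())  # The returned id is the batch ID used by the management commands below.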
Warning
The batch name, set with the --name (-n) parameter, must be unique each time you create a new batch.
The azdata bdc spark batch commands provide management actions for Spark batch jobs.
To list all running jobs, run the following command.
The azdata command:
azdata bdc spark batch list -o table
The curl command, using Livy:
curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches
To get information for a Spark batch with the given ID, run the following command. The batch id is returned from spark batch create.
The azdata command:
azdata bdc spark batch info --batch-id 0
The curl command, using Livy:
curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches/<BATCH_ID>
To get state information for a Spark batch with the given ID, run the following command.
The azdata command:
azdata bdc spark batch state --batch-id 0
The curl command, using Livy:
curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches/<BATCH_ID>/state
To get the logs for a Spark batch with the given ID, run the following command.
The azdata command:
azdata bdc spark batch log --batch-id 0
The curl command, using Livy:
curl -k -u <USER>:<PASSWORD> -X GET <LIVY_ENDPOINT>/batches/<BATCH_ID>/log
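The same Livy endpoints can be called from Python. As a rough sketch, assuming the same <LIVY_ENDPOINT>, <USER>, and <PASSWORD> placeholders as the curl examples above, the following polls a batch until it reaches a terminal state and then prints its log:
import time
import requests
# Placeholders, matching the curl examples above.
livy_endpoint = "<LIVY_ENDPOINT>"
auth = ("<USER>", "<PASSWORD>")
batch_id = 0
# Poll the batch state until Livy reports a terminal state.
while True:
    state = requests.get(livy_endpoint + "/batches/" + str(batch_id) + "/state",
                         auth=auth, verify=False).json()["state"]
    print("Batch state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)
# Fetch and print the batch log lines.
log = requests.get(livy_endpoint + "/batches/" + str(batch_id) + "/log",
                   auth=auth, verify=False).json()
print("\n".join(log.get("log", [])))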
For information about troubleshooting Spark code, see Troubleshoot a PySpark notebook.
Comprehensive Spark sample code is available at SQL Server Big Data Clusters Spark samples on GitHub.
For more information about SQL Server Big Data Clusters and related scenarios, see SQL Server Big Data Clusters.