Introducing Spark Machine Learning on SQL Server Big Data Clusters

Article
11/18/2022

Applies to: SQL Server 2019 (15.x)

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

This article explains how to effectively use Spark for Machine Learning on SQL Server Big Data Clusters.

Spark Machine Learning in SQL Server Big Data Clusters

SQL Server Big Data Clusters enables machine learning scenarios and solutions using two technology stacks: SQL Server Machine Learning Services and Apache Spark ML.

To better understand when to use each technology stack, refer to the Machine Learning guide for SQL Server Big Data Clusters. This article covers Apache Spark ML.

For big data-based machine learning scenarios, hosting the data in HDFS and processing it with Apache Spark ML is a more cost-effective, scalable, and powerful approach. Spark machine learning can do far more than this article covers; for a complete list of features, see Spark MLlib.
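As a rough illustration of this approach, the following PySpark sketch trains a simple classifier on a CSV file stored in HDFS. The helper names, the file path, the column names, and the choice of logistic regression are assumptions made for the example, not part of this article. `spark` is the session that a notebook attached to the cluster normally provides.

```python
from typing import List


def feature_columns(all_columns: List[str], label_col: str) -> List[str]:
    """Return every column except the label, preserving order."""
    if label_col not in all_columns:
        raise ValueError(f"label column {label_col!r} not found")
    return [c for c in all_columns if c != label_col]


def train_classifier(spark, data_path: str, label_col: str):
    """Fit an assembler + logistic regression pipeline on a CSV file in HDFS.

    Assumes all feature columns and the label column are numeric.
    Returns the fitted pipeline model and predictions on a held-out split.
    """
    # Imported here so feature_columns() can be used without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    df = spark.read.csv(data_path, header=True, inferSchema=True)
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label_col),
        outputCol="features",
    )
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)

    model = Pipeline(stages=[assembler, lr]).fit(train)
    return model, model.transform(test)


# Hypothetical usage inside a notebook (path and label name are placeholders):
# model, predictions = train_classifier(spark, "/data/example.csv", "label")
```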

The next section provides a curated list of scenarios and references for Spark in SQL Server Big Data Clusters.

Building blocks for Spark Machine Learning on SQL Server Big Data Clusters

Learn	Contents	Link
SQL Server Big Data Clusters runtime for Apache Spark	Lists what's included with each release	SQL Server Big Data Clusters runtime for Apache Spark Guide
The Storage Pool	How to store and use HDFS + Spark together to unlock data for machine learning	Introducing the storage pool in SQL Server Big Data Clusters
Use notebook-based experiences and your tools of choice	Connect to the Spark-Livy endpoint using your tools of choice	Submit Spark jobs on SQL Server Big Data Clusters in Azure Data Studio; Submit Spark jobs on SQL Server big data cluster in Visual Studio Code; Use sparklyr in SQL Server big data cluster
How to install extra packages	If a package isn't provided out of the box, install it	Spark library management
How to troubleshoot	Diagnose failing notebooks and Spark applications	Troubleshoot a `pyspark` notebook; Debug and Diagnose Spark Applications on SQL Server Big Data Clusters in Spark History Server
How to submit machine learning batch jobs	Run ML training and batch scoring from the command line	Submit Spark jobs by using command-line tools
How to quickly move data between SQL Server and Spark	Use SQL Server as the source, the destination, or both for your Spark ML scenarios. Using HDFS is not mandatory	Use the Apache Spark Connector for SQL Server and Azure SQL
Spark model operationalization	After training, operationalize using MLeap	Create, export, and score Spark machine learning models on SQL Server Big Data Clusters
Data wrangling	Along with Spark's powerful data wrangling capabilities, we ship PROSE	Data Wrangling using PROSE Code Accelerator
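
To make the data movement building block more concrete, here is a minimal sketch of reading from and writing to SQL Server with the Apache Spark Connector for SQL Server and Azure SQL. The helper names, host name, database, table, and credentials are placeholders invented for the example. Check the linked connector article for the options your release supports.

```python
from typing import Dict

# Data source name used by the Apache Spark Connector for SQL Server and Azure SQL.
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def connector_options(server: str, database: str, table: str,
                      user: str, password: str) -> Dict[str, str]:
    """Build the option map the connector expects (all values are caller-supplied)."""
    return {
        "url": f"jdbc:sqlserver://{server};databaseName={database};",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_table(spark, options: Dict[str, str]):
    """Load a SQL Server table into a Spark DataFrame."""
    return spark.read.format(CONNECTOR_FORMAT).options(**options).load()


def write_table(df, options: Dict[str, str], mode: str = "append") -> None:
    """Write a Spark DataFrame to a SQL Server table ('append' or 'overwrite')."""
    df.write.format(CONNECTOR_FORMAT).mode(mode).options(**options).save()


# Hypothetical usage inside a notebook (every value below is a placeholder):
# opts = connector_options("master-0.master-svc", "sales", "dbo.orders",
#                          "<user>", "<password>")
# orders = read_table(spark, opts)
# write_table(orders, {**opts, "dbtable": "dbo.orders_scored"}, mode="overwrite")
```

Keeping the option map in one helper means the same settings can be reused for both the read and the write, with only `dbtable` changed for the destination.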

Next steps

For more information, see Introducing SQL Server Big Data Clusters.
