Machine Learning guide for SQL Server Big Data Clusters
Applies to: SQL Server 2019 (15.x)
This article explains how to use SQL Server Big Data Clusters for machine learning scenarios.
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
Introduction to Machine Learning in SQL Server Big Data Clusters
SQL Server Big Data Clusters enable machine learning scenarios and solutions using two technology stacks: SQL Server Machine Learning Services and Apache Spark ML.
SQL Server Big Data Clusters offer machine learning capabilities inside the SQL Server engine through the established SQL Server Machine Learning Services technology stack. This enables high-performance, in-database machine learning inference and scoring.
For machine learning on big data, hosting the data in HDFS and using Apache Spark ML capabilities is the more cost-effective, scalable, and powerful option.
Machine Learning Scenarios
These machine learning capabilities enable applications and solutions such as fraud detection, forecasting, churn prediction, and general classification and regression tasks. Each technology stack suits different scenarios, so it is important to choose the one that best fits the workload. The following table compares them.
| Aspect | SQL Server Machine Learning Services | Apache Spark ML |
| --- | --- | --- |
| Data placement | Leverages tabular data locality on SQL Server. Premium data tier. | Scalable big data tier using HDFS; unstructured, semi-structured, and structured data. |
| Best for | Low-latency inference and scoring scenarios | 1. Distributed batch training and scoring of machine learning models on top of big data. 2. ETL sinks and large-scale data preparation and featurization for ML. |
| Feeds | ML-powered BI dashboards, reports, and applications where low latency is required | Batch-scored data may be promoted to SQL Server to drive ML-powered scenarios |
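
The guidance in the comparison above can be read as a simple decision rule. The following is a minimal sketch of that reading in Python, not part of the product or any official API. The `Workload` fields, the `Stack` enum, and the `recommend_stack` function are hypothetical names introduced only for illustration, and the rule itself is a simplification of the comparison.

```python
from dataclasses import dataclass
from enum import Enum


class Stack(Enum):
    """The two technology stacks compared in this article."""
    SQL_ML_SERVICES = "SQL Server Machine Learning Services"
    SPARK_ML = "Apache Spark ML"


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a machine learning workload."""
    needs_low_latency: bool    # e.g. scoring behind a dashboard, report, or application
    data_in_sql_tables: bool   # tabular data already stored in SQL Server
    distributed_batch: bool    # distributed batch training/scoring, or large-scale ETL and featurization


def recommend_stack(workload: Workload) -> Stack:
    """Return the stack suggested by the comparison (illustrative rule only)."""
    # Low-latency inference on data that already lives in SQL Server tables
    # benefits from in-database scoring.
    if workload.needs_low_latency and workload.data_in_sql_tables:
        return Stack.SQL_ML_SERVICES
    # Distributed batch work, or data hosted outside SQL Server (for example in HDFS),
    # points to Spark ML; batch-scored results may later be promoted to SQL Server.
    if workload.distributed_batch or not workload.data_in_sql_tables:
        return Stack.SPARK_ML
    # Remaining case: tabular data in SQL Server without batch-scale requirements.
    return Stack.SQL_ML_SERVICES


if __name__ == "__main__":
    dashboard_scoring = Workload(needs_low_latency=True, data_in_sql_tables=True, distributed_batch=False)
    nightly_training = Workload(needs_low_latency=False, data_in_sql_tables=False, distributed_batch=True)
    print(recommend_stack(dashboard_scoring).value)
    print(recommend_stack(nightly_training).value)
```

In practice the two stacks are often combined: a Spark ML batch job scores data at scale, and the results are promoted to SQL Server to feed low-latency applications.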