In this four-part tutorial series, use Python to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services to cluster customer data.
In part one of this series, set up the prerequisites for the tutorial and then restore a sample dataset to a database. Later in this series, use this data to train and deploy a clustering model in Python with SQL machine learning.
In parts two and three of this series, develop some Python scripts in an Azure Data Studio notebook to analyze and prepare your data and train a machine learning model. Then, in part four, run those Python scripts inside a database using stored procedures.
Clustering organizes data into groups whose members are similar in some way. For this tutorial series, imagine you own a retail business. Use the K-Means algorithm to cluster customers in a dataset of product purchases and returns. Clustering customers lets you focus your marketing efforts by targeting specific groups. K-Means clustering is an unsupervised learning algorithm that looks for patterns in data based on similarities.
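To make the idea concrete, here is a minimal pure-Python sketch of the K-Means loop (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is an illustration only, not the code used later in this series: the `kmeans` helper, the deterministic "first k points" initialization, and the tiny `customers` list of (orders, returns) pairs are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny K-Means sketch: returns (centroids, labels)."""
    # Illustrative, deterministic start: use the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical customers: (orders placed, items returned)
customers = [(1, 0), (2, 1), (1, 1), (20, 9), (22, 10), (21, 8)]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

In this toy data the low-volume and high-volume customers end up in separate groups, which is the kind of segmentation a marketing campaign could target.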
In this article, learn how to:
Restore a sample database
In part two, learn how to prepare the data from a database to perform clustering.
In part three, learn how to create and train a K-Means clustering model in Python.
In part four, learn how to create a stored procedure in a database that can perform clustering in Python based on new data.
Additional Python packages - The examples in this tutorial series use Python packages that you might or might not have installed.
Open an Administrative Command Prompt and change to the installation path for the version of Python you use in Azure Data Studio. For example, cd %LocalAppData%\Programs\Python\Python37-32. Then run the following commands to install any of these packages that aren't already installed. Ensure these packages are installed in the correct Python installation location. You can use the pip option -t to specify the destination directory.
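Before installing anything, it can help to check which packages the target interpreter already has. The following is a small sketch using only the standard library. The package names in `REQUIRED` are assumptions chosen for illustration, and `missing_packages` is a hypothetical helper; substitute the packages your own scripts import.

```python
import importlib.util

# Assumed mapping of import name -> pip name, for illustration only.
REQUIRED = {"pandas": "pandas", "sklearn": "scikit-learn", "pyodbc": "pyodbc"}

def missing_packages(required):
    """Return the pip names whose import name this interpreter cannot find."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]

# Print a pip command for each package that is not yet installed.
for name in missing_packages(REQUIRED):
    print(f"pip install {name}")
```

Run the script with the same Python executable that Azure Data Studio uses, so the check reflects the installation location you intend to install into.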
Run the following icacls commands to grant READ & EXECUTE access to the installed libraries to SQL Server Launchpad Service and SID S-1-15-2-1 (ALL_APPLICATION_PACKAGES).
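As a rough sketch of what such commands look like, the snippet below only builds the icacls argument lists and prints them; it does not execute anything. The `icacls_grant_commands` helper and the example path are hypothetical, and the account name `NT Service\MSSQLLAUNCHPAD` is an assumption that applies to a default SQL Server instance, so adjust both for your environment. `(OI)(CI)RX` requests inheritable read and execute rights, and `/T` applies the change recursively.

```python
def icacls_grant_commands(site_packages, launchpad_account="NT Service\\MSSQLLAUNCHPAD"):
    """Build (not run) icacls commands granting read & execute on a folder."""
    rights = "(OI)(CI)RX"  # object inherit, container inherit, read & execute
    # The leading * tells icacls that the value is a SID rather than a name.
    principals = [launchpad_account, "*S-1-15-2-1"]
    return [
        ["icacls", site_packages, "/grant", f"{principal}:{rights}", "/T"]
        for principal in principals
    ]

# Hypothetical site-packages path; replace with your Python installation's path.
for command in icacls_grant_commands(r"C:\Python\Lib\site-packages"):
    print(command)
```

Each list can be passed to `subprocess.run` from an elevated prompt on Windows, or its parts can be typed directly at the command prompt.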
The sample dataset used in this tutorial has been saved to a .bak database backup file for you to download and use. This dataset is derived from the TPCx-BB dataset provided by the Transaction Processing Performance Council (TPC).