Tutorial: Train a machine learning model without code (deprecated)
You can enrich your data in Spark tables with new machine learning models that you train by using automated machine learning. In Azure Synapse Analytics, you can select a Spark table in the workspace as the training dataset and build machine learning models from it without writing code.
In this tutorial, you learn how to train machine learning models by using a code-free experience in Synapse Studio. Synapse Studio is a feature of Azure Synapse Analytics.
You'll use automated machine learning in Azure Machine Learning, instead of coding the experience manually. The type of model that you train depends on the problem you're trying to solve. For this tutorial, you'll use a regression model to predict taxi fares from the New York City taxi dataset.
If you don't have an Azure subscription, create a free account before you begin.
Warning
- Effective September 29, 2023, Azure Synapse will discontinue official support for Spark 2.4 runtimes. After September 29, 2023, we will not address any support tickets related to Spark 2.4, and there will be no release pipeline for bug or security fixes for Spark 2.4. Using Spark 2.4 after the support cutoff date is at your own risk. We strongly discourage its continued use because of potential security and functionality concerns.
- As part of the deprecation process for Apache Spark 2.4, we would like to notify you that AutoML in Azure Synapse Analytics will also be deprecated. This includes both the low code interface and the APIs used to create AutoML trials through code.
- Please note that AutoML functionality was exclusively available through the Spark 2.4 runtime.
- For customers who wish to continue leveraging AutoML capabilities, we recommend saving your data into your Azure Data Lake Storage Gen2 (ADLSg2) account. From there, you can seamlessly access the AutoML experience through Azure Machine Learning (AzureML). Further information regarding this workaround is available here.
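As a rough illustration of the workaround above, the following sketch builds the `abfss://` URI that addresses a folder in an ADLS Gen2 container. The helper name `adls_gen2_uri`, the account and container names, and the folder path are all placeholders, and the commented Spark call is only one possible way to export the table.

```python
def adls_gen2_uri(account: str, container: str, path: str) -> str:
    """Build an abfss:// URI for a path in an ADLS Gen2 container."""
    if not account or not container:
        raise ValueError("account and container are required")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Placeholder names; substitute your own storage account and container.
target = adls_gen2_uri("mystorageaccount", "mycontainer", "/automl/nyc_taxi")
print(target)

# In a Synapse notebook, where `spark` is predefined, the table could then be
# exported with something like:
#   spark.table("nyc_taxi").write.mode("overwrite").parquet(target)
```

Once the data is in the storage account, it can be registered as a dataset in Azure Machine Learning and used with automated machine learning there.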
Prerequisites
- An Azure Synapse Analytics workspace. Ensure that it has an Azure Data Lake Storage Gen2 storage account configured as the default storage. For the Data Lake Storage Gen2 file system that you work with, ensure that you're the Storage Blob Data Contributor.
- An Apache Spark pool (version 2.4) in your Azure Synapse Analytics workspace. For details, see Quickstart: Create a serverless Apache Spark pool using Synapse Studio.
- An Azure Machine Learning linked service in your Azure Synapse Analytics workspace. For details, see Quickstart: Create a new Azure Machine Learning linked service in Azure Synapse Analytics.
Sign in to the Azure portal
Sign in to the Azure portal.
Create a Spark table for the training dataset
For this tutorial, you need a Spark table. The following notebook creates one:
Download the notebook Create-Spark-Table-NYCTaxi-Data.ipynb.
Import the notebook to Synapse Studio.
Select the Spark pool that you want to use, and then select Run all. This step gets New York taxi data from the open dataset and saves the data to your default Spark database.
After the notebook run has completed, you see a new Spark table under the default Spark database. From Data, find the table named nyc_taxi.
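The notebook's contents aren't reproduced in this tutorial. As a minimal sketch of the final step such a notebook performs, the hypothetical helper below saves a Spark DataFrame as a table in the default database. It assumes only the standard PySpark `DataFrame.write` interface, and the name `save_as_spark_table` is illustrative rather than taken from the notebook.

```python
def save_as_spark_table(df, table_name: str = "nyc_taxi") -> None:
    """Persist a Spark DataFrame as a managed table in the default database."""
    # mode("overwrite") replaces an existing table of the same name, so the
    # step can be rerun without failing on a leftover table.
    df.write.mode("overwrite").saveAsTable(table_name)


# In a Synapse notebook, where `spark` is predefined and `taxi_df` is a
# DataFrame you have already loaded (both assumed here), usage might be:
#   save_as_spark_table(taxi_df)
#   spark.sql("SELECT COUNT(*) FROM nyc_taxi").show()
```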
Open the automated machine learning wizard
To open the wizard, right-click the Spark table that you created in the previous step. Then select Machine Learning > Train a new model.
Choose a model type
Select the machine learning model type for the experiment, based on the question you're trying to answer. Because the value you’re trying to predict is numeric (taxi fares), select Regression here. Then select Continue.
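The choice above can be summarized as a simple rule of thumb. The sketch below is a simplification for illustration, not the wizard's own logic: the function `suggest_model_type` and its parameters are hypothetical, and the three model types are the ones this tutorial names.

```python
from enum import Enum


class ModelType(Enum):
    CLASSIFICATION = "Classification"
    REGRESSION = "Regression"
    TIME_SERIES_FORECASTING = "Time series forecasting"


def suggest_model_type(target_is_numeric: bool,
                       predicts_future_values: bool = False) -> ModelType:
    """Suggest a model type from two coarse properties of the question."""
    if predicts_future_values:
        # Predicting future values of a time-ordered series.
        return ModelType.TIME_SERIES_FORECASTING
    # Numeric targets (such as taxi fares) suggest regression;
    # categorical targets suggest classification.
    return ModelType.REGRESSION if target_is_numeric else ModelType.CLASSIFICATION


# Taxi fares are numeric, so the rule of thumb points to regression.
print(suggest_model_type(target_is_numeric=True).value)
```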
Configure the experiment
Provide configuration details for creating an automated machine learning experiment run in Azure Machine Learning. This run trains multiple models. The best model from a successful run is registered in the Azure Machine Learning model registry.
Azure Machine Learning workspace: An Azure Machine Learning workspace is required for creating an automated machine learning experiment run. You also need to link your Azure Synapse Analytics workspace with the Azure Machine Learning workspace by using a linked service. After you've fulfilled all the prerequisites, you can specify the Azure Machine Learning workspace that you want to use for this automated run.
Experiment name: Specify the experiment name. When you submit an automated machine learning run, you provide an experiment name. Information for the run is stored under that experiment in the Azure Machine Learning workspace. This experience creates a new experiment by default and generates a proposed name, but you can also provide the name of an existing experiment.
Best model name: Specify the name of the best model from the automated run. The best model is given this name and saved in the Azure Machine Learning model registry automatically after this run. An automated machine learning run creates many machine learning models. Based on the primary metric that you select in a later step, those models can be compared and the best model can be selected.
Target column: This is what the model will be trained to predict. Choose the column in the dataset that contains the data you want to predict. For this tutorial, select the numeric column fareAmount as the target column.
Spark pool: Specify the Spark pool that you want to use for the automated experiment run. The computations are run on the pool that you specify.
Spark configuration details: In addition to the Spark pool, you have the option to provide session configuration details.
Select Continue.
Configure the model
Because you selected Regression as your model type earlier in the wizard, the following configurations are available (these are also available for the Classification model type):
Primary metric: Enter the metric that measures how well the model is doing. You use this metric to compare different models created in the automated run and determine which model performed best.
Training job time (hours): Specify the maximum amount of time, in hours, for an experiment to run and train models. Note that you can also provide values less than 1 (for example, 0.5).
Max concurrent iterations: Choose the maximum number of iterations that run in parallel.
ONNX model compatibility: If you enable this option, the models trained by automated machine learning are converted to the ONNX format. This is particularly relevant if you want to use the model for scoring in Azure Synapse Analytics SQL pools.
These settings all have a default value that you can customize.
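To make the settings above concrete, the following sketch collects them in a small data class, validates them, and maps them to keyword arguments. The keyword names and task strings follow the Azure Machine Learning SDK v1 `AutoMLConfig` class as best understood here; treat them as assumptions and confirm them against the notebook that the wizard generates. The class `ModelSettings`, its default values, and the example metric are illustrative and are not the wizard's defaults.

```python
from dataclasses import dataclass

# Wizard model type -> task string (assumed to match SDK v1 `task` values).
TASK_BY_MODEL_TYPE = {
    "Classification": "classification",
    "Regression": "regression",
    "Time series forecasting": "forecasting",
}


@dataclass
class ModelSettings:
    target_column: str
    primary_metric: str
    model_type: str = "Regression"
    training_job_hours: float = 1.0      # placeholder, not the wizard default
    max_concurrent_iterations: int = 2   # placeholder, not the wizard default
    onnx_compatible: bool = False

    def validate(self) -> None:
        if self.model_type not in TASK_BY_MODEL_TYPE:
            raise ValueError(f"unknown model type: {self.model_type!r}")
        if self.training_job_hours <= 0:
            raise ValueError("training job time must be greater than 0 hours")
        if self.max_concurrent_iterations < 1:
            raise ValueError("max concurrent iterations must be at least 1")
        # This tutorial notes that forecasting doesn't support ONNX compatibility.
        if self.model_type == "Time series forecasting" and self.onnx_compatible:
            raise ValueError("ONNX compatibility isn't supported for forecasting")

    def to_automl_kwargs(self) -> dict:
        """Return keyword arguments named after AutoMLConfig parameters (assumed)."""
        self.validate()
        return {
            "task": TASK_BY_MODEL_TYPE[self.model_type],
            "label_column_name": self.target_column,
            "primary_metric": self.primary_metric,
            "experiment_timeout_hours": self.training_job_hours,
            "max_concurrent_iterations": self.max_concurrent_iterations,
            "enable_onnx_compatible_models": self.onnx_compatible,
        }


settings = ModelSettings(
    target_column="fareAmount",
    primary_metric="normalized_root_mean_squared_error",  # example metric
    training_job_hours=0.5,  # values less than 1 are allowed
    onnx_compatible=True,
)
print(settings.to_automl_kwargs())
```

Validating before building the arguments surfaces an invalid combination, such as forecasting with ONNX compatibility, before any run is submitted.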
Start a run
After all the required configurations are done, you can start your automated run. To create a run directly, select Create run, which starts the run without code. Alternatively, if you prefer code, select Open in notebook, which opens a notebook containing the code that creates the run so you can view the code and start the run yourself.
Note
If you selected Time series forecasting as your model type, you must provide additional configuration settings. Forecasting also doesn't support ONNX model compatibility.
Create a run directly
To start your automated machine learning run directly, select Create Run. You see a notification that indicates the run is starting. Then you see another notification that indicates success. You can also check the status in Azure Machine Learning by selecting the link in the notification.
Create a run with a notebook
To generate a notebook, select Open In Notebook. This gives you an opportunity to add settings or otherwise modify the code for your automated machine learning run. When you're ready to run the code, select Run all.
Monitor the run
After you've successfully submitted the run, you see a link to the experiment run in the Azure Machine Learning workspace in the notebook output. Select the link to monitor your automated run in Azure Machine Learning.
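If you prefer to monitor from code rather than through the link, a generic polling loop is one option. The sketch below is not part of the generated notebook: `wait_for_run` and its parameters are hypothetical, and the terminal state names are assumptions based on common Azure Machine Learning run statuses. The example drives the loop with simulated statuses so that it runs without a workspace.

```python
import time
from typing import Callable

# Assumed terminal statuses for an Azure Machine Learning run.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_run(get_status: Callable[[], str],
                 poll_seconds: float = 30.0,
                 timeout_seconds: float = 3600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence; with the SDK v1 `Run` object, `run.get_status`
# could be passed as `get_status` instead (an assumption, not shown here).
simulated = iter(["Running", "Running", "Completed"])
final = wait_for_run(lambda: next(simulated), poll_seconds=0, sleep=lambda s: None)
print(final)
```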