How to use the ML.NET Automated Machine Learning (AutoML) API
In this article, you learn how to use the ML.NET Automated Machine Learning (AutoML) API.
Samples for the AutoML API can be found in the dotnet/machinelearning-samples repo.
Installation
To use the AutoML API, install the Microsoft.ML.AutoML
NuGet package in the .NET project you want to reference it in.
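If you use the .NET CLI, you can install the package with a command like the following (shown here as an example; you can also install it through the NuGet Package Manager in Visual Studio):
dotnet add package Microsoft.ML.AutoML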
Note
This guide uses version 0.20.0 or later of the Microsoft.ML.AutoML
NuGet package. Although samples and code from earlier versions still work, it's highly recommended that you use the APIs introduced in this version for new projects.
For more information, see the documentation on installing NuGet packages.
Quick Start
AutoML provides several defaults for quickly training machine learning models. In this section you'll learn how to:
- Load your data
- Define your pipeline
- Configure your experiment
- Run your experiment
- Use the best model to make predictions
Define your problem
Given a dataset stored in a comma-separated file called taxi-fare-train.csv that looks like the following:
| vendor_id | rate_code | passenger_count | trip_time_in_secs | trip_distance | payment_type | fare_amount |
|---|---|---|---|---|---|---|
| CMT | 1 | 1 | 1271 | 3.8 | CRD | 17.5 |
| CMT | 1 | 1 | 474 | 1.5 | CRD | 8 |
| CMT | 1 | 1 | 637 | 1.4 | CRD | 8.5 |
Load your data
Start by initializing your MLContext. MLContext is the starting point for all ML.NET operations. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.
Then, use the InferColumns method to infer information about the columns in your data.
// Initialize MLContext
MLContext ctx = new MLContext();
// Define data path
var dataPath = Path.GetFullPath(@"..\..\..\..\Data\taxi-fare-train.csv");
// Infer column information
ColumnInferenceResults columnInference =
ctx.Auto().InferColumns(dataPath, labelColumnName: "fare_amount", groupColumns: false);
InferColumns loads a few rows from the dataset. It then inspects the data and tries to guess or infer the data type for each of the columns based on their content.
The default behavior is to group columns of the same type into feature vectors or arrays containing the elements for each of the individual columns. Setting groupColumns to false overrides that default behavior and only performs column inference without grouping columns. Keeping columns separate lets you apply different data transformations at the individual column level, rather than at the column grouping level, when preprocessing the data.
The result of InferColumns is a ColumnInferenceResults object that contains the options needed to create a TextLoader as well as column information.
For the sample dataset in taxi-fare-train.csv, column information might look like the following:
- LabelColumnName: fare_amount
- CategoricalColumnNames: vendor_id, payment_type
- NumericColumnNames: rate_code, passenger_count, trip_time_in_secs, trip_distance
Once you have your column information, use the TextLoader.Options defined by the ColumnInferenceResults to create a TextLoader to load your data into an IDataView.
// Create text loader
TextLoader loader = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions);
// Load data into IDataView
IDataView data = loader.Load(dataPath);
It's often good practice to split your data into train and validation sets. Use TrainTestSplit to create an 80% training and 20% validation split of your dataset.
TrainTestData trainValidationData = ctx.Data.TrainTestSplit(data, testFraction: 0.2);
Define your pipeline
Your pipeline defines the data processing steps and the trainer to use for training your model.
SweepablePipeline pipeline =
ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
.Append(ctx.Auto().Regression(labelColumnName: columnInference.ColumnInformation.LabelColumnName));
A SweepablePipeline is a collection of SweepableEstimator. A SweepableEstimator is an ML.NET Estimator with a SearchSpace.
The Featurizer is a convenience API that builds a sweepable pipeline of data processing sweepable estimators based on the column information you provide. Instead of building a pipeline from scratch, Featurizer automates the data preprocessing step. For more information on supported transforms by ML.NET, see the data transformations guide.
The Featurizer output is a single column containing a numerical feature vector representing the transformed data for each of the columns. This feature vector is then used as input for the algorithms used to train a machine learning model.
If you want finer control over your data preprocessing, you can create a pipeline with each of the individual preprocessing steps. For more information, see the prepare data for building a model guide.
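For illustration, a manual pipeline for the taxi dataset might start like the following. This is a minimal sketch rather than the article's sample code, and it assumes the column names shown earlier:
// Minimal sketch: preprocess individual columns instead of using Featurizer
var manualPreprocessing =
    ctx.Transforms.Categorical.OneHotEncoding(new[]
    {
        new InputOutputColumnPair("vendor_id"),
        new InputOutputColumnPair("payment_type")
    })
    .Append(ctx.Transforms.Concatenate("Features",
        "vendor_id", "payment_type", "rate_code", "passenger_count",
        "trip_time_in_secs", "trip_distance"));
You could then append the trainer of your choice to this chain.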
Tip
Use Featurizer with ColumnInferenceResults to maximize the utility of AutoML.
For training, AutoML provides a sweepable pipeline with default trainers and search space configurations for the following machine learning tasks:
- Binary classification
- Multiclass classification
- Regression
For the taxi fare prediction problem, since the goal is to predict a numerical value, use Regression. For more information on choosing a task, see Machine learning tasks in ML.NET.
Configure your experiment
First, create an AutoML experiment. An AutoMLExperiment is a collection of TrialResult.
AutoMLExperiment experiment = ctx.Auto().CreateExperiment();
Once your experiment is created, use the extension methods it provides to configure different settings.
experiment
.SetPipeline(pipeline)
.SetRegressionMetric(RegressionMetric.RSquared, labelColumn: columnInference.ColumnInformation.LabelColumnName)
.SetTrainingTimeInSeconds(60)
.SetDataset(trainValidationData);
In this example, you:
- Set the sweepable pipeline to run during the experiment by calling SetPipeline.
- Choose RSquared as the metric to optimize during training by calling SetRegressionMetric. For more information on evaluation metrics, see the evaluate your ML.NET model with metrics guide.
- Set 60 seconds as the amount of time you want to train for by calling SetTrainingTimeInSeconds. A good heuristic to determine how long to train for is the size of your data. Typically, larger datasets require longer training time. For more information, see training time guidance.
- Provide the training and validation datasets to use by calling SetDataset.
Once your experiment is defined, you'll want some way to track its progress. The quickest way to track progress is to subscribe to the Log event from MLContext.
// Log experiment trials
ctx.Log += (_, e) => {
    if (e.Source.Equals("AutoMLExperiment"))
    {
        Console.WriteLine(e.RawMessage);
    }
};
Run your experiment
Now that you've defined your experiment, use the RunAsync method to start your experiment.
TrialResult experimentResults = await experiment.RunAsync();
Once the time to train expires, the result is a TrialResult for the best model found during training.
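For example, here's a minimal sketch of inspecting the result, assuming the experimentResults variable from the previous step:
// Minimal sketch: inspect the best trial's metric and grab the trained model
Console.WriteLine($"Best trial R-Squared: {experimentResults.Metric}");
var bestModel = experimentResults.Model;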
At this point, you can save your model or use it for making predictions. For more information on how to use an ML.NET model, see the following guides:
Modify column inference results
Because InferColumns only loads a subset of your data, it's possible that edge cases outside of the sampled rows aren't caught and the wrong data types are set for your columns. You can update the properties of ColumnInformation to account for those cases where the column inference results aren't correct.
For example, in the taxi fare dataset, the data in the rate_code column is a number. However, that numerical value represents a category. By default, calling InferColumns will place rate_code in the NumericColumnNames property instead of CategoricalColumnNames. Because these properties are .NET collections, you can use standard operations to add and remove items from them.
You can do the following to update the ColumnInformation for rate_code.
columnInference.ColumnInformation.NumericColumnNames.Remove("rate_code");
columnInference.ColumnInformation.CategoricalColumnNames.Add("rate_code");
Exclude trainers
By default, AutoML tries multiple trainers as part of the training process to see which one works best for your data. However, throughout the training process you might discover there are some trainers that use up too many compute resources or don't provide good evaluation metrics. You have the option to exclude trainers from the training process. Which trainers are used depends on the task. For a list of supported trainers in ML.NET, see the Machine learning tasks in ML.NET guide.
For example, in the taxi fare regression scenario, to exclude the LightGBM algorithm, set the useLgbm parameter to false.
ctx.Auto().Regression(labelColumnName: columnInference.ColumnInformation.LabelColumnName, useLgbm:false)
The process for excluding trainers in other tasks like binary and multiclass classification works the same way.
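For instance, a binary classification pipeline that excludes LightGBM might look like the following sketch (the labelColumnName value here is illustrative):
ctx.Auto().BinaryClassification(labelColumnName: "Label", useLgbm: false)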
Customize a sweepable estimator
When you want more granular customization of the estimator options included as part of your sweepable pipeline, you need to:
- Initialize a search space
- Use the search space to define a custom factory
- Create a sweepable estimator
- Add your sweepable estimator to your sweepable pipeline
AutoML provides a set of preconfigured search spaces for trainers in the following machine learning tasks:
- Binary classification
- Multiclass classification
- Regression
In this example, the search space used is for the SdcaRegressionTrainer. Initialize it by using SdcaOption.
var sdcaSearchSpace = new SearchSpace<SdcaOption>();
Then, use the search space to define a custom factory method to create the SdcaRegressionTrainer. In this example, the values of L1Regularization and L2Regularization are both set to something other than the default. For L1Regularization, the value set is determined by the tuner during each trial. L2Regularization is fixed for each trial to the hard-coded value. During each trial, the custom factory's output is an SdcaRegressionTrainer with the configured hyperparameters.
// Use the search space to define a custom factory to create an SdcaRegressionTrainer
var sdcaFactory = (MLContext ctx, SdcaOption param) =>
{
    var sdcaOption = new SdcaRegressionTrainer.Options();
    sdcaOption.L1Regularization = param.L1Regularization;
    sdcaOption.L2Regularization = 0.02f;
    sdcaOption.LabelColumnName = columnInference.ColumnInformation.LabelColumnName;

    return ctx.Regression.Trainers.Sdca(sdcaOption);
};
A sweepable estimator is the combination of an estimator and a search space. Now that you've defined a search space and used it to create a custom factory method for generating trainers, use the CreateSweepableEstimator method to create a new sweepable estimator.
// Define Sdca sweepable estimator (SdcaRegressionTrainer + SdcaOption search space)
var sdcaSweepableEstimator = ctx.Auto().CreateSweepableEstimator(sdcaFactory, sdcaSearchSpace);
To use your sweepable estimator in your experiment, add it to your sweepable pipeline.
SweepablePipeline pipeline =
ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
.Append(sdcaSweepableEstimator);
Because sweepable pipelines are a collection of sweepable estimators, you can configure and customize as many of these sweepable estimators as you need.
Customize your search space
There are scenarios where you want to go beyond customizing the sweepable estimators used in your experiment and want to control the search space range. You can do so by accessing the search space properties using keys. In this case, the L1Regularization parameter is a float. Therefore, to customize the search range, use UniformSingleOption.
sdcaSearchSpace["L1Regularization"] = new UniformSingleOption(min: 0.01f, max: 2.0f, logBase: false, defaultValue: 0.01f);
Depending on the data type of the hyperparameter you want to set, you can choose from the following options:
- Numbers
- Booleans and strings
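As an illustration, the following sketch sets several option types on a hypothetical search space; the keys and ranges are made up for the example:
var exampleSearchSpace = new SearchSpace();
// Integer and floating-point hyperparameters use uniform numeric options
exampleSearchSpace["BatchSize"] = new UniformIntOption(min: 16, max: 256, defaultValue: 32);
exampleSearchSpace["LearningRate"] = new UniformDoubleOption(min: 0.0001, max: 0.1, logBase: true, defaultValue: 0.001);
// Boolean and string hyperparameters use choice options
exampleSearchSpace["UseBias"] = new ChoiceOption(true, false);
exampleSearchSpace["Booster"] = new ChoiceOption("gbtree", "dart");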
Search spaces can contain nested search spaces as well.
var searchSpace = new SearchSpace();
searchSpace["SingleOption"] = new UniformSingleOption(min: -10f, max: 10f, defaultValue: 0f);

var nestedSearchSpace = new SearchSpace();
nestedSearchSpace["IntOption"] = new UniformIntOption(min: -10, max: 10, defaultValue: 0);
searchSpace["Nest"] = nestedSearchSpace;
Another option for customizing search ranges is by extending them. For example, SdcaOption only provides the L1Regularization and L2Regularization parameters. However, SdcaRegressionTrainer has more parameters you can set, such as BiasLearningRate.
To extend the search space, create a new class, such as SdcaExtendedOption, that inherits from SdcaOption.
public class SdcaExtendedOption : SdcaOption
{
    [Range(0.10f, 1f, 0.01f)]
    public float BiasLearningRate { get; set; }
}
To specify the search space range, use RangeAttribute, which is equivalent to Microsoft.ML.SearchSpace.Option.
Then, anywhere you use your search space, reference the SdcaExtendedOption
instead of SdcaOption.
For example, when you initialize your search space, you can do so as follows:
var sdcaSearchSpace = new SearchSpace<SdcaExtendedOption>();
Create your own trial runner
By default, AutoML supports binary classification, multiclass classification, and regression. However, ML.NET supports many more scenarios such as:
- Recommendation
- Forecasting
- Ranking
- Image classification
- Text classification
- Sentence similarity
For scenarios that don't have preconfigured search spaces and sweepable estimators you can create your own and use a trial runner to enable AutoML for that scenario.
For example, given restaurant review data that looks like the following:
| Review | Sentiment |
|---|---|
| Wow... Loved this place. | 1 |
| Crust is not good. | 0 |
You want to use the TextClassificationTrainer trainer to analyze sentiment where 0 is negative and 1 is positive. However, there is no ctx.Auto().TextClassification()
configuration.
To use AutoML with the text classification trainer, you'll have to:
Create your own search space.
// Define TextClassification search space
public class TCOption
{
    [Range(64, 128, 32)]
    public int BatchSize { get; set; }
}
In this case, AutoML will search for different configurations of the BatchSize hyperparameter.
Create a sweepable estimator and add it to your pipeline.
// Initialize search space
var tcSearchSpace = new SearchSpace<TCOption>();

// Create factory for Text Classification trainer
var tcFactory = (MLContext ctx, TCOption param) =>
{
    return ctx.MulticlassClassification.Trainers.TextClassification(
        sentence1ColumnName: textColumnName,
        batchSize: param.BatchSize);
};

// Create text classification sweepable estimator
var tcEstimator = ctx.Auto().CreateSweepableEstimator(tcFactory, tcSearchSpace);

// Define text classification pipeline
var pipeline = ctx.Transforms.Conversion.MapValueToKey(columnInference.ColumnInformation.LabelColumnName)
    .Append(tcEstimator);
In this example, the TCOption search space and a custom TextClassificationTrainer factory are used to create a sweepable estimator.
Create a custom trial runner.
To create a custom trial runner, implement ITrialRunner:
public class TCRunner : ITrialRunner
{
    private readonly MLContext _context;
    private readonly TrainTestData _data;
    private readonly IDataView _trainDataset;
    private readonly IDataView _evaluateDataset;
    private readonly SweepablePipeline _pipeline;
    private readonly string _labelColumnName;
    private readonly MulticlassClassificationMetric _metric;

    public TCRunner(
        MLContext context,
        TrainTestData data,
        SweepablePipeline pipeline,
        string labelColumnName = "Label",
        MulticlassClassificationMetric metric = MulticlassClassificationMetric.MicroAccuracy)
    {
        _context = context;
        _data = data;
        _trainDataset = data.TrainSet;
        _evaluateDataset = data.TestSet;
        _labelColumnName = labelColumnName;
        _pipeline = pipeline;
        _metric = metric;
    }

    public void Dispose()
    {
        return;
    }

    // Run trial asynchronously
    public Task<TrialResult> RunAsync(TrialSettings settings, CancellationToken ct)
    {
        try
        {
            return Task.Run(() => Run(settings));
        }
        catch (Exception ex) when (ct.IsCancellationRequested)
        {
            throw new OperationCanceledException(ex.Message, ex.InnerException);
        }
        catch (Exception)
        {
            throw;
        }
    }

    // Helper function to define trial run logic
    private TrialResult Run(TrialSettings settings)
    {
        try
        {
            // Initialize stop watch to measure time
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            // Get pipeline parameters
            var parameter = settings.Parameter["_pipeline_"];

            // Use parameters to build pipeline
            var pipeline = _pipeline.BuildFromOption(_context, parameter);

            // Train model
            var model = pipeline.Fit(_trainDataset);

            // Evaluate the model
            var predictions = model.Transform(_evaluateDataset);

            // Get metrics
            var evaluationMetrics = _context.MulticlassClassification.Evaluate(predictions, labelColumnName: _labelColumnName);
            var chosenMetric = GetMetric(evaluationMetrics);

            return new TrialResult()
            {
                Metric = chosenMetric,
                Model = model,
                TrialSettings = settings,
                DurationInMilliseconds = stopWatch.ElapsedMilliseconds
            };
        }
        catch (Exception)
        {
            return new TrialResult()
            {
                Metric = double.MinValue,
                Model = null,
                TrialSettings = settings,
                DurationInMilliseconds = 0,
            };
        }
    }

    // Helper function to choose metric used by experiment
    private double GetMetric(MulticlassClassificationMetrics metric)
    {
        return _metric switch
        {
            MulticlassClassificationMetric.MacroAccuracy => metric.MacroAccuracy,
            MulticlassClassificationMetric.MicroAccuracy => metric.MicroAccuracy,
            MulticlassClassificationMetric.LogLoss => metric.LogLoss,
            MulticlassClassificationMetric.LogLossReduction => metric.LogLossReduction,
            MulticlassClassificationMetric.TopKAccuracy => metric.TopKAccuracy,
            _ => throw new NotImplementedException(),
        };
    }
}
The TCRunner implementation in this example:
- Extracts the hyperparameters chosen for that trial
- Uses the hyperparameters to create an ML.NET pipeline
- Uses the ML.NET pipeline to train a model
- Evaluates the model
- Returns a TrialResult object with the information for that trial
Initialize your custom trial runner
var tcRunner = new TCRunner(context: ctx, data: trainValidationData, pipeline: pipeline);
Create and configure your experiment. Use the SetTrialRunner extension method to add your custom trial runner to your experiment.
AutoMLExperiment experiment = ctx.Auto().CreateExperiment();

// Configure AutoML experiment
experiment
    .SetPipeline(pipeline)
    .SetMulticlassClassificationMetric(MulticlassClassificationMetric.MicroAccuracy, labelColumn: columnInference.ColumnInformation.LabelColumnName)
    .SetTrainingTimeInSeconds(120)
    .SetDataset(trainValidationData)
    .SetTrialRunner(tcRunner);
Run your experiment
var tcCts = new CancellationTokenSource();
TrialResult textClassificationExperimentResults = await experiment.RunAsync(tcCts.Token);
Choose a different tuner
AutoML supports various tuning algorithms to iterate through the search space in search of the optimal hyperparameters. By default, it uses the Eci Cost Frugal tuner. Using experiment extension methods, you can choose another tuner that best fits your scenario.
Use the following methods to set your tuner:
- SMAC - SetSmacTuner
- Grid Search - SetGridSearchTuner
- Random Search - SetRandomSearchTuner
- Cost Frugal - SetCostFrugalTuner
- Eci Cost Frugal - SetEciCostFrugalTuner
For example, to use the grid search tuner, your code might look like the following:
experiment.SetGridSearchTuner();
Configure experiment monitoring
The quickest way to monitor the progress of an experiment is to subscribe to the Log event from MLContext. However, the Log event outputs a raw dump of the logs generated by AutoML during each trial. Because of the large amount of unformatted information, it's difficult to follow the experiment's progress.
For a more controlled monitoring experience, implement a class with the IMonitor interface.
public class AutoMLMonitor : IMonitor
{
    private readonly SweepablePipeline _pipeline;

    public AutoMLMonitor(SweepablePipeline pipeline)
    {
        _pipeline = pipeline;
    }

    public IEnumerable<TrialResult> GetCompletedTrials() => _completedTrials;

    public void ReportBestTrial(TrialResult result)
    {
        return;
    }

    public void ReportCompletedTrial(TrialResult result)
    {
        var trialId = result.TrialSettings.TrialId;
        var timeToTrain = result.DurationInMilliseconds;
        var pipeline = _pipeline.ToString(result.TrialSettings.Parameter);
        Console.WriteLine($"Trial {trialId} finished training in {timeToTrain}ms with pipeline {pipeline}");
    }

    public void ReportFailTrial(TrialSettings settings, Exception exception = null)
    {
        if (exception.Message.Contains("Operation was canceled."))
        {
            Console.WriteLine($"{settings.TrialId} cancelled. Time budget exceeded.");
            return;
        }

        Console.WriteLine($"{settings.TrialId} failed with exception {exception.Message}");
    }

    public void ReportRunningTrial(TrialSettings setting)
    {
        return;
    }
}
The IMonitor interface has four lifecycle events: ReportBestTrial, ReportCompletedTrial, ReportFailTrial, and ReportRunningTrial.
Tip
Although it's not required, include your SweepablePipeline in your monitor so you can inspect the pipeline that was generated for a trial using the Parameter property of the TrialSettings.
In this example, only ReportCompletedTrial and ReportFailTrial are implemented.
Once you've implemented your monitor, set it as part of your experiment configuration using SetMonitor.
var monitor = new AutoMLMonitor(pipeline);
experiment.SetMonitor(monitor);
Then, run your experiment:
var cts = new CancellationTokenSource();
TrialResult experimentResults = await experiment.RunAsync(cts.Token);
When you run the experiment with this implementation, the output should look similar to the following:
Trial 0 finished training in 5835ms with pipeline ReplaceMissingValues=>OneHotEncoding=>Concatenate=>FastForestRegression
Trial 1 finished training in 15080ms with pipeline ReplaceMissingValues=>OneHotEncoding=>Concatenate=>SdcaRegression
Trial 2 finished training in 3941ms with pipeline ReplaceMissingValues=>OneHotHashEncoding=>Concatenate=>FastTreeRegression
Persist trials
By default, AutoML only stores the TrialResult for the best model. However, if you want to persist each of the trials, you can do so from within your monitor.
Inside your monitor:
Define a property for your completed trials and a method for accessing them.
private readonly List<TrialResult> _completedTrials;

public IEnumerable<TrialResult> GetCompletedTrials() => _completedTrials;
Initialize it in your constructor.
public AutoMLMonitor(SweepablePipeline pipeline)
{
    //...
    _completedTrials = new List<TrialResult>();
    //...
}
Append each trial result inside your ReportCompletedTrial lifecycle method.
public void ReportCompletedTrial(TrialResult result)
{
    //...
    _completedTrials.Add(result);
}
When training completes, you can access all the completed trials by calling GetCompletedTrials.
var completedTrials = monitor.GetCompletedTrials();
At this point, you can perform additional processing on the collection of completed trials. For example, you can choose a model other than the one selected by AutoML, log trial results to a database, or rebuild the pipeline from any of the completed trials.
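For example, here's a minimal sketch of choosing the completed trial with the highest metric value yourself, assuming the completedTrials collection from the previous step:
// Minimal sketch: pick the completed trial with the best metric value
var bestCompletedTrial = completedTrials.OrderByDescending(t => t.Metric).First();
Console.WriteLine($"Trial {bestCompletedTrial.TrialSettings.TrialId} scored {bestCompletedTrial.Metric}");
var alternativeModel = bestCompletedTrial.Model;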
Cancel experiments
When you run experiments asynchronously, make sure to cleanly terminate the process. To do so, use a CancellationToken.
Warning
Cancelling an experiment will not save any of the intermediary outputs. Set a checkpoint to save intermediary outputs.
var cts = new CancellationTokenSource();
TrialResult experimentResults = await experiment.RunAsync(cts.Token);
Set checkpoints
Checkpoints provide a way for you to save intermediary outputs from the training process in the event of an early termination or error. To set a checkpoint, use the SetCheckpoint extension method and provide a directory to store the intermediary outputs.
var checkpointPath = Path.Join(Directory.GetCurrentDirectory(), "automl");
experiment.SetCheckpoint(checkpointPath);
Determine feature importance
As machine learning is introduced into more aspects of everyday life such as healthcare, it's of utmost importance to understand why a machine learning model makes the decisions it does. Permutation Feature Importance (PFI) is a technique used to explain classification, ranking, and regression models. At a high level, the way it works is by randomly shuffling data one feature at a time for the entire dataset and calculating how much the performance metric of interest decreases. The larger the change, the more important that feature is. For more information on PFI, see interpret model predictions using Permutation Feature Importance.
Note
Calculating PFI can be a time consuming operation. How much time it takes to calculate is proportional to the number of feature columns you have. The more features, the longer PFI will take to run.
To determine feature importance using AutoML:
Get the best model.
var bestModel = experimentResults.Model;
Apply the model to your dataset.
var transformedData = bestModel.Transform(trainValidationData.TrainSet);
Calculate feature importance using PermutationFeatureImportance. In this case, the task is regression, but the same concept applies to other tasks like ranking and classification.
var pfiResults = ctx.Regression.PermutationFeatureImportance(bestModel, transformedData, labelColumnName: columnInference.ColumnInformation.LabelColumnName, permutationCount: 3);
Order feature importance by changes to evaluation metrics.
var featureImportance = pfiResults.Select(x => Tuple.Create(x.Key, x.Value.RSquared.Mean))
    .OrderByDescending(x => x.Item2);
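To inspect the results, a minimal sketch that prints each feature name alongside its metric change:
foreach (var feature in featureImportance)
{
    Console.WriteLine($"{feature.Item1}: {feature.Item2}");
}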