Train and evaluate a model
Learn how to build machine learning models, collect metrics, and measure performance with ML.NET. Although this sample trains a regression model, the concepts apply to most of the other algorithms.
Split data for training and testing
The goal of a machine learning model is to identify patterns within training data. These patterns are used to make predictions using new data.
The data can be modeled by a class like HousingData.
public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }
    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }
    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}
Consider the following data, which is loaded into an IDataView.
HousingData[] housingData = new HousingData[]
{
    new HousingData
    {
        Size = 600f,
        HistoricalPrices = new float[] { 100000f, 125000f, 122000f },
        CurrentPrice = 170000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 200000f, 250000f, 230000f },
        CurrentPrice = 225000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 126000f, 130000f, 200000f },
        CurrentPrice = 195000f
    },
    new HousingData
    {
        Size = 850f,
        HistoricalPrices = new float[] { 150000f, 175000f, 210000f },
        CurrentPrice = 205000f
    },
    new HousingData
    {
        Size = 900f,
        HistoricalPrices = new float[] { 155000f, 190000f, 220000f },
        CurrentPrice = 210000f
    },
    new HousingData
    {
        Size = 550f,
        HistoricalPrices = new float[] { 99000f, 98000f, 130000f },
        CurrentPrice = 180000f
    }
};
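The later snippets refer to mlContext and data without showing where they come from. A minimal sketch of one way to create them, assuming the Microsoft.ML NuGet package is referenced and the housingData array above is in scope; the variable names are chosen only to match the snippets that follow:

```csharp
using Microsoft.ML;

// Create the ML.NET context that all catalog operations hang off.
MLContext mlContext = new MLContext();

// Load the in-memory array into an IDataView.
// For in-memory data the LoadColumn attributes are not used; the schema comes
// from the HousingData properties and their ColumnName/VectorType attributes.
IDataView data = mlContext.Data.LoadFromEnumerable(housingData);
```

Because CurrentPrice carries the ColumnName("Label") attribute, the resulting IDataView exposes three columns: Size, HistoricalPrices, and Label.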
Use the TrainTestSplit method to split the data into train and test sets. The result is a TrainTestData object that contains two IDataView members, one for the train set and the other for the test set. The testFraction parameter determines the fraction of the data held out for testing. The snippet below holds out 20 percent of the original data for the test set.
DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;
Prepare the data
The data needs to be pre-processed before training a machine learning model. For more information on data preparation, see the data prep how-to article and the transforms page.
ML.NET algorithms have constraints on input column types. Additionally, default values are used for input and output column names when no values are specified.
Working with expected column types
The machine learning algorithms in ML.NET expect a float vector of known size as input. Apply the VectorType attribute to your data model when all of the data is already in numerical format and is intended to be processed together (for example, image pixels).
If the data is not all numerical and you want to apply different data transformations to each of the columns individually, use the Concatenate method after all of the columns have been processed to combine the individual columns into a single feature vector that is output to a new column.
The following snippet combines the Size and HistoricalPrices columns into a single feature vector that is output to a new column called Features. Because the columns differ in scale, NormalizeMinMax is applied to the Features column to normalize the data.
// Define Data Prep Estimator
// 1. Concatenate Size and HistoricalPrices into a single feature vector output to a new column called Features
// 2. Normalize Features vector
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"));
// Create data prep transformer
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(trainData);
// Apply transforms to training data
IDataView transformedTrainingData = dataPrepTransformer.Transform(trainData);
Working with default column names
ML.NET algorithms use default column names when none are specified. All trainers have a parameter called featureColumnName for the inputs of the algorithm and, when applicable, a parameter called labelColumnName for the expected value. By default, those values are Features and Label respectively.
Because the Concatenate method is used during pre-processing to create a new column called Features, there is no need to specify the feature column name in the parameters of the algorithm; the column already exists in the pre-processed IDataView. The label column is CurrentPrice, but because the ColumnName attribute is used in the data model, ML.NET renames the CurrentPrice column to Label, which removes the need to provide the labelColumnName parameter to the machine learning algorithm estimator.
If you don't want to use the default column names, pass in the names of the feature and label columns as parameters when defining the machine learning algorithm estimator as demonstrated by the subsequent snippet:
var userDefinedColumnSdcaEstimator = mlContext.Regression.Trainers.Sdca(labelColumnName: "MyLabelColumnName", featureColumnName: "MyFeatureColumnName");
Caching data
By default, when data is processed, it is lazily loaded or streamed, which means that trainers may load the data from disk and iterate over it multiple times during training. Caching is therefore recommended for datasets that fit into memory, because it reduces the number of times data is loaded from disk. Caching is done as part of an EstimatorChain by using AppendCacheCheckpoint.
It's recommended to use AppendCacheCheckpoint before any trainers in the pipeline.
In the following EstimatorChain, adding AppendCacheCheckpoint before the StochasticDualCoordinateAscent trainer caches the results of the previous estimators for later use by the trainer.
// 1. Concatenate Size and HistoricalPrices into a single feature vector output to a new column called Features
// 2. Normalize Features vector
// 3. Cache prepared data
// 4. Use Sdca trainer to train the model
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .AppendCacheCheckpoint(mlContext)
        .Append(mlContext.Regression.Trainers.Sdca());
Train the machine learning model
Once the data is pre-processed, use the Fit method to train the machine learning model with the StochasticDualCoordinateAscent regression algorithm.
// Define StochasticDualCoordinateAscent regression algorithm estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
// Build machine learning model
var trainedModel = sdcaEstimator.Fit(transformedTrainingData);
Extract model parameters
After the model has been trained, extract the learned ModelParameters for inspection or retraining. The LinearRegressionModelParameters provide the bias and the learned coefficients, or weights, of the trained model.
var trainedModelParameters = trainedModel.Model as LinearRegressionModelParameters;
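As an illustration of inspecting these parameters, the following sketch prints the bias and weights. It assumes the trainedModelParameters variable from the preceding line is non-null and that the model was trained on the four-slot Features vector built earlier (Size plus three HistoricalPrices values):

```csharp
using System;
using Microsoft.ML.Trainers;

// Bias is the learned intercept of the linear model.
Console.WriteLine($"Bias: {trainedModelParameters.Bias}");

// Weights holds one coefficient per slot of the Features vector,
// in the order the columns were concatenated.
for (int i = 0; i < trainedModelParameters.Weights.Count; i++)
{
    Console.WriteLine($"Weight[{i}]: {trainedModelParameters.Weights[i]}");
}
```

The weights apply to the normalized Features values, so their magnitudes are not directly comparable to the raw Size and price columns.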
Note
Other models have parameters that are specific to their tasks. For example, the K-Means algorithm puts data into clusters based on centroids, and the KMeansModelParameters contains a property that stores these learned centroids. To learn more, visit the Microsoft.ML.Trainers API Documentation and look for classes that contain ModelParameters in their name.
Evaluate model quality
To help choose the best performing model, it is essential to evaluate its performance on test data. Use the Evaluate method to measure various metrics for the trained model.
Note
The Evaluate method produces different metrics depending on which machine learning task was performed. For more details, visit the Microsoft.ML.Data API Documentation and look for classes that contain Metrics in their name.
// Measure trained model performance
// Apply data prep transformer to test data
IDataView transformedTestData = dataPrepTransformer.Transform(testData);
// Use trained model to make inferences on test data
IDataView testDataPredictions = trainedModel.Transform(transformedTestData);
// Extract model metrics and get RSquared
RegressionMetrics trainedModelMetrics = mlContext.Regression.Evaluate(testDataPredictions);
double rSquared = trainedModelMetrics.RSquared;
In the previous code sample:
- The test data set is pre-processed using the data preparation transforms previously defined.
- The trained machine learning model is used to make predictions on the test data.
- In the Evaluate method, the values in the CurrentPrice column of the test data set are compared against the Score column of the newly output predictions to calculate the metrics for the regression model, one of which, R-Squared, is stored in the rSquared variable.
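R-Squared is only one of the values on the RegressionMetrics object. A short sketch of reading the error-based metrics as well, assuming the trainedModelMetrics variable from the previous code sample is in scope:

```csharp
using System;
using Microsoft.ML.Data;

// Error metrics are expressed in units of the label (or squared units for MSE),
// so lower values indicate predictions closer to the actual prices.
Console.WriteLine($"Mean absolute error:     {trainedModelMetrics.MeanAbsoluteError}");
Console.WriteLine($"Mean squared error:      {trainedModelMetrics.MeanSquaredError}");
Console.WriteLine($"Root mean squared error: {trainedModelMetrics.RootMeanSquaredError}");

// R-Squared is unitless; values closer to 1 indicate a better fit.
Console.WriteLine($"R-Squared:               {trainedModelMetrics.RSquared}");
```

Looking at an error metric alongside R-Squared helps when the test set is very small, as it is here.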
Note
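In this small example, R-Squared may fall outside the range 0-1 because of the limited size of the data; it is negative whenever the model's predictions fit the test labels worse than simply predicting their mean. In a real-world scenario, you should expect to see a value between 0 and 1, where values closer to 1 indicate a better fit.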
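To see why the value can leave the 0-1 range, it can help to compute R-Squared by hand using its standard definition: one minus the ratio of the squared prediction error to the squared deviation of the labels from their mean. The helper below is a hypothetical sketch for illustration only. It does not use ML.NET and is not presented as how the Evaluate method is implemented.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of ML.NET) that applies the textbook definition.
public static class RSquaredSketch
{
    // R^2 = 1 - SS_res / SS_tot, where
    //   SS_res = sum of (label - score)^2
    //   SS_tot = sum of (label - mean(label))^2
    public static double Compute(double[] labels, double[] scores)
    {
        if (labels.Length == 0 || labels.Length != scores.Length)
            throw new ArgumentException("labels and scores must be non-empty and the same length.");

        double mean = labels.Average();
        double ssRes = labels.Zip(scores, (y, s) => (y - s) * (y - s)).Sum();
        double ssTot = labels.Sum(y => (y - mean) * (y - mean));

        // When every label is identical (for example, a one-row test set),
        // ssTot is 0 and the result is not a finite number.
        return 1.0 - ssRes / ssTot;
    }
}
```

For labels 1, 2, 3, this returns 1 when the scores match the labels exactly, 0 when every score equals the label mean of 2, and -3 when the scores are 3, 2, 1. With only six rows and a 20 percent hold-out, the test set is small enough that poor or undefined values like these are plausible.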