Train and evaluate a model

Learn how to build machine learning models, collect metrics, and measure performance with ML.NET. Although this sample trains a regression model, the concepts apply to most of the other algorithms.

Split data for training and testing

The goal of a machine learning model is to identify patterns within training data. These patterns are used to make predictions using new data.

The data can be modeled by a class like HousingData.

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

Consider the following data, which is loaded into an IDataView.

HousingData[] housingData = new HousingData[]
{
    new HousingData
    {
        Size = 600f,
        HistoricalPrices = new float[] { 100000f, 125000f, 122000f },
        CurrentPrice = 170000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 200000f, 250000f, 230000f },
        CurrentPrice = 225000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 126000f, 130000f, 200000f },
        CurrentPrice = 195000f
    },
    new HousingData
    {
        Size = 850f,
        HistoricalPrices = new float[] { 150000f, 175000f, 210000f },
        CurrentPrice = 205000f
    },
    new HousingData
    {
        Size = 900f,
        HistoricalPrices = new float[] { 155000f, 190000f, 220000f },
        CurrentPrice = 210000f
    },
    new HousingData
    {
        Size = 550f,
        HistoricalPrices = new float[] { 99000f, 98000f, 130000f },
        CurrentPrice = 180000f
    }
};
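The snippets that follow refer to mlContext and data, neither of which is defined in this section. A minimal sketch of how they might be set up, assuming the Microsoft.ML NuGet package is referenced and the housingData array above is in scope:

```csharp
using Microsoft.ML;

// Shared context object used by all ML.NET operations in this article
MLContext mlContext = new MLContext();

// Wrap the in-memory array in an IDataView
IDataView data = mlContext.Data.LoadFromEnumerable(housingData);
```

LoadFromEnumerable is used here because the sample data lives in memory. Data stored in a file would typically be loaded with a file-based loader instead.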

Use the TrainTestSplit method to split the data into train and test sets. The result is a TrainTestData object that contains two IDataView members, one for the train set and one for the test set. The testFraction parameter determines the split percentage. The snippet below holds out 20 percent of the original data for the test set.

DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;
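Because rows are assigned to the two sets at random, the split can differ from run to run. As a sketch, assuming a repeatable split is wanted, the optional seed parameter of TrainTestSplit can be fixed. The variable name seededSplit and the seed value 1 are arbitrary choices made for illustration:

```csharp
using Microsoft.ML;

// Same 80/20 split as above, but repeatable because the seed is fixed
DataOperationsCatalog.TrainTestData seededSplit =
    mlContext.Data.TrainTestSplit(data, testFraction: 0.2, seed: 1);

IDataView seededTrainData = seededSplit.TrainSet;
IDataView seededTestData = seededSplit.TestSet;
```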

Prepare the data

The data needs to be pre-processed before training a machine learning model. More information on data preparation can be found in the data prep how-to article as well as on the transforms page.

ML.NET algorithms have constraints on input column types. Additionally, default values are used for input and output column names when no values are specified.

Working with expected column types

The machine learning algorithms in ML.NET expect a float vector of known size as input. Apply the VectorType attribute to your data model when all of the data is already in numerical format and is intended to be processed together (for example, image pixels).

If the data is not all numerical, or you want to apply different transformations to individual columns, process each column first. Then use the Concatenate method to combine the processed columns into a single feature vector that is output to a new column.

The following snippet combines the Size and HistoricalPrices columns into a single feature vector that is output to a new column called Features. Because there is a difference in scales, NormalizeMinMax is applied to the Features column to normalize the data.

// Define Data Prep Estimator
// 1. Concatenate Size and HistoricalPrices into a single feature vector output to a new column called Features
// 2. Normalize Features vector
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"));

// Create data prep transformer
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(trainData);

// Apply transforms to training data
IDataView transformedTrainingData = dataPrepTransformer.Transform(trainData);

Working with default column names

ML.NET algorithms use default column names when none are specified. All trainers have a parameter called featureColumnName for the inputs of the algorithm and, when applicable, a parameter called labelColumnName for the expected value. By default, those values are Features and Label, respectively.

By using the Concatenate method during pre-processing to create a new column called Features, there is no need to specify the feature column name in the parameters of the algorithm since it already exists in the pre-processed IDataView. The label column is CurrentPrice, but since the ColumnName attribute is used in the data model, ML.NET renames the CurrentPrice column to Label which removes the need to provide the labelColumnName parameter to the machine learning algorithm estimator.

If you don't want to use the default column names, pass in the names of the feature and label columns as parameters when defining the machine learning algorithm estimator as demonstrated by the subsequent snippet:

var userDefinedColumnSdcaEstimator = mlContext.Regression.Trainers.Sdca(labelColumnName: "MyLabelColumnName", featureColumnName: "MyFeatureColumnName");

Caching data

By default, when data is processed, it is lazily loaded or streamed which means that trainers may load the data from disk and iterate over it multiple times during training. Therefore, caching is recommended for datasets that fit into memory to reduce the number of times data is loaded from disk. Caching is done as part of an EstimatorChain by using AppendCacheCheckpoint.

It's recommended to use AppendCacheCheckpoint before any trainers in the pipeline.

In the following EstimatorChain, AppendCacheCheckpoint is added before the StochasticDualCoordinateAscent trainer so that the results of the preceding estimators are cached for later use by the trainer.

// 1. Concatenate Size and HistoricalPrices into a single feature vector output to a new column called Features
// 2. Normalize Features vector
// 3. Cache prepared data
// 4. Use Sdca trainer to train the model
IEstimator<ITransformer> trainingPipeline =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .AppendCacheCheckpoint(mlContext)
        .Append(mlContext.Regression.Trainers.Sdca());

Train the machine learning model

Once the data is pre-processed, use the Fit method to train the machine learning model with the StochasticDualCoordinateAscent regression algorithm.

// Define StochasticDualCoordinateAscent regression algorithm estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// Build machine learning model
var trainedModel = sdcaEstimator.Fit(transformedTrainingData);

Extract model parameters

After the model has been trained, extract the learned ModelParameters for inspection or retraining. The LinearRegressionModelParameters provide the bias and learned coefficients or weights of the trained model.

var trainedModelParameters = trainedModel.Model as LinearRegressionModelParameters;
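As an illustration of what inspection might look like, the sketch below prints the bias and weights of the linear model. It assumes trainedModelParameters from the previous snippet is in scope and not null:

```csharp
using System;

// Bias (intercept) of the linear model
Console.WriteLine($"Bias: {trainedModelParameters.Bias}");

// One weight per slot of the Features vector
// (Size followed by the three HistoricalPrices values)
for (int i = 0; i < trainedModelParameters.Weights.Count; i++)
{
    Console.WriteLine($"Weight {i}: {trainedModelParameters.Weights[i]}");
}
```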

Note

Other models have parameters that are specific to their tasks. For example, the K-Means algorithm puts data into clusters based on centroids, and KMeansModelParameters contains a property that stores these learned centroids. To learn more, visit the Microsoft.ML.Trainers API Documentation and look for classes that contain ModelParameters in their name.

Evaluate model quality

To help choose the best-performing model, it is essential to evaluate its performance on test data. Use the Evaluate method to measure various metrics for the trained model.

Note

The Evaluate method produces different metrics depending on which machine learning task was performed. For more details, visit the Microsoft.ML.Data API Documentation and look for classes that contain Metrics in their name.

// Measure trained model performance
// Apply data prep transformer to test data
IDataView transformedTestData = dataPrepTransformer.Transform(testData);

// Use trained model to make inferences on test data
IDataView testDataPredictions = trainedModel.Transform(transformedTestData);

// Extract model metrics and get RSquared
RegressionMetrics trainedModelMetrics = mlContext.Regression.Evaluate(testDataPredictions);
double rSquared = trainedModelMetrics.RSquared;

In the previous code sample:

  1. The test data set is pre-processed using the data preparation transforms previously defined.
  2. The trained machine learning model is used to make predictions on the test data.
  3. The Evaluate method compares the values in the CurrentPrice column of the test data set against the Score column of the newly output predictions to calculate the metrics for the regression model. One of these metrics, R-Squared, is stored in the rSquared variable.
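R-Squared is only one of the values carried by RegressionMetrics. As a sketch, assuming trainedModelMetrics from the sample above is in scope, the error-based metrics can be read in the same way:

```csharp
using System;

// Error metrics: lower values indicate a better fit
Console.WriteLine($"Mean absolute error:     {trainedModelMetrics.MeanAbsoluteError}");
Console.WriteLine($"Mean squared error:      {trainedModelMetrics.MeanSquaredError}");
Console.WriteLine($"Root mean squared error: {trainedModelMetrics.RootMeanSquaredError}");
```

These error metrics are expressed in units of the label (or its square), so for this sample they are on the scale of house prices rather than between 0 and 1.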

Note

In this small example, R-Squared falls outside the range of 0 to 1 because of the limited size of the data. In a real-world scenario, expect a value between 0 and 1.
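To see how a value outside that range can arise, it helps to look at the standard definition of the coefficient of determination. This is the textbook formula, given on the assumption that the RSquared metric follows it. Here y_i are the actual labels, \hat{y}_i the predicted scores, and \bar{y} the mean of the actual labels in the test set:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

% Hypothetical worked example (values invented for illustration only):
% y = (1, 2, 3), \hat{y} = (3, 3, 3), \bar{y} = 2
% Numerator:   (1-3)^2 + (2-3)^2 + (3-3)^2 = 5
% Denominator: (1-2)^2 + (2-2)^2 + (3-2)^2 = 2
R^2 = 1 - \frac{5}{2} = -1.5
```

Whenever the squared prediction error exceeds the squared deviation of the labels from their mean, the fraction is greater than 1 and R-Squared becomes negative. This is more likely when the test set holds only a handful of rows.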