November 2018

Volume 33 Number 11

Test Run - Introduction to the ML.NET Library

By James McCaffrey

James McCaffreyThe ML.NET library is an open source collection of machine learning (ML) code that can be used directly in .NET applications. Most ML libraries, such as TensorFlow, Keras, CNTK, and PyTorch, are written in Python and call into low-level C++ routines. However, if you use a Python-based library, it’s not so easy for a .NET application to access a trained ML model. Fortunately, the ML.NET library integrates seamlessly into .NET applications.

The best way to see where this article is headed is to take a look at the demo program in Figure 1. The demo creates an ML model that predicts whether a patient will die or survive based on the patient’s age, sex, and score on a kidney medical test. Because there are only two possible outcomes, die or survive, this is a binary classification problem.

ML.NET Demo Program in Action
Figure 1 ML.NET Demo Program in Action

Behind the scenes, the demo program uses the ML.NET library to create and train a logistic regression model. As I’m writing this article, the ML.NET library is still in preview mode, so some of the information presented here may have changed by the time you’re reading this.

The demo uses a set of training data with 30 items. After the model was trained, it was applied to the source data, and achieved 66.67 percent accuracy (20 correct and 10 wrong). The demo concludes by using the trained model to predict the outcome for a 50-year-old male with a kidney test score of 4.80—the prediction is that the patient will survive.

This article assumes you have intermediate or better programming skill with C#, but doesn’t assume you know anything about the ML.NET library. The complete code and data for the demo program are presented in this article and are also available in the accompanying file download.

The Demo Program

To create the demo program, I launched Visual Studio 2017. The ML.NET library will work with either the free Community Edition or any of the commercial editions of Visual Studio 2017. The ML.NET documentation doesn’t explicitly state that Visual Studio 2017 is required, but I couldn’t get the demo program to work with Visual Studio 2015. I created a new C# console application project and named it Kidney. The ML.NET library will work with either a classic .NET or a .NET Core application type.

After the template code loaded, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to KidneyProgram.cs and I allowed Visual Studio to automatically rename class Program for me. Next, in the Solution Explorer window, I right-clicked on the Kidney project and selected the Manage NuGet Packages option. In the NuGet window, I selected the Browse tab and then entered “ML.NET” in the Search field. The ML.NET library is housed in the Microsoft.ML package. I selected that package and clicked the Install button. After a few seconds Visual Studio responded with a “successfully installed Microsoft.ML 0.3.0 to Kidney” message.

At this point I did a Build | Rebuild Solution and got a “supports only x64 architectures” error message. In the Solution Explorer window, I right-clicked on the Kidney project, and selected the Properties entry. In the Properties window, I selected the Build tab on the left, and then changed the Platform Target entry from “Any CPU” to “x64.” I also made sure I was targeting the 4.7 version of the .NET Framework. With earlier versions I got an error related to one of the math library dependencies and had to manually edit the global .csproj file. Ugh. Then I did a Build | Rebuild Solution and was successful. When working with preview-mode libraries such as ML.NET, you should expect glitches like this to be the rule rather than the exception.

The Demo Data

After creating the skeleton of the demo program, the next step was to create the training data file. The data is presented in Figure 2. In the Solution Explorer window, I right-clicked on the Kidney project and selected Add | New Item. From the new item dialog window, I selected the Text File type and named it KidneyData.txt. If you’re following along, copy the data from Figure 2 and paste it into the editor window, being careful not to have any extra trailing blank lines.

Figure 2 Kidney Data

48, +1, 4.40, survive
60, -1, 7.89, die
51, -1, 3.48, survive
66, -1, 8.41, die
40, +1, 3.05, survive
44, +1, 4.56, survive
80, -1, 6.91, die
52, -1, 5.69, survive
56, -1, 4.01, survive
55, -1, 4.48, survive
72, +1, 5.97, survive
57, -1, 6.71, die
50, -1, 6.40, survive
80, -1, 6.67, die
69, +1, 5.79, survive
39, -1, 5.42, survive
68, -1, 7.61, die
47, +1, 3.24, survive
45, +1, 4.29, survive
79, +1, 7.44, die
44, -1, 2.55, survive
52, +1, 3.71, survive
55, +1, 5.56, die
76, -1, 7.80, die
51, -1, 5.94, survive
46, +1, 5.52, survive
48, -1, 3.25, survive
58, +1, 4.71, survive
44, +1, 2.52, survive
68, -1, 8.38, die

The 30-item dataset is artificial and should be self-explanatory for the most part. The sex field is encoded as male = -1 and female = +1. Because the data has three dimensions (age, sex, test score), it’s not possible to display it in a two-dimensional graph. But you can get a good idea of the structure of the data by examining the graph of just age and kidney test score in Figure 3. The graph suggests that the data may be linearly separable. 

Kidney Data
Figure 3 Kidney Data

The Program Code

The complete demo code, with a few minor edits to save space, is presented in Figure 4. At the top of the Editor window, I removed all namespace references and replaced them with the ones shown in the code listing. The various Microsoft.ML namespaces house all ML.NET functionality. The System.Threading.Tasks namespace is needed to save or load a trained ML.NET model to file.

Figure 4 ML.NET Example Program

using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using Microsoft.ML.Models;
using System.Threading.Tasks;
namespace Kidney
{
  class KidneyProgram
  {
    public class KidneyData
    {
      [Column(ordinal: "0", name: "Age")]
      public float Age;
      [Column(ordinal: "1", name: "Sex")]
      public float Sex;
      [Column(ordinal: "2", name: "Kidney")]
      public float Kidney;
      [Column(ordinal: "3", name: "Label")]
      public string Label;
    }
    public class KidneyPrediction
    {
      [ColumnName("PredictedLabel")]
      public string PredictedLabels;
    }
    static void Main(string[] args)
    {
      Console.WriteLine("ML.NET (v0.3.0 preview) demo run");
      Console.WriteLine("Survival based on age, sex, kidney");
      var pipeline = new LearningPipeline();
      string dataPath = "..\\..\\KidneyData.txt";
      pipeline.Add(new TextLoader(dataPath).
        CreateFrom<KidneyData>(separator: ','));
      pipeline.Add(new Dictionarizer("Label"));
      pipeline.Add(new ColumnConcatenator("Features", "Age",
        "Sex", "Kidney"));
      pipeline.Add(new Logistic​Regression​Binary​Classifier());
      pipeline.Add(new
        PredictedLabelColumnOriginalValueConverter()
        { PredictedLabelColumn = "PredictedLabel" });
      Console.WriteLine("\nStarting training \n");
      var model = pipeline.Train<KidneyData,
        KidneyPrediction>();
      Console.WriteLine("\nTraining complete \n");
      string ModelPath = "..\\..\\KidneyModel.zip";
      Task.Run(async () =>
      {
        await model.WriteAsync(ModelPath);
      }).GetAwaiter().GetResult();
      var testData = new TextLoader(dataPath).
        CreateFrom<KidneyData>(separator: ',');
      var evaluator = new BinaryClassificationEvaluator();
      var metrics = evaluator.Evaluate(model, testData);
      double acc = metrics.Accuracy * 100;
      Console.WriteLine("Model accuracy = " +
        acc.ToString("F2") + "%");
      Console.WriteLine("Predict 50-year male, kidney 4.80:");
      KidneyData newPatient = new KidneyData()
        { Age = 50f, Sex = -1f, Kidney = 4.80f };
      KidneyPrediction prediction = model.Predict(newPatient);
      string result = prediction.PredictedLabels;
      Console.WriteLine("Prediction = " + result);
      Console.WriteLine("\nEnd ML.NET demo");
      Console.ReadLine();
    } // Main
  } // Program
} // ns

The demo program defines a class named KidneyData, nested inside the main program class, that defines the internal structure of the training data. For example, the first column is:

[Column(ordinal: "0", name: "Age")]
public float Age;

Notice that the age field is declared type float rather than type double. In ML, type float is the default numeric type because the increase in precision you get from using type double is almost never worth the resulting memory and performance penalty. The value-to-predict must use the name “Label,” but predictor field names can be whatever you like.

The demo program defines a nested class named KidneyPrediction to hold model predictions:

public class KidneyPrediction
{
  [ColumnName("PredictedLabel")]
  public string PredictedLabels;
}

The column name “PredictedLabel” is required but, as shown, the associated string identifier doesn’t have to match.

Creating and Training the Model

The demo program creates an ML model using these seven statements:

var pipeline = new LearningPipeline();
string dataPath = "..\\..\\KidneyData.txt";
pipeline.Add(new TextLoader(dataPath).
  CreateFrom<KidneyData>(separator: ','));
pipeline.Add(new Dictionarizer("Label"));
pipeline.Add(new ColumnConcatenator("Features", "Age", "Sex", "Kidney"));
pipeline.Add(new Logistic​Regression​Binary​Classifier());
pipeline.Add(new PredictedLabelColumnOriginalValueConverter()
  { PredictedLabelColumn = "PredictedLabel" });

You can think of the pipeline object as an untrained ML model plus the data needed to train the model. Recall that the values-to-­predict in the data file are either “survive” or “die.” Because ML models only understand numeric values, the weirdly named Dictionarizer class is used to encode two strings to 0 or 1. The ColumnConcatenator constructor combines the three predictor variables into an aggregate; using a string result parameter with the name of “Features” is required.

There are many different ML techniques you can use for a binary classification problem. I use logistic regression in the demo program to keep the main ideas as clear as possible because it’s arguably the simplest and most basic form of ML. Other binary classifier algorithms supported by ML.NET include AveragedPerceptronBinaryClassifier, FastForestBinaryClassifier and LightGbmClassifier.

The demo program trains and saves the model using these statements:

var model = pipeline.Train<KidneyData, KidneyPrediction>();
string ModelPath = "..\\..\\KidneyModel.zip";
Task.Run(async () =>
{
  await model.WriteAsync(ModelPath);
}).GetAwaiter().GetResult();

The ML.NET library Train method is very sophisticated. If you refer back to the screenshot in Figure 1, you can see that Train performs automatic normalization of predictor variables, which scales them so that large predictor values, such as a person’s annual income, don’t swamp small values, such as a person's number of children. The Train method also uses regularization, which is an advanced technique to improve the accuracy of a model. In short, ML.NET does all kinds of advanced processing, without you having to explicitly configure parameter values.

Saving and Evaluating the Model

After the model has been trained, it’s saved to disk like so:

string ModelPath = "..\\..\\KidneyModel.zip";
Task.Run(async () =>
{
  await model.WriteAsync(ModelPath);
}).GetAwaiter().GetResult();

Because the WriteAsync method is asynchronous, it’s not so easy to call it. The approach I prefer is the wrapper technique shown. The lack of a non-async method to save an ML.NET model is a bit surprising, even for a library that’s in preview mode.

The demo program makes the assumption that the program executable is two directories below the project root directory. In a production system you’d want to verify that the target directory exists. Typically, in an ML.NET project, I like to create a Data subdirectory and a Models subdirectory off the project root folder (Kidney in this example) and save my data and models in those directories.

The model is evaluated with these statements:

var testData = new TextLoader(dataPath).
  CreateFrom<KidneyData>(separator: ',');
var evaluator = new BinaryClassificationEvaluator();
var metrics = evaluator.Evaluate(model, testData);
double acc = metrics.Accuracy * 100;
Console.WriteLine("Model accuracy = " +
  acc.ToString("F2") + "%");

In most ML scenarios you’d have two data files—one set used just for training and a second test dataset used only for model evaluation. For simplicity, the demo program reuses the single 30-item data file for model evaluation.

The Evaluate method returns an object containing several metrics, including log loss, precision, recall, F1 score and so on. The return object also has a neat ConfusionMatrix object that can be used to display counts such as the number of patients who were predicted to survive but in fact died.

Using the Trained Model

The demo program shows how to use the trained model to make a prediction:

Console.WriteLine("Predict 50-year male kidney = 4.80:");
KidneyData newPatient = new KidneyData()
  { Age = 50f, Sex = -1f, Kidney = 4.80f };
KidneyPrediction prediction = model.Predict(newPatient);
string result = prediction.PredictedLabels;
Console.WriteLine("Prediction = " + result);

Notice that the numeric literals for age, sex and kidney score use the “f” modifier because the model is expecting type float values. In this example, the trained model was available because the program just finished training. If you wanted to make a prediction from a different program, you’d load the trained model using the ReadAsync method along the lines of:

PredictionModel<KidneyData,
  KidneyPrediction> model = null;
Task.Run(async () =>
{
  model2 = await PredictionModel.ReadAsync
  <KidneyData, KidneyPrediction>(ModelPath);
}).GetAwaiter().GetResult();

Wrapping Up

Even though the ML.NET library is new, its origins go back many years. Shortly after the introduction of the Microsoft .NET Framework in 2002, Microsoft Research began a project called “text mining search and navigation”, or TMSN, to enable software developers to include ML code in Microsoft products and technologies. The project was very successful, and over the years grew in size and usage internally at Microsoft. Somewhere around 2011 the library was renamed to “the learning code” (TLC). TLC is widely used within Microsoft and is currently in version 3.9. The ML.NET library is a direct offshoot of TLC, with Microsoft-specific features removed. 


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee and Ricky Loynd


Discuss this article in the MSDN Magazine forum