Tutorial: Categorize support issues using multiclass classification with ML.NET

2023-05-02

This tutorial shows how to use ML.NET in a .NET console application, written in C# in Visual Studio, to train a model that classifies GitHub issues and predicts the Area label for each issue.

In this tutorial, you learn how to:

- Prepare your data
- Transform the data
- Train the model
- Evaluate the model
- Predict with the trained model
- Deploy and predict with a loaded model

You can find the source code for this tutorial at the dotnet/samples repository.

Prerequisites

- Visual Studio 2022 with the ".NET Desktop Development" workload installed.
- The GitHub issues training tab-separated file (issues_train.tsv).
- The GitHub issues test tab-separated file (issues_test.tsv).

Create a console application

Create a project

Create a C# Console Application called "GitHubIssueClassification". Select Next.
Choose .NET 7 as the framework to use. Select Create.
Create a directory named Data in your project to save your data set files:

In Solution Explorer, right-click on your project and select Add > New Folder. Type "Data" and press Enter.
Create a directory named Models in your project to save your model:

In Solution Explorer, right-click on your project and select Add > New Folder. Type "Models" and press Enter.
Install the Microsoft.ML NuGet Package:

Note

This sample uses the latest stable version of the NuGet packages mentioned unless otherwise stated.

In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML and select Install. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.

Prepare your data

Download the issues_train.tsv and the issues_test.tsv data sets and save them to the Data folder you created previously. The first dataset trains the machine learning model and the second can be used to evaluate how accurate your model is.
In Solution Explorer, right-click each of the *.tsv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

Create classes and define paths

Add the following additional using directives to the top of the Program.cs file:

using Microsoft.ML;
using GitHubIssueClassification;

Create three global fields to hold the paths to the data files and the saved model, and global variables for the MLContext, IDataView, and PredictionEngine:

- _trainDataPath has the path to the dataset used to train the model.
- _testDataPath has the path to the dataset used to evaluate the model.
- _modelPath has the path where the trained model is saved.
- _mlContext is the MLContext that provides processing context.
- _trainingDataView is the IDataView used to process the training dataset.
- _predEngine is the PredictionEngine<TSrc,TDst> used for single predictions.

Add the following code to the line directly below the using directives to specify those paths and the other variables:

string _appPath = Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]) ?? ".";
string _trainDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_train.tsv");
string _testDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_test.tsv");
string _modelPath = Path.Combine(_appPath, "..", "..", "..", "Models", "model.zip");

MLContext _mlContext;
PredictionEngine<GitHubIssue, IssuePrediction> _predEngine;
ITransformer _trainedModel;
IDataView _trainingDataView;

Create some classes for your input data and predictions. Add a new class to your project:

In Solution Explorer, right-click the project, and then select Add > New Item.
In the Add New Item dialog box, select Class and change the Name field to GitHubIssueData.cs. Then, select Add.

The GitHubIssueData.cs file opens in the code editor. Add the following using directive to the top of GitHubIssueData.cs:
```
using Microsoft.ML.Data;
```
Remove the existing class definition and add the following code to the GitHubIssueData.cs file. This code has two classes, GitHubIssue and IssuePrediction.
```
public class GitHubIssue
{
    [LoadColumn(0)]
    public string? ID { get; set; }
    [LoadColumn(1)]
    public string? Area { get; set; }
    [LoadColumn(2)]
    public required string Title { get; set; }
    [LoadColumn(3)]
    public required string Description { get; set; }
}

public class IssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string? Area;
}
```
The label is the column you want to predict. The identified Features are the inputs you give the model to predict the Label.

Use the LoadColumnAttribute to specify the indices of the source columns in the data set.

GitHubIssue is the input dataset class and has the following String fields:
- The first column ID (GitHub Issue ID).
- The second column Area (the label to predict, which is known for the training data).
- The third column Title (GitHub issue title) is the first feature used for predicting the Area.
- The fourth column Description is the second feature used for predicting the Area.

IssuePrediction is the class used for prediction after the model has been trained. It has a single string field (Area) with a ColumnName attribute that maps it to the PredictedLabel column. PredictedLabel is used during prediction and evaluation. Evaluation uses the model, a labeled input dataset, and the predicted values.

All ML.NET operations start in the MLContext class. Initializing MLContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.

Initialize variables

Initialize the _mlContext global variable with a new instance of MLContext with a fixed random seed (seed: 0) for repeatable/deterministic results across multiple trainings. Replace the Console.WriteLine("Hello, World!") line with the following code:

_mlContext = new MLContext(seed: 0);

Load the data

ML.NET uses the IDataView interface as a flexible, efficient way of describing numeric or text tabular data. An IDataView can load data from text files or from real-time sources (for example, a SQL database or log files).

To initialize and load the _trainingDataView global variable so you can use it for the pipeline, add the following code after the _mlContext initialization:

_trainingDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_trainDataPath, hasHeader: true);

The LoadFromTextFile() method defines the data schema and reads in the file. It takes in the data path variable and returns an IDataView.
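
To make the schema step more concrete, the following sketch loads a tiny, made-up TSV file by using explicit TextLoader.Column definitions, which is roughly what the LoadColumn attributes on GitHubIssue declare. The file name, its rows, and the variable names (samplePath, sketchContext, sketchData) are assumptions for illustration only. The tutorial itself keeps using the generic LoadFromTextFile<GitHubIssue> overload.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical sample file, written to the temp folder so the sketch is self-contained.
string samplePath = Path.Combine(Path.GetTempPath(), "issues_sketch.tsv");
File.WriteAllLines(samplePath, new[]
{
    "ID\tArea\tTitle\tDescription",
    "1\tarea-System.Net\tSocket timeout\tHttpClient request times out",
    "2\tarea-System.Data\tSqlClient crash\tOpening a connection throws",
});

var sketchContext = new MLContext(seed: 0);

// Each TextLoader.Column maps a name and a type to a zero-based column index,
// mirroring [LoadColumn(0)] through [LoadColumn(3)] on the GitHubIssue class.
IDataView sketchData = sketchContext.Data.LoadFromTextFile(
    samplePath,
    columns: new[]
    {
        new TextLoader.Column("ID", DataKind.String, 0),
        new TextLoader.Column("Area", DataKind.String, 1),
        new TextLoader.Column("Title", DataKind.String, 2),
        new TextLoader.Column("Description", DataKind.String, 3),
    },
    separatorChar: '\t',
    hasHeader: true);

// The schema comes from the column definitions above.
foreach (var column in sketchData.Schema)
    Console.WriteLine($"{column.Name}: {column.Type}");
```

Both approaches return an IDataView. The attribute-based class is usually more convenient when you also need a typed input for PredictionEngine.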

Add the following after calling the LoadFromTextFile() method:

var pipeline = ProcessData();

The ProcessData method executes the following tasks:

- Extracts and transforms the data.
- Returns the processing pipeline.

Create the ProcessData method at the bottom of the Program.cs file using the following code:

IEstimator<ITransformer> ProcessData()
{

}

Extract features and transform the data

As you want to predict the Area GitHub label for a GitHubIssue, use the MapValueToKey() method to transform the Area column into a numeric key type Label column (a format accepted by classification algorithms) and add it as a new dataset column:

var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")

Next, call _mlContext.Transforms.Text.FeaturizeText, which transforms each text column (Title and Description) into a numeric vector. The new columns are called TitleFeaturized and DescriptionFeaturized. Append the featurization for both columns to the pipeline with the following code:

.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))

The last step in data preparation combines all of the feature columns into the Features column using the Concatenate() method. By default, a learning algorithm processes only features from the Features column. Append this transformation to the pipeline with the following code:

.Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))

Next, append a call to AppendCacheCheckpoint to cache the DataView, so that iterating over the data multiple times might be faster. Use the following code:

.AppendCacheCheckpoint(_mlContext);

Warning

Use AppendCacheCheckpoint for small/medium datasets to lower training time. Do NOT use it (remove .AppendCacheCheckpoint()) when handling very large datasets.

Return the pipeline at the end of the ProcessData method.

return pipeline;

This step handles preprocessing/featurization. Using additional components available in ML.NET can enable better results with your model.
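
As a rough illustration of what these transforms produce, the following sketch chains the same four steps and applies them to a few made-up rows, then inspects the resulting Label and Features columns. The sample rows, the temp file name, and the variable names (ctx, rows, featurePipeline, transformed) are assumptions for this sketch and aren't part of the tutorial code.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

var ctx = new MLContext(seed: 0);

// Made-up rows standing in for issues_train.tsv.
string tsv = Path.Combine(Path.GetTempPath(), "issues_features_sketch.tsv");
File.WriteAllLines(tsv, new[]
{
    "ID\tArea\tTitle\tDescription",
    "1\tarea-System.Net\tSocket timeout\tHttpClient request times out",
    "2\tarea-System.Data\tSqlClient crash\tOpening a connection throws",
    "3\tarea-System.Net\tDNS lookup slow\tName resolution takes seconds",
});

IDataView rows = ctx.Data.LoadFromTextFile(tsv, new[]
{
    new TextLoader.Column("ID", DataKind.String, 0),
    new TextLoader.Column("Area", DataKind.String, 1),
    new TextLoader.Column("Title", DataKind.String, 2),
    new TextLoader.Column("Description", DataKind.String, 3),
}, hasHeader: true);

// The same steps as ProcessData, chained in one expression.
var featurePipeline = ctx.Transforms.Conversion
    .MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
    .Append(ctx.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
    .Append(ctx.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
    .Append(ctx.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
    .AppendCacheCheckpoint(ctx);

// Fitting learns the label keys and the text vocabulary; Transform adds the new columns.
IDataView transformed = featurePipeline.Fit(rows).Transform(rows);

// Label is now a key type and Features is a numeric vector.
Console.WriteLine($"Label:    {transformed.Schema["Label"].Type}");
Console.WriteLine($"Features: {transformed.Schema["Features"].Type}");
```

Inspecting the schema this way can help confirm that the trainer will find the Label and Features columns it expects.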

Build and train the model

Add the following call to the BuildAndTrainModel method as the next line after the call to the ProcessData() method:

var trainingPipeline = BuildAndTrainModel(_trainingDataView, pipeline);

The BuildAndTrainModel method executes the following tasks:

- Creates the training algorithm class.
- Trains the model.
- Predicts the area for a sample issue.
- Returns the training pipeline.

Create the BuildAndTrainModel method, just after the declaration of the ProcessData() method, using the following code:

IEstimator<ITransformer> BuildAndTrainModel(IDataView trainingDataView, IEstimator<ITransformer> pipeline)
{

}

About the classification task

Classification is a machine learning task that uses data to determine the category, type, or class of an item or row of data and is frequently one of the following types:

- Binary: either A or B.
- Multiclass: multiple categories that can be predicted by using a single model.

For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary).

Append the machine learning algorithm to the data transformation definitions by adding the following as the first line of code in BuildAndTrainModel():

var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
        .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

SdcaMaximumEntropy is your multiclass classification training algorithm. It's appended to the pipeline and takes the featurized Title and Description (Features) and the Label as inputs to learn from the historic data.

Train the model

Fit the model to the trainingDataView data and store the trained model by adding the following as the next line of code in the BuildAndTrainModel() method:

_trainedModel = trainingPipeline.Fit(trainingDataView);

The Fit() method trains your model by transforming the dataset and applying the training.

The PredictionEngine is a convenience API that allows you to pass in and then perform a prediction on a single instance of data. Add this as the next line in the BuildAndTrainModel() method:

_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);
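
To show the difference between the untrained pipeline and the trained model, the following sketch fits a reduced pipeline on a few made-up rows and then scores them in batch with Transform(). It's only an illustration under stated assumptions: the sample rows, the temp file name, and the variable names (ml, recipe, model, scored) are hypothetical, only the Title column is featurized, and the predictions from such a tiny dataset aren't meaningful.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

var ml = new MLContext(seed: 0);

// Made-up training rows with just a label and a title.
string trainTsv = Path.Combine(Path.GetTempPath(), "issues_train_sketch.tsv");
File.WriteAllLines(trainTsv, new[]
{
    "Area\tTitle",
    "area-System.Net\tSocket connection times out",
    "area-System.Net\tHttpClient request hangs",
    "area-System.Data\tSqlClient query fails",
    "area-System.Data\tDataTable load throws",
});

IDataView trainRows = ml.Data.LoadFromTextFile(trainTsv, new[]
{
    new TextLoader.Column("Area", DataKind.String, 0),
    new TextLoader.Column("Title", DataKind.String, 1),
}, hasHeader: true);

// An estimator chain is only a recipe; nothing is learned until Fit is called.
var recipe = ml.Transforms.Conversion.MapValueToKey("Label", "Area")
    .Append(ml.Transforms.Text.FeaturizeText("Features", "Title"))
    .Append(ml.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
    .Append(ml.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

// Fit returns an ITransformer, which is the trained model.
ITransformer model = recipe.Fit(trainRows);

// Transform scores a whole IDataView at once and adds the PredictedLabel column.
IDataView scored = model.Transform(trainRows);
string[] predictedAreas = scored.GetColumn<ReadOnlyMemory<char>>("PredictedLabel")
    .Select(text => text.ToString())
    .ToArray();

Console.WriteLine(string.Join(", ", predictedAreas));
```

Transform() suits batch scoring, while PredictionEngine is more convenient for one typed input at a time.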

Predict with the trained model

Add a GitHub issue to test the trained model's prediction by creating an instance of GitHubIssue in the BuildAndTrainModel method:

GitHubIssue issue = new GitHubIssue() {
    Title = "WebSockets communication is slow in my machine",
    Description = "The WebSockets communication used under the covers by SignalR looks like is going slow in my development machine.."
};

Use the Predict() function to make a prediction on a single row of data:

var prediction = _predEngine.Predict(issue);

Use the model: Prediction results

Display GitHubIssue and corresponding Area label prediction in order to share the results and act on them accordingly. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine($"=============== Single Prediction just-trained-model - Result: {prediction.Area} ===============");

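Return the training pipeline

Return the training pipeline at the end of the BuildAndTrainModel method. The trained model itself is already stored in the _trainedModel global variable, which the Evaluate method uses later.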

return trainingPipeline;

Evaluate the model

Now that you've created and trained the model, you need to evaluate it with a different dataset for quality assurance and validation. The Evaluate method takes the schema of the training data and evaluates the model that BuildAndTrainModel stored in the _trainedModel global variable. Create the Evaluate method, just after BuildAndTrainModel, as in the following code:

void Evaluate(DataViewSchema trainingDataViewSchema)
{

}

The Evaluate method executes the following tasks:

- Loads the test dataset.
- Creates the multiclass evaluator.
- Evaluates the model and creates metrics.
- Displays the metrics.

Add a call to the new method, right under the BuildAndTrainModel method call, using the following code:

Evaluate(_trainingDataView.Schema);

As you did previously with the training dataset, load the test dataset by adding the following code to the Evaluate method:

var testDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_testDataPath, hasHeader: true);

The Evaluate() method computes the quality metrics for the model using the specified dataset. It returns a MulticlassClassificationMetrics object that contains the overall metrics computed by multiclass classification evaluators. To display the metrics to determine the quality of the model, you need to get them first. Notice the use of the Transform() method of the machine learning _trainedModel global variable (an ITransformer) to input the features and return predictions. Add the following code to the Evaluate method as the next line:

var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));

The following metrics are evaluated for multiclass classification:

- Micro Accuracy - Every sample-class pair contributes equally to the accuracy metric. You want Micro Accuracy to be as close to one as possible.
- Macro Accuracy - Every class contributes equally to the accuracy metric. Minority classes are given equal weight as the larger classes. You want Macro Accuracy to be as close to one as possible.
- Log-loss - Measures how far the predicted probabilities are from the true labels, based on the probability the model assigned to the true class. You want Log-loss to be as close to zero as possible.
- Log-loss reduction - Ranges from (-inf, 1.00], where 1.00 is perfect predictions and 0 indicates predictions no better than a baseline that always predicts the class proportions. You want Log-loss reduction to be as close to one as possible.
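
The following sketch computes the four metrics by hand on a toy example to show how they differ. It doesn't use ML.NET, and the labels, predictions, and probabilities are made up for illustration. The baseline used for log-loss reduction is assumed to be a model that always predicts each class with its observed frequency.

```csharp
using System;
using System.Linq;

// Toy evaluation set, made up for illustration: true labels, predicted labels,
// and the probability the model gave to the TRUE class of each row.
string[] actual = { "A", "A", "A", "B" };
string[] predicted = { "A", "A", "B", "B" };
double[] probabilityOfTrueClass = { 0.9, 0.8, 0.4, 0.7 };

// Micro accuracy: every row counts equally.
double microAccuracy = actual.Zip(predicted, (a, p) => a == p ? 1.0 : 0.0).Average();

// Macro accuracy: accuracy is computed per class, then the classes are averaged equally.
double macroAccuracy = actual
    .Zip(predicted, (a, p) => (Label: a, Correct: a == p ? 1.0 : 0.0))
    .GroupBy(row => row.Label)
    .Select(perClass => perClass.Average(row => row.Correct))
    .Average();

// Log-loss: mean negative natural log of the probability assigned to the true class.
double logLoss = -probabilityOfTrueClass.Average(p => Math.Log(p));

// Baseline: a "prior" model that always predicts each class with its observed frequency.
double priorLogLoss = -actual
    .Average(a => Math.Log(actual.Count(x => x == a) / (double)actual.Length));

// Log-loss reduction: relative improvement over that baseline.
double logLossReduction = (priorLogLoss - logLoss) / priorLogLoss;

Console.WriteLine($"MicroAccuracy:    {microAccuracy:0.###}");
Console.WriteLine($"MacroAccuracy:    {macroAccuracy:0.###}");
Console.WriteLine($"LogLoss:          {logLoss:0.###}");
Console.WriteLine($"LogLossReduction: {logLossReduction:0.###}");
```

In this example, micro accuracy is 0.75 (3 of 4 rows are correct), while macro accuracy is about 0.833 because class B (1 of 1 correct) counts as much as class A (2 of 3 correct).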

Display the metrics for model validation

Use the following code to display the metrics, share the results, and then act on them:

Console.WriteLine($"*************************************************************************************************************");
Console.WriteLine($"*       Metrics for Multi-class Classification model - Test Data     ");
Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");
Console.WriteLine($"*       MicroAccuracy:    {testMetrics.MicroAccuracy:0.###}");
Console.WriteLine($"*       MacroAccuracy:    {testMetrics.MacroAccuracy:0.###}");
Console.WriteLine($"*       LogLoss:          {testMetrics.LogLoss:#.###}");
Console.WriteLine($"*       LogLossReduction: {testMetrics.LogLossReduction:#.###}");
Console.WriteLine($"*************************************************************************************************************");

Save the model to a file

Once satisfied with your model, save it to a file to make predictions at a later time or in another application. Add the following code to the Evaluate method.

SaveModelAsFile(_mlContext, trainingDataViewSchema, _trainedModel);

Create the SaveModelAsFile method below your Evaluate method.

void SaveModelAsFile(MLContext mlContext, DataViewSchema trainingDataViewSchema, ITransformer model)
{

}

Add the following code to your SaveModelAsFile method. This code uses the Save method to serialize and store the trained model as a zip file.

mlContext.Model.Save(model, trainingDataViewSchema, _modelPath);

Deploy and Predict with a model

Add a call to the PredictIssue method, which you create in the next step, right under the Evaluate method call, using the following code:

PredictIssue();

Create the PredictIssue method, just after the Evaluate method (and just before the SaveModelAsFile method), using the following code:

void PredictIssue()
{

}

The PredictIssue method executes the following tasks:

- Loads the saved model.
- Creates a single issue of test data.
- Predicts area based on test data.
- Combines test data and predictions for reporting.
- Displays the predicted results.

Load the saved model into your application by adding the following code to the PredictIssue method:

ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);
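
As a minimal sketch of the save and load round trip, the following code fits a trivial transformer, saves it with the input schema, and loads it back. The single-column sample data, the temp file names, and the variable names (context, areas, fitted, reloaded) are assumptions for illustration. The tutorial saves the full trained model to _modelPath instead.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

var context = new MLContext(seed: 0);

// Made-up single-column data, just enough to fit a transformer.
string dataPath = Path.Combine(Path.GetTempPath(), "areas_sketch.tsv");
File.WriteAllLines(dataPath, new[] { "Area", "area-System.Net", "area-System.Data" });

IDataView areas = context.Data.LoadFromTextFile(
    dataPath,
    new[] { new TextLoader.Column("Area", DataKind.String, 0) },
    hasHeader: true);

ITransformer fitted = context.Transforms.Conversion.MapValueToKey("Label", "Area").Fit(areas);

// Save stores the transformer together with the schema of its expected input.
string zipPath = Path.Combine(Path.GetTempPath(), "model_sketch.zip");
context.Model.Save(fitted, areas.Schema, zipPath);

// Load returns the transformer and hands back the saved input schema.
ITransformer reloaded = context.Model.Load(zipPath, out DataViewSchema inputSchema);

Console.WriteLine("Input columns: " + string.Join(", ", inputSchema.Select(column => column.Name)));

// The reloaded transformer can be applied to data with the same schema.
IDataView reapplied = reloaded.Transform(areas);
Console.WriteLine("Has Label column: " + (reapplied.Schema.GetColumnOrNull("Label") != null));
```

Saving the input schema with the model lets a consuming application check what columns the model expects without access to the training code.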

Add a GitHub issue to test the loaded model's prediction in the PredictIssue method by creating an instance of GitHubIssue:

GitHubIssue singleIssue = new GitHubIssue() { Title = "Entity Framework crashes", Description = "When connecting to the database, EF is crashing" };

As you did previously, create a PredictionEngine instance with the following code:

_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(loadedModel);

The PredictionEngine is a convenience API that allows you to perform a prediction on a single instance of data. PredictionEngine is not thread-safe. It's acceptable to use in single-threaded or prototype environments. For improved performance and thread safety in production environments, use the PredictionEnginePool service, which creates an ObjectPool of PredictionEngine objects for use throughout your application. For details, see the guide on using PredictionEnginePool in an ASP.NET Core Web API.

Note

PredictionEnginePool service extension is currently in preview.

Use the PredictionEngine to predict the Area GitHub label by adding the following code to the PredictIssue method for the prediction:

var prediction = _predEngine.Predict(singleIssue);

Use the loaded model for prediction

Display Area in order to categorize the issue and act on it accordingly. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area} ===============");

Results

Your results should be similar to the following. As the pipeline runs, it displays messages. You might see warnings or processing messages. These messages have been removed from the following results for clarity.

=============== Single Prediction just-trained-model - Result: area-System.Net ===============
*************************************************************************************************************
*       Metrics for Multi-class Classification model - Test Data
*------------------------------------------------------------------------------------------------------------
*       MicroAccuracy:    0.738
*       MacroAccuracy:    0.668
*       LogLoss:          .919
*       LogLossReduction: .643
*************************************************************************************************************
=============== Single Prediction - Result: area-System.Data ===============

Congratulations! You've now successfully built a machine-learning model for classifying and predicting an Area label for a GitHub issue. You can find the source code for this tutorial at the dotnet/samples repository.

Next steps

In this tutorial, you learned how to:

- Prepare your data
- Transform the data
- Train the model
- Evaluate the model
- Predict with the trained model
- Deploy and predict with a loaded model

Advance to the next tutorial to learn more.

Taxi Fare Predictor