Article
08/07/2015

August 2013

Volume 28 Number 8

Test Run - Converting Numeric Data to Categorical Data

James McCaffrey A fundamental task in the field of machine learning is converting numeric data to categorical data. For example, if you have a data set of people’s heights in inches, such as 59.5, 64.0 and 75.5, you might want to convert this numeric data into categorical data, for example 0, 1, and 2, to represent short, medium, and tall. Informally, this process is sometimes called binning data. In machine-learning literature, the process is usually called discretization of continuous data.

There are several scenarios where you might want to discretize data. Many machine-learning algorithms, such as naive Bayes classification and prediction, work only with categorical data. So if your raw data is numeric and you want to apply naive Bayes, you have to discretize the data. You might also have mixed numeric and categorical data, such as the data often found in an Excel spreadsheet. Very few machine-learning algorithms work with mixed data, so you can convert the numeric data to categorical data and then use a machine-learning algorithm that works with categorical data. Data clustering using category utility is an example.

Perhaps because the topic isn’t very glamorous, there are few resources available that describe exactly how to implement discretization algorithms. In this article I’ll present a powerful discretization algorithm. Although the ideas aren’t new, to the best of my knowledge the implementation presented in this article hasn’t been published before.

The best way to see where this article is headed is to examine the demo program shown in Figure 1. The demo sets up 20 data points that represent people’s heights in inches. The raw data is shown in the histogram in Figure 2. The demo analyzes the data and creates a discretizer object, then shows an internal representation of the discretizer. The discretizer maintains a copy of the distinct values from the raw data in a sorted (from low values to high) array named data. The number of categories has been computed to be three and is stored in member variable k.

Figure 1 Converting Numeric Data to Categorical Data

Figure 2 The Raw Data to Categorize

Each of the data points has been assigned to one of the three categories. Each category is encoded as a zero-based integer, 0 through 2, and assignment information is stored in an array named clustering. The first three data points (60.0, 61.0, 62.0) have been assigned to category 0. The next four data points (66.0, 67.0, 68.0, 69.0) have been assigned to category 1, and the final four data points (73.0, 74.0, 76.0, 78.0) have been assigned to category 2. The arithmetic mean of the first three data points in category 0 is 61.00, and the means of the data points in categories 1 and 2 are 67.50 and 75.25, respectively.

After creating the discretizer object, the demo program sets up three existing data values (62.0, 66.0, 73.0) and three new data values (59.5, 75.5, 80.5). The point here is that sometimes you have a fixed data set, but in other scenarios, new data is generated dynamically and must be converted. The discretizer converts each of the six numeric data points to categorical data: 0, 1, 2, and 0, 2, 2, respectively.

This article assumes you have at least intermediate-level programming skills. I coded the discretization algorithm using C#, but you shouldn’t have too much trouble refactoring the code to another language such as Python or Visual Basic. I’ve omitted most normal error-checking to keep the size of the code small and the main ideas clear. The demo code is too long to present in its entirety in this article, but the entire source code is available at msdn.microsoft.com/magazine/msdnmag0813.

Not As Easy As It Seems

At first thought, converting numeric data to categorical data seems like an easy problem. One simple approach would be to divide the raw source data into equal intervals. For example, for the data in the demo and Figure 2, the range is 78.0 - 60.0 = 18.0. Dividing this by k = 3 intervals gives an interval width of 6.0 inches. So any data between 60.0 and 66.0 would be assigned to category 0, data between 66.0 and 72.0 would be assigned to category 1, and data between 72.0 and 78.0 would be assigned to category 2. The problem with this equal-interval approach is that it ignores natural breaks in the data. If you look at Figure 2, you’ll see that a good discretizing algorithm should break data somewhere between 63.0 and 65.0, not at 66.0.

A second naive approach to discretization would be to use equal frequency. For the example data, because there are 11 distinct data points, with k = 3 categories, and 11 / 3 = 3 (with integer division truncation), you could assign the first three data points to category 0, the second three data points to category 1, and the last five data points to category 2. The equal-frequency approach also ignores natural breaks in the data.

The discretization algorithm presented in this article uses a data-clustering approach. The raw data is clustered using a k-means algorithm (which does take into account natural breaks in the data) and then uses the means generated by the clustering algorithm to categorize new data. For example, in Figure 1, the three means are 61.00, 67.50 and 75.25. To associate numeric value 62.5 to a categorical value, the discretizer determines which of the three mean values is closest to 62.5 (which is 61.0) and then assigns the cluster value associated with 61.0 (which is 0) as the category value for 62.5.

K-Means Clustering

The k-means clustering algorithm is quite simple. There are many variations of the algorithm. In its most basic form, for a given set of data points and a given number of clusters k, the initialization process assigns each data point to a randomly selected cluster. Then the means of the data points in each of the clusters are computed. Next, each data point is scanned and reassigned to the cluster that has a mean that’s closest to the data point. The compute-means, reassign-cluster steps are repeated until no data point is reassigned to a new cluster.

Overall Program Structure

The demo program shown in Figure 1 is a single console application. I used Visual Studio 2012, but the demo has no significant dependencies and any version of Visual Studio with the Microsoft .NET Framework 2.0 or greater should work. I created a new C# console application and named it BinningData. After the template code loaded, in the Solution Explorer window, I renamed file Program.cs to the more descriptive BinningProgram.cs, and Visual Studio automatically renamed the program class. At the top of the source code, I deleted all namespace references except the ones to System and Collections.Generic.

The overall program structure, with some minor edits, is presented in Figure 3. The key calling statements can be summarized like so:

double[] rawData = new int[] { 66.0, 66.0, ... };
Discretizer d = new Discretizer(rawData);
double numericVal = 75.5;
int catVal = d.Discretize(numericVal);

Figure 3 The Demo Program Structure

using System;
using System.Collections.Generic;
namespace BinningData
{
  class BinningProgram
  {
    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin discretization of continuous data demo\n");
        double[] rawData = new double[20] {
          66, 66, 66, 67, 67, 67, 67, 68, 68, 69,
          73, 73, 73, 74, 76, 78,
          60, 61, 62, 62 };
        Console.WriteLine("Raw data:");
        ShowVector(rawData, 2, 10);
        Console.WriteLine("\nCreating a discretizer on the raw data");
        Discretizer d = new Discretizer(rawData);
        Console.WriteLine("\nDiscretizer creation complete");
        Console.WriteLine("\nDisplaying internal structure of the discretizer:\n");
        Console.WriteLine(d.ToString());
        Console.WriteLine("\nGenerating three existing and three new data values");
        double[] newData = new double[6] { 62.0, 66.0, 73.0, 59.5, 75.5, 80.5 };
        Console.WriteLine("\nData values:");
        ShowVector(newData, 2, 10);
        Console.WriteLine("\nDiscretizing the data:\n");
        for (int i = 0; i < newData.Length; ++i)
        {
          int cat = d.Discretize(newData[i]);
          Console.WriteLine(newData[i].ToString("F2") + " -> " + cat);
        }
        Console.WriteLine("\n\nEnd discretization demo");
        Console.ReadLine();
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
      }
    } // Main
    public static void ShowVector(double[] vector, int decimals,
      int itemsPerRow) { . . }
  } // Program
  public class Discretizer
  {
    public Discretizer(double[] rawData) { . . }
    private static double[] GetDistinctValues(double[] array) { . . }
    private static bool AreEqual(double x1, double x2) { . . }
    public int Discretize(double x) { . . }
    public override string ToString() { . . }
    private void InitializeClustering() { . . }
    private int[] GetInitialIndexes() { . . }
    private int InitialCluster(int di, int[] initialIndexes) { . . }
    private void Cluster() { . . }
    private bool ComputeMeans() { . . }
    private bool AssignAll() { . . }
    private int MinIndex(double[] distances) { . . }
    private static double Distance(double x1, double x2) { . . }
  }
} // ns

The Discretizer Class

The Discretizer class has four data members:

private double[] data;
private int k;
private double[] means;
private int[] clustering;

The Discretizer constructor uses numeric data to enable a Discretize method that accepts a numeric value, and which returns a zero-based integer categorical value. Notice that the Discretizer determines the number of categories automatically.

Array data holds the distinct values from the raw data and is used to create the clustering. Integer k is the number of clusters to assign the data to, which is also the number of data categories. The array named means has size k and holds the arithmetic means of the data points assigned to each cluster at any given time during the execution of the clustering algorithm.

The array named clustering encodes how the data is clustered at any given point in time. The index of array clustering indicates the index of a data point stored in array data, and the value in array clustering indicates the current cluster assignment. For example, if clustering[9] = 2, then the data point at data[9] is assigned to cluster 2.

The Discretizer Constructor

The Discretizer constructor is defined as:

public Discretizer(double[] rawData)
{
  double[] sortedRawData = new double[rawData.Length];
  Array.Copy(rawData, sortedRawData, rawData.Length);
  Array.Sort(sortedRawData);
  this.data = GetDistinctValues(sortedRawData);
  this.clustering = new int[data.Length];
  this.k = (int)Math.Sqrt(data.Length); // heuristic
  this.means = new double[k];
  this.Cluster();
}

The first step is to extract the distinct values from the raw data. There are several ways to do this. The code here sorts a copy of the raw data then calls a helper method, GetDistinctValues. Once the distinct values have been determined, the clustering array can be allocated.

Here’s method GetDistinctValues:

private static double[] GetDistinctValues(double[] array)
{
  List<double> distinctList = new List<double>();
  distinctList.Add(array[0]);
  for (int i = 0; i < array.Length - 1; ++i)
    if (AreEqual(array[i], array[i + 1]) == false)
      distinctList.Add(array[i+1]);
  double[] result = new double[distinctList.Count];
  distinctList.CopyTo(result);
  return result;
}

Because the source data has been sorted, the method can perform a single scan looking for instances where two consecutive values aren’t equal. The raw data is type double, which means that comparing two values for exact equality can be dicey, so a helper method, AreEqual, is used:

private static bool AreEqual(double x1, double x2)
{
  if (Math.Abs(x1 - x2) < 0.000001) return true;
  else return false;
}

Method AreEqual uses an arbitrary closeness threshold of 0.000001. You might want to pass this value into the Discretizer object as an input parameter. A variable named epsilon is often used in this scenario.

After extracting the distinct values from the raw data, the next step is to determine k, the number of clusters, which is also the number of categories. Here a rule of thumb is used and k becomes the square root of the number of data items. An alternative is to write the constructor so that the value of k is passed in as a parameter. Determining the optimal value of k is essentially an unsolved machine-learning research problem.

After the value of k has been computed, the constructor allocates space for array means, and then calls the Cluster method. That method performs k-means clustering on the data and the values in the final means array can be used to assign a category value to any numeric value.

The Clustering Algorithm

The heart of the Discretizer class is the code that performs k-means clustering. The Cluster method is listed in Figure 4.

Figure 4 The Cluster Method

private void Cluster()
{
  InitializeClustering();
  ComputeMeans();
  bool changed = true; bool success = true;
  int ct = 0;
  int maxCt = data.Length * 10; // Heuristic
  while (changed == true && success == true && ct < maxCt) {
    ++ct;
    changed = AssignAll();
    success = ComputeMeans();
  }
}

Method Cluster is relatively short because it farms out all of the hard work to helper methods. Method InitializeClustering assigns all data points to an initial cluster. Then the means of the data points assigned to each cluster are computed using the clustering assignments.

Inside the main clustering algorithm loop, all data points are assigned to a cluster by method AssignAll. Method AssignAll calls helper methods Distance and MinIndex. Method Distance defines the distance between two data points:

private static double Distance(double x1, double x2)
{
  return Math.Sqrt((x1 - x2) * (x1 - x2));
}

Here, Euclidean distance (defined as the square root of the squared difference) is used. Because the data points are single values rather than vectors, the Euclidean distance is equivalent to Math.Abs(x1 - x2), so you might want to use this simpler computation.

The loop exits when there’s no change in the clustering array, indicated by the return value of AssignAll, or when the means array can’t be computed because the count of values assigned to a cluster is zero, or when a maximum loop counter value is reached. Here, maxCt is arbitrarily assigned a value of 10 times the number of data points. In general, the clustering algorithm here converges extremely quickly, and a loop exit condition of reaching maxCt is likely due to a logic error, so you might want to check for this.

Because the clustering process repeatedly reassigns values to clusters, it’s possible that the number of values assigned to a cluster could become zero, making a mean impossible to compute. Helper method ComputeMeans attempts to compute all k means but returns false if a count is zero. The method is presented in Figure 5.

Figure 5 The ComputeMeans Method

private bool ComputeMeans()
{
  double[] sums = new double[k];
  int[] counts = new int[k];
  for (int i = 0; i < data.Length; ++i)
  {
    int c = clustering[i]; // Cluster ID
    sums[c] += data[i];
    counts[c]++;
  }
  for (int c = 0; c < sums.Length; ++c)
  {
    if (counts[c] == 0)
      return false; // fail
    else
      sums[c] = sums[c] / counts[c];
  }
  sums.CopyTo(this.means, 0);
  return true; // Success
}

Initializing the Clustering

The clustering initialization process is a bit tricky. Suppose the data consists of 11 sorted values as shown in Figure 1, and k, the number of clusters, has been set to three. After initialization, the goal is for array member clustering to have three 0-values in cells 0 through 2, three 1-values in cells 3 through 5, and the remaining five 2-values in cells 6 through 10. In other words, the clustering should be evenly distributed by frequency.

The first step is to generate border values of {3, 6, 9}, which implicitly define intervals of 0-2, 3-5 and 6-greater. This is done by helper method GetInitialIndexes, which just divides the number of data points by the number of clusters:

private int[] GetInitialIndexes()
{
  int interval = data.Length / k;
  int[] result = new int[k];
  for (int i = 0; i < k; ++i)
    result[i] = interval * (i + 1);
  return result;
}

The second step is to define a helper method that computes the cluster value for a given data index value, using the border values:

private int InitialCluster(int di, int[] initialIndexes)
{
  for (int i = 0; i < initialIndexes.Length; ++i)
    if (di < initialIndexes[i])
      return i;
  return initialIndexes.Length - 1; // Last cluster
}

The third step is to assign all data indexes to a cluster:

private void InitializeClustering()
{
  int[] initialIndexes = GetInitialIndexes();
  for (int di = 0; di < data.Length; ++di)
  {
    int c = InitialCluster(di, initialIndexes);
    clustering[di] = c;
  }
}

In essence, the initialization process is the equal-frequency approach described previously in this article.

The Discretize Method

After the data has been clustered, the final values in the means array can be used to assign a zero-based categorical value to a numeric value. Method Discretize is:

public int Discretize(double x)
{
  double[] distances = new double[k];
  for (int c = 0; c < k; ++c)
    distances[c] = Distance(x, data[means[c]]);
  return MinIndex(distances);
}

The method computes the distance from the input value x to each of the k means and then returns the index of the closest mean, which is a cluster ID and is also a categorical value. For example, if the final means are 61.00, 67.50 and 75.25, and x is 70.00, the distance from x to mean[0] = sqrt((70.00 - 61.00)^2) = sqrt(81.00) = 9.00. Similarly, mean[1] = sqrt((70.00 - 67.50)^2) = 2.50, and mean[2] = sqrt((70.00 - 75.25)^2) = 5.25. The smallest distance is 2.50, which is at index [1], so 70.00 is assigned to categorical value 1.

Wrapping Up

The code presented in this article can be used as is to provide you with high-quality numeric-to-categorical data conversion for machine learning. You might want to encapsulate the Discretizer class in a class library rather than embedding the code directly into an application.

The primary feature you may want to customize is the determination of the number of categories, k. One possibility is to set a threshold value. Below the threshold, each data point generates a category value. For example, suppose you’re dealing with people’s ages. Suppose these ages range from 1 to 120. With only 120 distinct possible values, instead of computing k as the square root of 120 (which would give you 10 categories), you could just set k equal to 120.

Dr. James McCaffrey works for Microsoft at the company’s Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Richard Hughes (Microsoft Research)