February 2013

Volume 28 Number 02

Test Run - Naive Bayes Classification with C#

By James McCaffrey

Naive Bayes classification is a machine-learning technique that can be used to predict to which category a particular data case belongs. In this article I explain how Naive Bayes classification works and present an example coded with the C# language.

There are plenty of standalone tools available that can perform Naive Bayes classification. However, these tools can be difficult or impossible to integrate directly into your application, and difficult to customize to meet specific needs. And they might have hidden copyright issues. This article will give you a solid foundation for adding Naive Bayes classification features to a .NET application, without relying on any external dependencies.

The best way to understand what Naive Bayes classification is and to see where I’m headed in the article is to examine the screenshot of a demo program in Figure 1. The demo program begins by generating 40 lines of data that will be used to train the classifier. In most cases you’d be using an existing data source, but I generated dummy data to keep the demo simple. The first line of data is “administrative,right,72.0,female.” The first field is an occupation, the second is hand dominance, the third is height in inches and the fourth is sex. The goal of the classifier is to predict sex from a given set of values for occupation, dominance and height. Because the dependent variable sex has two possible values, this is an example of binary classification.

Figure 1 Naive Bayes Classification Demo

After generating raw data, the demo program converts each numeric height field to a category—short, medium or tall—by binning height. As I’ll explain, binning numeric data into categorical data is an approach that has pros and cons. After the training data has been binned, the demo program scans the 40 lines of categorical data and computes joint counts. For example, the number of data cases where the person’s occupation is administrative and the person’s sex is male is 2. Additionally, the total numbers of each dependent value (the attribute to be predicted, male or female in this example) are computed. You can see that there are 24 males and 16 females in the training data.

The demo program then has all the information needed to classify the sex of a new data case where the occupation is education, the dominance is right and the height is tall. In this example, the demo determines that the probability the data case is male is 0.3855 and the probability it is female is 0.6145, so the system concludes the data case is most likely female.

In the sections that follow I’ll first explain exactly how Naive Bayes classification works, walk you through the code in the demo program, and describe how to modify the demo to meet your own needs. This article assumes you have at least beginning programming skills with a C-family language, but doesn’t assume you know anything about Naive Bayes classification. The code for the demo program is a bit too long to present in its entirety here, but the complete source is available from the MSDN download site at msdn.com/magazine/msdnmag0213.

How Naive Bayes Classification Works

Using the example shown in Figure 1, the goal is to predict the sex (male or female) of a person whose occupation is education, who is right-handed and whose height is tall (greater than or equal to 71.0 inches). To do this, we can compute the probability that the person is male given that information, and the probability that the person is female given the information, and then predict the sex with the larger probability. Expressed symbolically, we want to know P(male | X), usually read as "the probability of male given independent variable values X," and P(female | X), where X is (education, right, tall). The term "naive" in Naive Bayes means that all X attributes are assumed to be mathematically independent, which greatly simplifies classification. You can find many online references that explain the rather interesting mathematics behind Naive Bayes classification, but the result is relatively simple. Symbolically:

P(male | X) =
  [ P(education | male) * P(right | male) * P(tall | male) * P(male) ] /
    [ PP(male | X) + PP(female | X) ]

Notice the equation is a fraction. The numerator, sometimes loosely called a partial probability, consists of four terms multiplied together. In this article I use the nonstandard notation of PP for a partial probability term. The denominator is the sum of two terms, one of which is the numerator. The first piece to compute is P(education | male), or the probability that a person’s occupation is education, given that he is male. This, as it turns out, can be estimated by the count of training cases where occupation is education and sex is male, divided by the number of cases that are male (with any occupation), so:

P(education | male) = count(education & male) / count(male) = 2/24 = 0.0833

Using the same logic:

P(right | male) = count(right & male) / count(male) = 17/24 = 0.7083
P(tall | male) = count(tall & male) / count(male) = 4/24 = 0.1667

The next piece of the puzzle is P(male). In Naive Bayes terminology, this is called a prior. There’s some debate about how best to compute priors. On the one hand, we can hypothesize that there’s no reason to believe that the presence of males is more or less likely than the presence of females and so assign 0.5 to P(male). On the other hand, we can use the fact that the training data has 24 males and 16 females and estimate a probability of 24/40 = 0.6000 for P(male). I prefer this approach, where priors are estimated using training data.

Now, if you refer to the earlier equation for P(male | X), you'll note that it contains the term PP(female | X). The denominator sum, PP(male | X) + PP(female | X), is sometimes called the evidence. The pieces for PP(female | X) are computed like so:

P(education | female) = count(education & female) / count(female) = 4/16 = 0.2500
P(right | female) = count(right & female) / count(female) = 14/16 = 0.8750
P(tall | female) = count(tall & female) / count(female) = 2/16 = 0.1250
P(female) = 16/40 = 0.4000

So the partial probability numerator for P(male | X) is:

PP(male | X) = 0.0833 * 0.7083 * 0.1667 * 0.6000 = 0.005903

Using the same logic, the partial probability for female given X = (education, right, tall) is:

PP(female | X) = 0.2500 * 0.8750 * 0.1250 * 0.4000 = 0.010938

And, finally, the overall probabilities of male and female are:

P(male | X) = 0.005903 / (0.005903 + 0.010938) = 0.3505
P(female | X) = 0.010938 / (0.005903 + 0.010938) = 0.6495

These overall probabilities are sometimes called the posteriors. Because P(female | X) is greater than P(male | X), the system concludes the sex of the unknown person is female. But wait. These two probabilities, 0.3505 and 0.6495, are close to but definitely not the same as the two probabilities, 0.3855 and 0.6145, shown in Figure 1. The reason for this discrepancy is that the demo program uses an important optional modification of basic Naive Bayes called Laplacian smoothing.

Laplacian Smoothing

If you refer to Figure 1, you’ll see that the count of training cases in which the person has occupation = construction and sex = female is 0. In the demo, the X values are (education, right, tall), which doesn’t include construction. But suppose X had been (construction, right, tall). In the computation of PP(female | X) it would be necessary to compute P(construction | female) = count(construction & female) / count(female), which would be 0, and which in turn would zero-out the entire partial probability. In short, it’s bad when a joint count is 0. The most common technique to avoid this situation is to simply add 1 to all joint counts. This has the feel of a hack but, in fact, has a solid mathematical basis. The technique is called add-one smoothing, which is a specific kind of Laplacian smoothing.

With Laplacian smoothing, if X = (education, right, tall) as in the previous section, P(male | X) and P(female | X) are calculated as follows:

P(education | male) =
  (count(education & male) + 1) / (count(male) + 3) = 3/27 = 0.1111
P(right | male) =
  (count(right & male) + 1) / (count(male) + 3) = 18/27 = 0.6667
P(tall | male) =
  (count(tall & male) + 1) / (count(male) + 3) = 5/27 = 0.1852
P(male) = 24/40 = 0.6000
P(education | female) =
  (count(education & female) + 1) / (count(female) + 3) = 5/19 = 0.2632
P(right | female) =
  (count(right & female) + 1) / (count(female) + 3) = 15/19 = 0.7895
P(tall | female) =
  (count(tall & female) + 1) / (count(female) + 3) = 3/19 = 0.1579
P(female) = 16/40 = 0.4000

The partial probabilities are:

PP(male | X) = 0.1111 * 0.6667 * 0.1852 * 0.6000 = 0.008230
PP(female | X) = 0.2632 * 0.7895 * 0.1579 * 0.4000 = 0.013121

And so the two final probabilities are:

P(male | X) = 0.008230 / (0.008230 + 0.013121) = 0.3855
P(female | X) = 0.013121 / (0.008230 + 0.013121) = 0.6145

These are the values shown in the screenshot in Figure 1. Notice that 1 is added to each joint count but that 3 is added to denominators count(male) and count(female). The 3 is to some extent arbitrary in the sense that Laplacian smoothing doesn’t specify any particular value to be used. In this case, it’s the number of X attributes (occupation, dominance, height). This is the most common value to add to denominators of partial probabilities in Laplacian smoothing, but you may wish to experiment with other values. The value to add to the denominator is often given the symbol k in math literature on Naive Bayes. Also, notice that the priors, P(male) and P(female), are typically not modified in Naive Bayes Laplacian smoothing.
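
If you want to experiment with the smoothing constant, the smoothed conditional-probability calculation can be isolated in a tiny helper. The method below is a minimal sketch and isn't part of the demo program; parameter k is the value added to the denominator (3 in this article's example):

// Hypothetical helper, not in the demo: estimates P(attrValue | depValue)
// with add-one (Laplacian) smoothing. jointCount is count(attrValue & depValue),
// depCount is count(depValue), k is the denominator smoothing constant.
static double SmoothedConditional(int jointCount, int depCount, int k)
{
  return (jointCount + 1) / ((depCount + k) * 1.0);
}

For example, SmoothedConditional(2, 24, 3) returns 3/27 = 0.1111, which matches the smoothed P(education | male) computed above.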

Overall Program Structure

The demo program shown running in Figure 1 is a single C# console application. The Main method, with some WriteLine statements removed, is listed in Figure 2.

Figure 2 Naive Bayes Program Structure

using System;
namespace NaiveBayes
{
  class Program
  {
    static Random ran = new Random(25); // Arbitrary
    static void Main(string[] args)
    {
      try
      {
        string[] attributes = new string[] { "occupation", "dominance",
          "height", "sex"};
        string[][] attributeValues = new string[attributes.Length][];
        attributeValues[0] = new string[] { "administrative",
          "construction", "education", "technology" };
        attributeValues[1] = new string[] { "left", "right" };
        attributeValues[2] = new string[] { "short", "medium", "tall" };
        attributeValues[3] = new string[] { "male", "female" };
        double[][] numericAttributeBorders = new double[1][];
        numericAttributeBorders[0] = new double[] { 64.0, 71.0 };
        string[] data = MakeData(40);
        for (int i = 0; i < 4; ++i)
          Console.WriteLine(data[i]);
        string[] binnedData = BinData(data, attributeValues,
          numericAttributeBorders);
        for (int i = 0; i < 4; ++i)
          Console.WriteLine(binnedData[i]);
        int[][][] jointCounts = MakeJointCounts(binnedData, attributes,
          attributeValues);
        int[] dependentCounts = MakeDependentCounts(jointCounts, 2);
        Console.WriteLine("Total male = " + dependentCounts[0]);
        Console.WriteLine("Total female = " + dependentCounts[1]);
        ShowJointCounts(jointCounts, attributeValues);
        string occupation = "education";
        string dominance = "right";
        string height = "tall";
        bool withLaplacian = true;
        Console.WriteLine(" occupation = " + occupation);
        Console.WriteLine(" dominance = " + dominance);
        Console.WriteLine(" height = " + height);
        int c = Classify(occupation, dominance, height, jointCounts,
          dependentCounts, withLaplacian, 3);
        if (c == 0)
          Console.WriteLine("\nData case is most likely male");
        else if (c == 1)
          Console.WriteLine("\nData case is most likely female");
        Console.WriteLine("\nEnd demo\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // End Main
    // Methods to create data
    // Method to bin data
    // Method to compute joint counts
    // Helper method to compute partial probabilities
    // Method to classify a data case
  } // End class Program
}

The program begins by setting up the hardcoded X attributes occupation, dominance, and height, and the dependent attribute sex. In some situations you may prefer to scan your existing data source to determine the attributes, especially when the source is a data file with headers or a SQL table with column names. The demo program also specifies the nine categorical X attribute values: (administrative, construction, education, technology) for occupation; (left, right) for dominance; and (short, medium, tall) for height. In this example there are two dependent variable attribute values: (male, female) for sex. Again, you may want to programmatically determine attribute values by scanning your data.
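
For instance, a helper along the following lines could collect the distinct values of one attribute from the raw comma-delimited data. This is a sketch and not part of the demo program; it assumes the same comma-delimited record format used by MakeData and requires a using directive for System.Collections.Generic:

// Hypothetical helper: returns the distinct values found in one column
// (attribute) of comma-delimited data, in order of first appearance.
static string[] GetDistinctValues(string[] data, int column)
{
  List<string> values = new List<string>();
  for (int i = 0; i < data.Length; ++i)
  {
    string token = data[i].Split(',')[column];
    if (values.Contains(token) == false)
      values.Add(token);
  }
  return values.ToArray();
}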

The demo sets up hardcoded boundary values of 64.0 and 71.0 to bin the numeric height values so that values less than or equal to 64.0 are categorized as short; heights between 64.0 and 71.0 are medium; and heights greater than or equal to 71.0 are tall. When binning numeric data for Naive Bayes, the number of boundary values will be one less than the number of categories. In this example, the 64.0 and 71.0 were determined by scanning the training data for minimum and maximum height values (57.0 and 78.0), computing the difference, 21.0, and then computing the interval size by dividing by the number of height categories, 3, giving 7.0. In most situations, you'll want to determine the boundary values for numeric X attributes programmatically rather than manually.
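
A sketch of how those boundary values might be computed programmatically with the equal-width approach just described follows. This helper is hypothetical and not part of the demo program; it assumes the numeric attribute is stored as a parsable column in the comma-delimited data:

// Hypothetical helper: computes equal-width bin boundaries for a numeric
// column. For numBins categories there are numBins - 1 boundary values.
static double[] MakeBinBorders(string[] data, int column, int numBins)
{
  double min = double.MaxValue;
  double max = double.MinValue;
  for (int i = 0; i < data.Length; ++i)
  {
    double v = double.Parse(data[i].Split(',')[column]);
    if (v < min) min = v;
    if (v > max) max = v;
  }
  double width = (max - min) / numBins; // (78.0 - 57.0) / 3 = 7.0 in the demo
  double[] borders = new double[numBins - 1];
  for (int k = 0; k < numBins - 1; ++k)
    borders[k] = min + width * (k + 1); // 64.0 and 71.0 in the demo
  return borders;
}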

The demo program calls a helper method MakeData to generate semi-random training data. MakeData calls helpers MakeSex, MakeOccupation, MakeDominance and MakeHeight. These helpers skew the data so that, for example, male occupations are more likely to be construction or technology, male dominance is more likely to be right, and male height is most likely to be between 66.0 and 72.0 inches.
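
The data-generation helpers aren't listed in this article. Purely for illustration, a MakeHeight along the following lines would produce data with roughly those characteristics; the exact ranges and probabilities here are assumptions and may not match the helper in the downloadable demo:

// Illustrative sketch only; the actual MakeHeight in the demo download may
// differ. Heights are usually drawn from a sex-typical range and occasionally
// from the wider 57.0 - 78.0 range seen in the training data. Uses the static
// Random field ran declared in Figure 2.
static double MakeHeight(string sex)
{
  if (ran.NextDouble() < 0.80) // Typical range most of the time
  {
    double lo = (sex == "male") ? 66.0 : 60.0;
    return lo + 6.0 * ran.NextDouble();
  }
  return 57.0 + 21.0 * ran.NextDouble(); // Occasional atypical height
}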

The key methods called in Main are BinData to categorize height data; MakeJointCounts to scan binned data and compute the joint counts; MakeDependentCounts to compute total number of males and females; and Classify, which uses joint counts and dependent counts to perform a Naive Bayes classification.

Binning Data

Method BinData is listed in Figure 3. The method accepts an array of comma-delimited strings where each string looks like “education,left,67.5,male.” In many situations, you’ll be reading training data from a text file where each line is a string. The method uses String.Split to parse each string into tokens. Token[2] is the height. It’s converted from a string into type double using the double.Parse method. The numeric height is compared against the boundary values until the height’s interval is found, and then the corresponding height category as a string is determined. A result string is stitched together using the old tokens, comma delimiters and the new computed-height category string.

Figure 3 Method BinData for Categorizing Height

static string[] BinData(string[] data, string[][] attributeValues,
  double[][] numericAttributeBorders)
{
  string[] result = new string[data.Length];
  string[] tokens;
  double heightAsDouble;
  string heightAsBinnedString;
  for (int i = 0; i < data.Length; ++i)
  {
    tokens = data[i].Split(',');
    heightAsDouble = double.Parse(tokens[2]);
    if (heightAsDouble <= numericAttributeBorders[0][0]) // Short
      heightAsBinnedString = attributeValues[2][0];
    else if (heightAsDouble >= numericAttributeBorders[0][1]) // Tall
      heightAsBinnedString = attributeValues[2][2];
    else
      heightAsBinnedString = attributeValues[2][1]; // Medium
    string s = tokens[0] + "," + tokens[1] + "," + heightAsBinnedString +
      "," + tokens[3];
    result[i] = s;
  }
  return result;
}

It’s not a requirement to bin numeric data when performing Naive Bayes classification. Naive Bayes can deal with numeric data directly, but those techniques are outside the scope of this article. Binning data has the advantages of simplicity and of avoiding the need to make any explicit assumptions about the mathematical distribution (such as Gaussian or Poisson) of the data. However, binning essentially loses information and requires you to determine how many categories to divide the data into.

Determining Joint Counts

The key to Naive Bayes classification is computing joint counts. In the demo example, there are nine total independent X attribute values (administrative, construction, … tall) and two dependent attribute values (male, female), so a total of 9 * 2 = 18 joint counts must be computed and stored. My preferred approach is to store joint counts in a three-dimensional array int[][][] jointCounts. The first index indicates the independent X attribute; the second index indicates the independent X attribute value; and the third index indicates the dependent attribute value. For example, jointCounts[0][3][1] means attribute 0 (occupation), attribute value 3 (technology) and sex 1 (female), or in other words the value at jointCounts[0][3][1] is the count of training cases where occupation is technology and sex is female. Method MakeJointCounts is listed in Figure 4.

Figure 4 Method MakeJointCounts

static int[][][] MakeJointCounts(string[] binnedData, string[] attributes,
  string[][] attributeValues)
{
  int[][][] jointCounts = new int[attributes.Length - 1][][]; // -1 (no sex)
  jointCounts[0] = new int[4][]; // 4 occupations
  jointCounts[1] = new int[2][]; // 2 dominances
  jointCounts[2] = new int[3][]; // 3 heights
  jointCounts[0][0] = new int[2]; // 2 sexes for administrative
  jointCounts[0][1] = new int[2]; // construction
  jointCounts[0][2] = new int[2]; // education
  jointCounts[0][3] = new int[2]; // technology
  jointCounts[1][0] = new int[2]; // left
  jointCounts[1][1] = new int[2]; // right
  jointCounts[2][0] = new int[2]; // short
  jointCounts[2][1] = new int[2]; // medium
  jointCounts[2][2] = new int[2]; // tall
  for (int i = 0; i < binnedData.Length; ++i)
  {
    string[] tokens = binnedData[i].Split(',');
    int occupationIndex = AttributeValueToIndex(0, tokens[0]);
    int dominanceIndex = AttributeValueToIndex(1, tokens[1]);
    int heightIndex = AttributeValueToIndex(2, tokens[2]);
    int sexIndex = AttributeValueToIndex(3, tokens[3]);
    ++jointCounts[0][occupationIndex][sexIndex];
    ++jointCounts[1][dominanceIndex][sexIndex];
    ++jointCounts[2][heightIndex][sexIndex];
  }
  return jointCounts;
}

The implementation has many hardcoded values to make it easier to understand. For example, these three statements could be replaced by a single for loop that allocates space using Length properties in array attributeValues:

jointCounts[0] = new int[4][]; // 4 occupations
jointCounts[1] = new int[2][]; // 2 dominances
jointCounts[2] = new int[3][]; // 3 heights
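
A sketch of that loop, assuming the attributes and attributeValues arrays defined in Main are passed in or otherwise visible:

// Sketch: allocate the first dimension of jointCounts from attributeValues
// instead of hardcoding. The last attribute (sex) is the dependent
// attribute, so it's excluded.
for (int i = 0; i < attributes.Length - 1; ++i)
  jointCounts[i] = new int[attributeValues[i].Length][];

The nine inner allocations could be generalized in the same way with a nested loop over attributeValues[i].Length.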

Helper function AttributeValueToIndex accepts an attribute index and an attribute value string and returns the appropriate index. For example, AttributeValueToIndex(2, “medium”) returns the index of “medium” in the height attribute, which is 1.
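
The helper itself isn't listed in this article. A minimal sketch, assuming the same hardcoded attribute values set up in Main (the version in the downloadable demo may be organized differently), looks like this:

// Sketch of the lookup helper: maps an attribute value string to its index
// within the specified attribute; returns -1 if the value isn't found.
static int AttributeValueToIndex(int attribute, string value)
{
  string[][] attributeValues = new string[4][];
  attributeValues[0] = new string[] { "administrative", "construction",
    "education", "technology" };
  attributeValues[1] = new string[] { "left", "right" };
  attributeValues[2] = new string[] { "short", "medium", "tall" };
  attributeValues[3] = new string[] { "male", "female" };
  for (int i = 0; i < attributeValues[attribute].Length; ++i)
    if (attributeValues[attribute][i] == value)
      return i;
  return -1; // Not found
}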

The demo program uses a method MakeDependentCounts to determine the number of male and the number of female data cases. There are several ways to do this. If you refer to Figure 1, you'll observe that one approach is to sum the joint counts across the values of any one of the three attributes. For example, the number of males is the sum of count(administrative & male), count(construction & male), count(education & male) and count(technology & male):

static int[] MakeDependentCounts(int[][][] jointCounts,
  int numDependents)
{
  int[] result = new int[numDependents];
  for (int k = 0; k < numDependents; ++k) // Male then female
    for (int j = 0; j < jointCounts[0].Length; ++j) // Scan attribute 0 (occupation)
      result[k] += jointCounts[0][j][k];
  return result;
}

Classifying a Data Case

Method Classify, shown in Figure 5, is short because it relies on helper methods.

Figure 5 Method Classify

static int Classify(string occupation, string dominance, string height,
  int[][][] jointCounts, int[] dependentCounts, bool withSmoothing,
  int xClasses)
{
  double partProbMale = PartialProbability("male", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double partProbFemale = PartialProbability("female", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double evidence = partProbMale + partProbFemale;
  double probMale = partProbMale / evidence;
  double probFemale = partProbFemale / evidence;
  if (probMale > probFemale) return 0;
  else return 1;
}

Method Classify accepts the X attribute values to classify; the jointCounts and dependentCounts arrays; a Boolean flag that indicates whether or not to use Laplacian smoothing; and parameter xClasses, which in this example is 3 because there are three independent variables (occupation, dominance, height). This parameter could also be inferred from the jointCounts parameter.

Method Classify returns an int that represents the index of the predicted dependent variable. Instead, you might want to return an array of probabilities for each dependent variable. Notice that the classification is based on probMale and probFemale, both of which are computed by dividing partial probabilities by the evidence value. You might want to simply omit the evidence term and just compare the values of the partial probabilities by themselves.

Method Classify returns the index of the dependent variable that has the largest probability. An alternative is to supply a threshold value. For example, suppose probMale is 0.5001 and probFemale is 0.4999. You may wish to consider these values too close to call and return a classification value representing “undetermined.”
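
One way to combine both ideas is sketched below: a hypothetical variant of Classify (not part of the demo) that exposes the two posterior probabilities through an out parameter and returns -1 when the result is too close to call:

// Sketch: like Classify, but returns both posteriors via an out parameter
// and signals "undetermined" (-1) when the probabilities differ by less
// than the supplied threshold.
static int ClassifyWithThreshold(string occupation, string dominance,
  string height, int[][][] jointCounts, int[] dependentCounts,
  bool withSmoothing, int xClasses, double threshold, out double[] probs)
{
  double ppMale = PartialProbability("male", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double ppFemale = PartialProbability("female", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double evidence = ppMale + ppFemale;
  probs = new double[] { ppMale / evidence, ppFemale / evidence };
  if (Math.Abs(probs[0] - probs[1]) < threshold)
    return -1; // Too close to call
  return (probs[0] > probs[1]) ? 0 : 1;
}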

Method PartialProbability does most of the work for Classify and is listed in Figure 6.

Figure 6 Method PartialProbability

static double PartialProbability(string sex, string occupation, string dominance,
  string height, int[][][] jointCounts, int[] dependentCounts,
  bool withSmoothing, int xClasses)
{
  int sexIndex = AttributeValueToIndex(3, sex);
  int occupationIndex = AttributeValueToIndex(0, occupation);
  int dominanceIndex = AttributeValueToIndex(1, dominance);
  int heightIndex = AttributeValueToIndex(2, height);
  int totalMale = dependentCounts[0];
  int totalFemale = dependentCounts[1];
  int totalCases = totalMale + totalFemale;
  int totalToUse = 0;
  if (sex == "male") totalToUse = totalMale;
  else if (sex == "female") totalToUse = totalFemale;
  double p0 = (totalToUse * 1.0) / (totalCases); // Prob male or female
  double p1 = 0.0;
  double p2 = 0.0;
  double p3 = 0.0;
  if (withSmoothing == false)
  {
    p1 = (jointCounts[0][occupationIndex][sexIndex] * 1.0) / totalToUse;
    p2 = (jointCounts[1][dominanceIndex][sexIndex] * 1.0) / totalToUse;
    p3 = (jointCounts[2][heightIndex][sexIndex] * 1.0) / totalToUse;
  }
  else if (withSmoothing == true)
  {
    p1 = (jointCounts[0][occupationIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0); 
    p2 = (jointCounts[1][dominanceIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0);
    p3 = (jointCounts[2][heightIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0);
  }
  //return p0 * p1 * p2 * p3; // Risky if any very small values
  return Math.Exp(Math.Log(p0) + Math.Log(p1) + Math.Log(p2) + Math.Log(p3));
}

Method PartialProbability is mostly hardcoded for clarity. For example, there are four probability pieces, p0, p1, p2 and p3. You can make PartialProbability more general by using an array of probabilities where the size of the array is determined from the jointCounts array.

Notice that instead of returning the product of the four probability pieces, the method returns the equivalent Exp of the sum of the Log of each piece. Using log probabilities is a standard technique in machine-learning algorithms that’s used to prevent numeric errors that can occur with very small real numeric values.
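
With only four terms the direct product is safe, but with many attributes the product of small probabilities can underflow to 0.0 while the equivalent sum of logs stays well within range. A small illustration (not part of the demo):

// Illustration: multiplying 200 probabilities of 0.01 underflows to 0.0
// (10^-400 is smaller than the smallest representable double), but the
// equivalent sum of logs is only about -921.0.
double product = 1.0;
double logSum = 0.0;
for (int i = 0; i < 200; ++i)
{
  product *= 0.01;
  logSum += Math.Log(0.01);
}
Console.WriteLine(product); // 0
Console.WriteLine(logSum);  // -921.03...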

Wrapping Up

The example presented here should give you a good foundation for adding Naive Bayes classification features to your .NET applications. Naive Bayes classification is a relatively crude technique, but it does have several advantages over more-sophisticated alternatives such as neural network classification, logistic regression classification and support vector machine classification. Naive Bayes is simple, relatively easy to implement and scales well to very large data sets. And Naive Bayes easily extends to multinomial classification problems, where the dependent variable has three or more possible values.


Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical expert for reviewing this article: Rich Caruana