February 2013

Volume 28 Number 02

Data Clustering - Detecting Abnormal Data Using k-Means Clustering

By James McCaffrey

Consider the problem of identifying abnormal data items in a very large data set, for example, identifying potentially fraudulent credit-card transactions, risky loan applications and so on. One approach to detecting abnormal data is to group the data items into similar clusters and then seek data items within each cluster that are different in some sense from other data items within the cluster.

There are many different clustering algorithms. One of the oldest and most widely used is the k-means algorithm. In this article I’ll explain how the k-means algorithm works and present a complete C# demo program. There are many existing standalone data-clustering tools, so why would you want to create k-means clustering code from scratch? Existing clustering tools can be difficult or impossible to integrate into a software system, they might not be customizable to deal with unusual scenarios, and the tools might have copyright or other intellectual property issues. After reading this article you’ll be able to experiment with k-means clustering and have the base knowledge to add clustering functionality to a .NET application.

The best way to get a feel for what k-means clustering is and to see where I’m headed in this article is to take a look at Figure 1. The demo program begins by creating a dummy set of 20 data items. In clustering terminology, data items are sometimes called tuples. Each tuple here represents a person and has two numeric attribute values, a height in inches and a weight in pounds. One of the limitations of the k-means algorithm is that it applies only in cases where the data tuples are completely numeric.

Figure 1 Clustering Using k-Means

The dummy data is loaded into an array in memory. Next, the number of clusters is set to three. Although there are advanced clustering techniques that can suggest the optimal number of clusters to use, in general data clustering is an exploratory process and the best number of clusters to use is typically found through trial and error. As you’ll see shortly, k-means clustering is an iterative process. The demo program has a variable maxCount, which is used to limit the number of times the main clustering loop will execute. Here that value is arbitrarily set to 30.

Next, behind the scenes, the demo program uses the k-means algorithm to place each data tuple into one of three clusters. There are many ways to encode a clustering. In this case, a clustering is defined by an array of int where the array index represents a tuple, and the associated array value represents the 0-based cluster ID. So, in Figure 1, tuple 0 (65.0, 220.0) is assigned to cluster 0, tuple 1 (73.0, 160.0) is assigned to cluster 1, tuple 2 (59.0, 110.0) is assigned to cluster 2, tuple 3 (61.0, 120.0) is assigned to cluster 2 and so on. Notice there are eight tuples assigned to cluster 0, five tuples assigned to cluster 1, and seven tuples assigned to cluster 2.

Next, the demo program displays the data, grouped by cluster. If you examine the clustered data you’ll see that cluster 0 might be called the heavy people cluster, cluster 1 might be called the tall people cluster, and cluster 2 might be called the short people cluster. The demo program concludes by analyzing the tuples assigned to cluster 0 and determining that, by the criterion of greatest distance from the cluster centroid, tuple 5 (67.0, 240.0) is the most abnormal tuple.

In the sections that follow, I’ll walk you through the code that produced the screenshot in Figure 1 so that you’ll be able to modify this code to meet your own needs. This article assumes you have at least intermediate-level programming skill with a C-family language, but does not assume you know anything about data clustering. I coded the demo program using C#, but I used a non-OOP style so you shouldn’t have too much difficulty refactoring the demo to another language if you wish. I present all the source code for the demo program in this article. The source code is also available at msdn.com/magazine/msdnmag0213.

The k-Means Algorithm

In principle, at least, the k-means algorithm is quite simple. But as you’ll see, some of the implementation details are a bit tricky. The central concept in the k-means algorithm is the centroid. In data clustering, the centroid of a set of data tuples is the one tuple that’s most representative of the group. The idea is best explained by example. Suppose you have three height-weight tuples similar to those shown in Figure 1:

[a] (61.0, 100.0)
[b] (64.0, 150.0)
[c] (70.0, 140.0)

Which tuple is most representative? One approach is to compute a mathematical average (mean) tuple, and then select as the centroid the tuple that is closest to that average tuple. So, in this case, the average tuple is:

[m] = ((61.0 + 64.0 + 70.0) / 3, (100.0 + 150.0 + 140.0) / 3)
    = (195.0 / 3, 390.0 / 3)
    = (65.0, 130.0)

And now, which of the three tuples is closest to (65.0, 130.0)? There are several ways to define closest. The most common approach, and the one used in the demo program, is to use the Euclidean distance. In words, the Euclidean distance between two tuples is the square root of the sum of the squared differences between each component of the tuples. Again, an example is the best way to explain. The Euclidean distance between tuple (61.0, 100.0) and the average tuple (65.0, 130.0) is:

dist(m,a) = sqrt((65.0 - 61.0)^2 + (130.0 - 100.0)^2)
          = sqrt(4.0^2 + 30.0^2)
          = sqrt(16.0 + 900.0)
          = sqrt(916.0)
          = 30.27

Similarly:

dist(m,b) = sqrt((65.0 - 64.0)^2 + (130.0 - 150.0)^2)
          = 20.02
dist(m,c) = sqrt((65.0 - 70.0)^2 + (130.0 - 140.0)^2)
          = 11.18

Because the smallest of the three distances is the distance between the math average and tuple [c], the centroid of the three tuples is tuple [c]. You might wish to experiment with the demo program by using different definitions of the distance between two tuples to see how those affect the final clustering produced.

With the notion of a cluster centroid established, the k-means algorithm is relatively simple. In pseudo-code:

assign each tuple to a randomly selected cluster
compute the centroid for each cluster
loop until no improvement or until maxCount
  assign each tuple to best cluster
   (the cluster with closest centroid to tuple)
  update each cluster centroid
   (based on new cluster assignments)
end loop
return clustering

If you search the Web, you can find several good online animations of the k-means algorithm in action. The image in Figure 2 shows the clustering produced by the demo program. The circled data item in each cluster is the cluster centroid.

Figure 2 Clustered Data and Centroids

Overall Program Structure

The overall program structure for the demo shown in Figure 1, with a few minor edits, is listed in Figure 3. I used Visual Studio 2010 to create a new C# console application named ClusteringKMeans; any recent version of Visual Studio should work, too. In the Solution Explorer window I renamed file Program.cs to ClusteringKMeansProgram.cs, which automatically renamed the template-generated class. I removed unneeded using statements at the top of the file.

Figure 3 Overall Program Structure

using System;
namespace ClusteringKMeans
{
  class ClusteringKMeansProgram
  {
    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin outlier data detection demo\n");
        Console.WriteLine("Loading all (height-weight) data into memory");
        string[] attributes = new string[] { "Height", "Weight" };
        double[][] rawData = new double[20][];
        rawData[0] = new double[] { 65.0, 220.0 };
        rawData[1] = new double[] { 73.0, 160.0 };
        rawData[2] = new double[] { 59.0, 110.0 };
        rawData[3] = new double[] { 61.0, 120.0 };
        rawData[4] = new double[] { 75.0, 150.0 };
        rawData[5] = new double[] { 67.0, 240.0 };
        rawData[6] = new double[] { 68.0, 230.0 };
        rawData[7] = new double[] { 70.0, 220.0 };
        rawData[8] = new double[] { 62.0, 130.0 };
        rawData[9] = new double[] { 66.0, 210.0 };
        rawData[10] = new double[] { 77.0, 190.0 };
        rawData[11] = new double[] { 75.0, 180.0 };
        rawData[12] = new double[] { 74.0, 170.0 };
        rawData[13] = new double[] { 70.0, 210.0 };
        rawData[14] = new double[] { 61.0, 110.0 };
        rawData[15] = new double[] { 58.0, 100.0 };
        rawData[16] = new double[] { 66.0, 230.0 };
        rawData[17] = new double[] { 59.0, 120.0 };
        rawData[18] = new double[] { 68.0, 210.0 };
        rawData[19] = new double[] { 61.0, 130.0 };
        Console.WriteLine("\nRaw data:\n");
        ShowMatrix(rawData, rawData.Length, true);
        int numAttributes = attributes.Length;
        int numClusters = 3;
        int maxCount = 30;
        Console.WriteLine("\nk = " + numClusters + " and maxCount = " + maxCount);
        int[] clustering = Cluster(rawData, numClusters, numAttributes, maxCount);
        Console.WriteLine("\nClustering complete");
        Console.WriteLine("\nClustering in internal format: \n");
        ShowVector(clustering, true);
        Console.WriteLine("\nClustered data:");
        ShowClustering(rawData, numClusters, clustering, true);
        double[] outlier = Outlier(rawData, clustering, numClusters, 0);
        Console.WriteLine("Outlier for cluster 0 is:");
        ShowVector(outlier, true);
        Console.WriteLine("\nEnd demo\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // Main
    // 14 short static method definitions here
  }
}

For simplicity I used a static method approach and removed all error-checking. The first part of the demo code sets up the height and weight data to be clustered. Because there are only 20 tuples, I hardcoded the data and stored the data in memory in an array named rawData. Typically, your data will be stored in a text file or SQL table. In those cases you’ll have to write a helper function to load the data into memory. If your data source is too large to fit into machine memory, you’ll have to modify the demo code to iterate through an external data source rather than a data array.
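For example, if your data were stored in a comma-delimited text file, a bare-bones loading helper might look like the following sketch (the file format and the absence of error-checking are illustrative assumptions, not part of the demo):

static double[][] LoadData(string fileName)
{
  // Assumes each line holds one tuple, for example "65.0,220.0"
  string[] lines = System.IO.File.ReadAllLines(fileName);
  double[][] result = new double[lines.Length][];
  for (int i = 0; i < lines.Length; ++i)
  {
    string[] tokens = lines[i].Split(',');
    result[i] = new double[tokens.Length];
    for (int j = 0; j < tokens.Length; ++j)
      result[i][j] = double.Parse(tokens[j]);
  }
  return result;
}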

After setting up the raw data, the demo program calls helper function ShowMatrix to display the data. Next, variables numAttributes, numClusters, and maxCount are assigned values of 2 (height and weight), 3 and 30, respectively. Recall that maxCount limits the number of iterations in the main algorithm processing loop. The k-means algorithm tends to converge quickly, but you might have to experiment a bit with the value of maxCount.

All the clustering work is performed by method Cluster. The method returns an int array that defines how each tuple is assigned to one cluster. After finishing, the demo program displays the encoded clustering and also displays the raw data, grouped according to cluster.

The demo program concludes by analyzing the clustered data for outlier (possibly abnormal) tuples using method Outlier. That method accepts a cluster ID and returns the values of the data tuple that’s farthest (as measured by Euclidean distance) from the cluster centroid (the most representative tuple). In this case, for cluster 0, the heavy people cluster, the outlier tuple is (67.0, 240.0), the heaviest person.

Computing Cluster Centroids

Recall that a cluster centroid is a tuple that is most representative of the tuples assigned to a cluster, and that one way to determine a cluster centroid is to compute a math average tuple and then find the one tuple that’s closest to the average tuple. Helper method UpdateMeans computes the math average tuple for each cluster and is listed in Figure 4.  

Figure 4 Method UpdateMeans

static void UpdateMeans(double[][] rawData, int[] clustering,
  double[][] means)
{
  int numClusters = means.Length;
  for (int k = 0; k < means.Length; ++k)
    for (int j = 0; j < means[k].Length; ++j)
      means[k][j] = 0.0;
  int[] clusterCounts = new int[numClusters];
  for (int i = 0; i < rawData.Length; ++i)
  {
    int cluster = clustering[i];
    ++clusterCounts[cluster];
    for (int j = 0; j < rawData[i].Length; ++j)
      means[cluster][j] += rawData[i][j];
  }
  for (int k = 0; k < means.Length; ++k)
    for (int j = 0; j < means[k].Length; ++j)
      means[k][j] /= clusterCounts[k]; // danger: yields infinity or NaN if cluster k is empty
  return;
}

Method UpdateMeans assumes that an array of arrays named means already exists, as opposed to creating the array and then returning it. Because array means is assumed to exist, you might want to make it a ref parameter. Array means is created using helper method Allocate:

static double[][] Allocate(int numClusters, int numAttributes)
{
  double[][] result = new double[numClusters][];
  for (int k = 0; k < numClusters; ++k)
    result[k] = new double[numAttributes];
  return result;
}

The first index in the means array represents a cluster ID and the second index indicates the attribute. For example, if means[0][1] = 150.33 then the average of the weight (1) values of the tuples in cluster 0 is 150.33.

Method UpdateMeans first zeros out the existing values in array means, then iterates through each data tuple, tallying the count of tuples in each cluster and accumulating the attribute sums, and finally divides each accumulated sum by the appropriate cluster count. Notice that if any cluster count is 0, the division produces invalid mean values (in C#, floating-point division by zero yields infinity or NaN rather than throwing an exception), so you might want to add an explicit error-check.
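One possible check, placed just before the final division loop, is this short sketch (the exception message is illustrative):

for (int k = 0; k < numClusters; ++k)
  if (clusterCounts[k] == 0)
    throw new Exception("Empty cluster " + k + " in UpdateMeans");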

Method ComputeCentroid (listed in Figure 5) determines the centroid values—the values of the one tuple that’s closest to the average tuple values for a given cluster.

Figure 5 Method ComputeCentroid

static double[] ComputeCentroid(double[][] rawData, int[] clustering,
  int cluster, double[][] means)
{
  int numAttributes = means[0].Length;
  double[] centroid = new double[numAttributes];
  double minDist = double.MaxValue;
  for (int i = 0; i < rawData.Length; ++i) // walk thru each data tuple
  {
    int c = clustering[i]; 
    if (c != cluster) continue;
    double currDist = Distance(rawData[i], means[cluster]);
    if (currDist < minDist)
    {
      minDist = currDist;
      for (int j = 0; j < centroid.Length; ++j)
        centroid[j] = rawData[i][j];
    }
  }
  return centroid;
}

Method ComputeCentroid iterates through each tuple in the data set, skipping tuples that aren’t in the specified cluster. For each tuple in the specified cluster, the Euclidean distance between the tuple and the cluster mean is calculated using helper method Distance. The tuple values that are closest (having the smallest distance) to the mean values are stored and returned.

Method UpdateCentroids calls ComputeCentroid for each cluster to give the centroids for all clusters:

static void UpdateCentroids(double[][] rawData, int[] clustering,
  double[][] means, double[][] centroids)
{
  for (int k = 0; k < centroids.Length; ++k)
  {
    double[] centroid = ComputeCentroid(rawData, clustering, k, means);
    centroids[k] = centroid;
  }
}

Method UpdateCentroids assumes that an array of arrays named centroids exists. Array centroids is very similar to array means: The first index represents a cluster ID and the second index indicates the data attribute.

To summarize, each cluster has a centroid, which is the most representative tuple in the cluster. Centroid values are computed by finding the one tuple in each cluster that’s closest to the average tuple (the mean) in each cluster. Each data tuple is assigned to the cluster whose cluster centroid is closest to the tuple.

The Distance Function and Data Normalization

Method ComputeCentroid calls a Distance method to determine which data tuple is closest to a cluster mean. As described earlier, the most common way to measure distance from tuples to means is to use Euclidean distance:

static double Distance(double[] tuple, double[] vector)
{
  double sumSquaredDiffs = 0.0;
  for (int j = 0; j < tuple.Length; ++j)
    sumSquaredDiffs += Math.Pow((tuple[j] - vector[j]), 2);
  return Math.Sqrt(sumSquaredDiffs);
}

You might want to consider alternative ways to define distance. A very common option is to use the sum of the absolute values of the differences between each component. Because Euclidean distance squares differences, larger differences are weighted much more heavily than smaller differences.
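If you want to experiment with that alternative, often called the Manhattan distance, here’s a sketch of a replacement method; you’d swap it in for Distance wherever distances are computed:

static double ManhattanDistance(double[] tuple, double[] vector)
{
  // Sum of the absolute differences between components
  double sumAbsDiffs = 0.0;
  for (int j = 0; j < tuple.Length; ++j)
    sumAbsDiffs += Math.Abs(tuple[j] - vector[j]);
  return sumAbsDiffs;
}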

Another important factor related to the choice of distance function in the k-means clustering algorithm is data normalization. The demo program uses raw, un-normalized data. Because tuple weights are typically values such as 160.0 and tuple heights are typically values like 67.0, differences in weights have much more influence than differences in heights. In many situations, in addition to exploring clustering on raw data, it’s useful to normalize the raw data before clustering. There are many ways to normalize data. A common technique is to compute the mean (m) and standard deviation (sd) for each attribute, then for each attribute value (v) compute a normalized value nv = (v-m)/sd.
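A bare-bones normalization helper based on that technique might look like the following sketch. It computes the mean and standard deviation of each attribute column and returns a normalized copy of the data; it assumes no column has a zero standard deviation:

static double[][] Normalize(double[][] rawData)
{
  int numRows = rawData.Length;
  int numCols = rawData[0].Length;
  double[][] result = new double[numRows][];
  for (int i = 0; i < numRows; ++i)
    result[i] = new double[numCols];
  for (int j = 0; j < numCols; ++j) // each attribute column
  {
    double sum = 0.0;
    for (int i = 0; i < numRows; ++i)
      sum += rawData[i][j];
    double mean = sum / numRows;
    double ss = 0.0;
    for (int i = 0; i < numRows; ++i)
      ss += (rawData[i][j] - mean) * (rawData[i][j] - mean);
    double sd = Math.Sqrt(ss / numRows); // population standard deviation
    for (int i = 0; i < numRows; ++i)
      result[i][j] = (rawData[i][j] - mean) / sd; // nv = (v-m)/sd
  }
  return result;
}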

Assigning Each Tuple to a Cluster

With a method to compute the centroid of each cluster in hand, it’s possible to write a method to assign each tuple to a cluster. Method Assign is listed in Figure 6.

Figure 6 Method Assign

static bool Assign(double[][] rawData, 
  int[] clustering, double[][] centroids)
{
  int numClusters = centroids.Length;
  bool changed = false;
  double[] distances = new double[numClusters];
  for (int i = 0; i < rawData.Length; ++i)
  {
    for (int k = 0; k < numClusters; ++k)
      distances[k] = Distance(rawData[i], centroids[k]);
    int newCluster = MinIndex(distances);
    if (newCluster != clustering[i])
    {
      changed = true;
      clustering[i] = newCluster;
    }
  }
  return changed;
}

Method Assign accepts an array of centroid values and iterates through each data tuple. For each data tuple, the distance to each of the cluster centroids is computed and stored in a local array named distances, where the array index represents a cluster ID. Then helper method MinIndex determines the index in array distances that holds the smallest distance value, which is the ID of the cluster whose centroid is closest to the tuple.

Here’s helper method MinIndex:

static int MinIndex(double[] distances)
{
  int indexOfMin = 0;
  double smallDist = distances[0];
  for (int k = 0; k < distances.Length; ++k)
  {
    if (distances[k] < smallDist)
    {
      smallDist = distances[k]; indexOfMin = k;
    }
  }
  return indexOfMin;
}

In Assign, if the computed cluster ID is different from the existing cluster ID stored in array clustering, array clustering is updated and a Boolean flag is set to indicate that there has been at least one change in the clustering. This flag is used to determine when to stop the main algorithm loop: the loop exits when the maximum number of iterations is reached or when there’s no change in the clustering.

This implementation of the k-means algorithm assumes that there’s always at least one data tuple assigned to each cluster. As given in Figure 6, method Assign does not prevent a situation where a cluster has no tuples assigned. In practice, this usually isn’t a problem. Preventing the error condition is a bit tricky. The approach I generally use is to create an array named centroidIndexes that works in conjunction with array centroids. Recall that array centroids holds centroid values, for example (61.0, 120.0) is the centroid for cluster 2 in Figure 2. Array centroidIndexes holds the associated tuple index, for example [3]. Then in the Assign method, the first step is to assign to each cluster the data tuple that holds the centroid values, and only then does the method iterate through each remaining tuple and assign each to a cluster. This approach guarantees that every cluster has at least one tuple.
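Here’s a rough sketch of that safer Assign. Array centroidIndexes is a hypothetical addition, not part of the demo, and ComputeCentroid would need a small modification to record the index of the tuple it selects:

static bool AssignSafe(double[][] rawData, int[] clustering,
  double[][] centroids, int[] centroidIndexes)
{
  // Hypothetical: centroidIndexes[k] holds the index of the tuple
  // that was selected as the centroid of cluster k
  int numClusters = centroids.Length;
  bool changed = false;
  double[] distances = new double[numClusters];
  for (int k = 0; k < numClusters; ++k)
    clustering[centroidIndexes[k]] = k; // pin each centroid tuple
  for (int i = 0; i < rawData.Length; ++i)
  {
    if (Array.IndexOf(centroidIndexes, i) >= 0)
      continue; // centroid tuples keep their clusters
    for (int k = 0; k < numClusters; ++k)
      distances[k] = Distance(rawData[i], centroids[k]);
    int newCluster = MinIndex(distances);
    if (newCluster != clustering[i])
    {
      changed = true;
      clustering[i] = newCluster;
    }
  }
  return changed;
}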

The Cluster Method

Method Cluster, listed in Figure 7, is the high-level routine that calls all the helper and sub-helper methods to actually perform the data clustering.

Figure 7 The Cluster Method

static int[] Cluster(double[][] rawData, int numClusters,
  int numAttributes,  int maxCount)
{
  bool changed = true;
  int ct = 0;
  int numTuples = rawData.Length;
  int[] clustering = InitClustering(numTuples, numClusters, 0);
  double[][] means = Allocate(numClusters, numAttributes);
  double[][] centroids = Allocate(numClusters, numAttributes);
  UpdateMeans(rawData, clustering, means);
  UpdateCentroids(rawData, clustering, means, centroids);
  while (changed == true && ct < maxCount)
  {
    ++ct;
    changed = Assign(rawData, clustering, centroids);
    UpdateMeans(rawData, clustering, means);
    UpdateCentroids(rawData, clustering, means, centroids);
  }
  return clustering;
}

The main while loop repeatedly assigns each data tuple to a cluster, computes the new tuple means for each cluster, then uses the new means to compute the new centroid values for each cluster. The loop exits when there’s no change in cluster assignment or some maximum count is reached. Because the means array is used only to compute centroids, you might want to refactor Cluster by placing the call to UpdateMeans inside method UpdateCentroids.
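That refactoring might look like the following sketch; with it, the separate calls to UpdateMeans in method Cluster go away:

static void UpdateCentroids(double[][] rawData, int[] clustering,
  double[][] means, double[][] centroids)
{
  UpdateMeans(rawData, clustering, means); // means are now computed here
  for (int k = 0; k < centroids.Length; ++k)
    centroids[k] = ComputeCentroid(rawData, clustering, k, means);
}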

Before kicking the processing loop off, the clustering array is initialized by method InitClustering:

static int[] InitClustering(int numTuples, 
  int numClusters, int randomSeed)
{
  Random random = new Random(randomSeed);
  int[] clustering = new int[numTuples];
  for (int i = 0; i < numClusters; ++i)
    clustering[i] = i;
  for (int i = numClusters; i < clustering.Length; ++i)
    clustering[i] = random.Next(0, numClusters);
  return clustering;
}

The InitClustering method first assigns tuples 0 through numClusters-1 to clusters 0 through numClusters-1, respectively, so that every cluster starts with at least one tuple assigned. The remaining tuples are assigned to a randomly selected cluster.

A somewhat surprising amount of research has been done on k-means clustering initialization and you may want to experiment with alternatives to the approach given here. In many cases, the final clustering produced by the k-means algorithm depends on how the clustering is initialized.
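As one simple example of an alternative, the following sketch (the method name is mine, not part of the demo) initializes deterministically by dealing tuples out to clusters round-robin, which also guarantees that every cluster receives at least one tuple when there are at least as many tuples as clusters:

static int[] InitClusteringRoundRobin(int numTuples, int numClusters)
{
  // Deterministic alternative: tuple i goes to cluster i % numClusters
  int[] clustering = new int[numTuples];
  for (int i = 0; i < numTuples; ++i)
    clustering[i] = i % numClusters;
  return clustering;
}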

Looking for Abnormal Data

One way to use data clustering is to simply explore different clusterings and look for unexpected or surprising results. Another possibility is to look for unusual data tuples within a cluster. The demo program checks cluster 0 to find the tuple in that cluster that’s farthest from the cluster centroid using a method named Outlier, which is listed in Figure 8.

Figure 8 The Outlier Method

static double[] Outlier(double[][] rawData, int[] clustering,
  int numClusters, int cluster)
{
  int numAttributes = rawData[0].Length;
  double[] outlier = new double[numAttributes];
  double maxDist = 0.0;
  double[][] means = Allocate(numClusters, numAttributes);
  double[][] centroids = Allocate(numClusters, numAttributes);
  UpdateMeans(rawData, clustering, means);
  UpdateCentroids(rawData, clustering, means, centroids);
  for (int i = 0; i < rawData.Length; ++i)
  {
    int c = clustering[i];
    if (c != cluster) continue;
    double dist = Distance(rawData[i], centroids[cluster]);
    if (dist > maxDist)
    {
      maxDist = dist;
      Array.Copy(rawData[i], outlier, rawData[i].Length);
    }
  }
  return outlier;
}

After initializing means and centroids arrays, method Outlier iterates through each tuple in the specified cluster and computes the Euclidean distance from the tuple to the cluster centroid, then returns the values of the tuple that has the greatest distance to the centroid values. A minor alternative for you to consider is to return the index of the farthest data tuple.

There are many other ways you can examine clustered data for abnormalities. For example, you might want to determine the average distance between each tuple and its assigned cluster centroid, or you might want to examine the distances of the cluster centroids from each other.
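For instance, here’s a sketch of a helper that computes the average tuple-to-centroid distance for a specified cluster, reusing the existing Distance method and a precomputed centroids array; a large average suggests a loose, spread-out cluster:

static double AverageDistance(double[][] rawData, int[] clustering,
  double[][] centroids, int cluster)
{
  // Average Euclidean distance from each tuple in the specified
  // cluster to that cluster's centroid; assumes a non-empty cluster
  double sum = 0.0;
  int count = 0;
  for (int i = 0; i < rawData.Length; ++i)
  {
    if (clustering[i] != cluster) continue;
    sum += Distance(rawData[i], centroids[cluster]);
    ++count;
  }
  return sum / count;
}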

Display Routines

For the sake of completeness, here are some simplified display routines. The code download has slightly fancier versions. If you use these simplified routines, you’ll have to modify their calls in the Main method. To display raw data, means and centroids you can use:

static void ShowMatrix(double[][] matrix)
{
  for (int i = 0; i < matrix.Length; ++i)
  {
    Console.Write("[" + i.ToString().PadLeft(2) + "]  ");
    for (int j = 0; j < matrix[i].Length; ++j)
      Console.Write(matrix[i][j].ToString("F1") + "  ");
    Console.WriteLine("");
  }
}

To display the clustering array you can use:

static void ShowVector(int[] vector)
{
  for (int i = 0; i < vector.Length; ++i)
    Console.Write(vector[i] + " ");
  Console.WriteLine("");
}

To display an outlier’s values you can use:

static void ShowVector(double[] vector)
{
  for (int i = 0; i < vector.Length; ++i)
    Console.Write(vector[i].ToString("F1") + " ");
  Console.WriteLine("");
}

And to display raw data grouped by cluster you can use:

static void ShowClustering(double[][] rawData, 
  int numClusters, int[] clustering)
{
  for (int k = 0; k < numClusters; ++k) // Each cluster
  {
    for (int i = 0; i < rawData.Length; ++i) // Each tuple
      if (clustering[i] == k)
      {
        for (int j = 0; j < rawData[i].Length; ++j)
          Console.Write(rawData[i][j].ToString("F1") + " ");
        Console.WriteLine("");
      }
    Console.WriteLine("");
  }
}

Wrapping Up

Data clustering is closely related to and sometimes confused with data classification. Clustering is an unsupervised technique that groups data items together without any foreknowledge of what those groups might be. Clustering is typically an exploratory process. Classification, in contrast, is a supervised technique that requires the specification of known groups in training data, after which each data tuple is placed into one of these groups. Classification is typically used for prediction purposes.

The code and explanation presented in this article should give you enough information to experiment with k-means data clustering, or to create a fully customizable standalone clustering tool, or to add clustering features to a .NET application without relying on any external dependencies. There are many other clustering algorithms in addition to k-means and I’ll present some of these in future MSDN Magazine articles, including data entropy minimization, category utility and Naive Bayes inference.


Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Darren Gehring