How to measure False Positive rates

As someone responsible for spam filtering here in Microsoft Forefront (i.e., I’m on the spam team and one of my tasks is to improve the service, though it’s not me all by myself), I care about two critical pieces of information:

  1. What’s our spam catch rate?
  2. What’s our false positive rate?

I’ll talk about measuring spam catch rates in a future post. But today, I’d like to look at false positives. How do you measure how much good mail you catch that should have been allowed through?

There are a few ways to do this. Hotmail has a bunch of graders who, every so often, receive a copy of a message from their own mail stream along with the question “Is this mail spam or non-spam?” The graders give their verdict and Hotmail compares it to what their filter’s actual verdict was. If the grader says “non-spam” but Hotmail’s filter said “spam”, then they have a false positive. With a big enough set of graders, this process provides a reasonably reliable number.
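As a rough illustration of that comparison (a minimal sketch, not Hotmail’s actual pipeline; the function name and sample data are made up), the grader-based FP rate is just a count of disagreements:

```python
# Minimal sketch: compute an FP rate from grader verdicts vs. the filter's verdicts
# on the same sampled messages. The data and function name are illustrative only.

def grader_fp_rate(samples):
    """samples: list of (grader_verdict, filter_verdict) pairs,
    each verdict being 'spam' or 'non-spam'."""
    legit = [s for s in samples if s[0] == "non-spam"]   # grader says it is good mail
    fps = [s for s in legit if s[1] == "spam"]           # ...but the filter marked it spam
    return len(fps) / len(legit) if legit else 0.0

# Three sampled messages; the filter junked one that the grader called good.
print(grader_fp_rate([("non-spam", "non-spam"),
                      ("non-spam", "spam"),
                      ("spam", "spam")]))                # -> 0.5
```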

This doesn’t work in the kind of enterprise environment we filter because (a) professional workers don’t want to be bothered with this every day, and (b) there are privacy issues. What works in Hotmail doesn’t work for us.

A second way to do it is with an independent test. There are organizations like Virus Bulletin that will test the filtering effectiveness of spam filters. To do this, they run honeypot spam traffic through filters and they also run legitimate mail through the filter. The problem with this is that the mail volumes are not very large and the process requires some manual validation afterwards.

My requirements are that a measurement must:

  1. Be automatic
  2. Be repeatable
  3. Have high volume
  4. Not depend upon examining the mail afterwards

It’s hard to get all of these, especially #3. Numbers only mean something when you have lots of them.

One idea that I have is to use IP whitelists. Every IP on such a whitelist is supposed to belong to a good sender that never sends spam – that’s why it is on the whitelist. If you get mail from one of these IPs and it is marked as spam, then you have a false positive.

To do this, record the daily spam/non-spam stats for each IP address that sends you email, and then take the intersection of all the IPs you saw with the ones on the whitelist. Any messages from whitelisted IPs that, according to your stats, were marked as spam are FPs. Use that as the FP rate (a quick code sketch follows the example below). For example:

Date: Oct 31, 2012

Whitelist_IP_1: Spam 0  Non-spam 68
Whitelist_IP_2: Spam 2 Non-spam 97
Whitelist_IP_3: Spam 38 Non-spam 122
Whitelist_Total: Spam 40 Non-spam 287

FP rate: 40 / (40 + 287) ≈ 12.2%
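Computationally, this is just an intersection and a couple of sums. Here is a rough sketch (the IP addresses, data layout, and function name are made up for illustration; this is not how Forefront actually stores its per-IP stats):

```python
# Rough sketch of the whitelist-intersection FP measurement. The IPs, data layout,
# and function name are made up; this is not the real stats pipeline.

# Per-IP verdict counts for one day, keyed by sending IP.
daily_stats = {
    "192.0.2.1":    {"spam": 0,   "non_spam": 68},
    "192.0.2.2":    {"spam": 2,   "non_spam": 97},
    "192.0.2.3":    {"spam": 38,  "non_spam": 122},
    "198.51.100.9": {"spam": 500, "non_spam": 3},   # not on the whitelist; ignored
}

whitelist = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}

def whitelist_fp_rate(stats, whitelist):
    """FP rate = spam-marked messages from whitelisted IPs
    divided by all messages from whitelisted IPs."""
    fps = total = 0
    for ip in whitelist.intersection(stats):        # only IPs that are on the whitelist
        fps += stats[ip]["spam"]
        total += stats[ip]["spam"] + stats[ip]["non_spam"]
    return fps / total if total else 0.0

print(f"{whitelist_fp_rate(daily_stats, whitelist):.1%}")   # -> 12.2%
```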

This approach satisfies all of my requirements:

  1. It can be automated. It is easy to automate the intersection and analysis of two different lists; no human needs to get in the middle of the process.

  2. It’s repeatable. Simply pull the list whenever you wish, check it against the total stats, and the results pop out. The methodology never changes and is consistent throughout.

  3. It has high volume. Manual analysis limits how much mail you can examine, and it’s hard to generate that much mail by hand. However, if your whitelist is big enough, it will see a statistically significant amount of traffic.

  4. It’s (reasonably) reliable. Since someone else has populated the whitelist of known good senders and is vouching for its cleanliness, there is no need to check afterwards whether the mail really is legitimate or the feed is polluted.

Of course, there are some drawbacks to this approach as well:

  1. Is the list really clean? Even though the IPs are on the whitelist, are they actually sending good mail? What if the list is polluted?

    The way to get around this is to pull multiple, independently generated whitelists. If one of the lists is an outlier, it can be weighted differently or excluded altogether. But if multiple lists are saying the same thing, then you can be reasonably sure that the data you are gathering reflects reality (see the sketch after this list).

  2. Are the lists representative of real life? Not every good piece of email comes from a good IP address. In fact, there are lots of IP addresses that send out good mail but cannot be whitelisted.

    This is alleviated by taking IP lists that are big enough to generate mail in large volumes, as well as multiple lists that are populated independently. If you have enough data points, they average out. Outliers can be excluded or weighted lower.
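To make the multi-list idea concrete, here is a small sketch. The list names, the per-list rates, and the outlier rule (drop any list whose FP rate is several times the median, then average the rest) are all illustrative assumptions, not a description of how the real feeds are weighted:

```python
# Sketch of combining FP rates measured against multiple independent whitelists.
# The list names, rates, and the outlier rule are illustrative assumptions.
from statistics import median

# FP rate measured separately against each whitelist on the same day.
per_list_fp = {
    "whitelist_A": 0.012,
    "whitelist_B": 0.016,
    "whitelist_C": 0.110,   # suspiciously high -- possibly a polluted list
}

def combined_fp_rate(rates, outlier_factor=3.0):
    """Drop any list whose FP rate is more than outlier_factor times the
    median of all lists, then average the rates that remain."""
    med = median(rates.values())
    kept = {name: r for name, r in rates.items() if r <= outlier_factor * med}
    dropped = sorted(set(rates) - set(kept))
    if dropped:
        print("Excluded as outliers:", ", ".join(dropped))
    return sum(kept.values()) / len(kept)

print(f"Combined FP rate: {combined_fp_rate(per_list_fp):.1%}")
# Excluded as outliers: whitelist_C
# Combined FP rate: 1.4%
```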

Using whitelists in this manner is a quick-and-dirty way to measure the effectiveness of a spam filter (assuming that you don’t give mail on the whitelist a free pass to the inbox; you should filter it in parallel). It’s not a perfect way to do it, but it’s fast and efficient. For internal purposes, it’s probably the best method I can think up for a ballpark estimate of how good a filter is.