Exchange Online increases its URL filtering

One of the ways in which Exchange Online detects spam, malware, and phishing is through URL filtering. We use a variety of sources, which are listed here:

https://technet.microsoft.com/en-us/library/dn458545(v=exchg.150).aspx

We use URL reputation lists in the following way (including but not limited to):

  1. At time-of-scan, if a message contains a URL that is on one of the lists we use, a weight is added to the message. This weight is combined with all the other features of the message to determine its spam/non-spam status and to set its Spam Confidence Level (SCL). Different lists have different weights.
  2. The URL lists are also used as inputs into our machine learning algorithms to see if there are any similarities between URLs, and between messages with URLs. This lets our filters make predictions about messages with URLs that are not yet on any of our lists but may be added later. That is, we are trying to pre-emptively determine that a message containing a malicious URL is spam, malware, or phishing before the URL is added to a reputation list.
  3. Our Safe Links feature, which is part of Office 365's Advanced Threat Protection, uses mostly (but not completely) the same set of URLs as the spam filter. It uses them to block a link that we think is malicious when a user clicks on it (if they have Safe Links enabled).
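To make item 1 concrete, here is a minimal sketch of how per-list weights could feed into a message score. The list names, weights, hosts, and the mapping from score to SCL are all hypothetical placeholders; the real values and the real scoring function are not described in this post.

```python
from urllib.parse import urlparse

# Hypothetical reputation lists. Names, weights, and hosts are made up
# for illustration; the real lists and weights are not public.
URL_LISTS = {
    "list_a": {"weight": 3.0, "hosts": {"bad.example"}},
    "list_b": {"weight": 1.5, "hosts": {"bad.example", "phish.example"}},
}


def url_weight(urls):
    """Sum the weight of every list that contains a host found in the message."""
    hosts = {urlparse(u).hostname for u in urls}
    total = 0.0
    for entry in URL_LISTS.values():
        if hosts & entry["hosts"]:
            total += entry["weight"]
    return total


def spam_confidence_level(other_feature_score, urls):
    """Add the URL weight to the other features and clamp to 0-9.

    The truncate-and-clamp mapping is an assumption, not the real SCL logic.
    """
    score = other_feature_score + url_weight(urls)
    return max(0, min(9, int(score)))
```

A URL that appears on two lists picks up both weights, which is one simple way to read "different lists have different weights".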

Until now, we have published all the URL lists that we use at the link above. Going forward, however, we may not publish every list.

The reason is that we recently expanded the number of URL sources we pull from. Whereas we used to go for volume with our lists, adding more and more URL lists no longer necessarily gives you better coverage. Just stuffing more and more links into a list gives diminishing returns because spammers and phishers churn through them so rapidly. The result is a list with 10 million entries, 99% of which are never seen in traffic.

Instead, we've been looking to shore up our lists by quality. We are not necessarily targeting the size of the list, but rather are diversifying based upon origin. When we evaluate a candidate list, we ask:

- How frequently does it update?

- What sources does it come from?

- Does it overlap with our existing lists? (This is an important factor.)

- Does it overlap much with another list we are evaluating?

- How much additional value does it generate relative to the price the vendor wants to charge us?

- Does it specifically target phishing?

- Does it specifically target malware?

These last two are important because we can use some of the lists that target those two types of spam as part of our Safety Tips feature.
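The overlap questions above boil down to set arithmetic. The following is a rough sketch, not our actual tooling: `evaluate_candidate` and the metric names are hypothetical, and each list is assumed to be reducible to a plain set of URL or host strings.

```python
def overlap_fraction(candidate, other):
    """Fraction of the candidate list's entries that also appear in `other`."""
    if not candidate:
        return 0.0
    return len(candidate & other) / len(candidate)


def evaluate_candidate(candidate, existing_lists, seen_in_traffic):
    """Summarize how much a candidate list adds beyond what is already in use.

    candidate        set of entries in the list under evaluation
    existing_lists   iterable of sets, one per list already in use
    seen_in_traffic  set of entries actually observed in mail
    """
    existing = set().union(*existing_lists)
    new_entries = candidate - existing
    return {
        # How much of the candidate we already have from other sources.
        "overlap_with_existing": overlap_fraction(candidate, existing),
        # How much of the candidate ever shows up in real mail.
        "hit_rate": overlap_fraction(candidate, seen_in_traffic),
        # Entries seen in traffic that no existing list covers.
        "new_hits": len(new_entries & seen_in_traffic),
    }
```

A very large list with a tiny `hit_rate` and few `new_hits` is the "10 million entries, 99% never seen" case: lots of volume, little added coverage.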

The way we try out a new list is to pull it down from the source, push it out to production, and put it in pass-through mode. We observe how much overlap there is between the contents of the list and our own traffic. We then start increasing the weight of the list, but apply it only at time-of-scan, and watch for false positives. We keep increasing the aggressiveness of the list until it is as far as it is going to go, at which point we enable it for machine learning and also for Safe Links. If we get false positives, we either decrease the weight of the list, figure out the root cause of the false positives (e.g., syntax errors in the list or problems with the downloaders), or stop using the list altogether.
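As a rough illustration of that ramp-up, here is a small state machine. It is a sketch under stated assumptions: the stage names, `MAX_WEIGHT`, `STEP`, and `FP_LIMIT` are invented for the example, and it models only two of the three responses to false positives (backing off the weight and dropping the list), not root-cause investigation.

```python
from dataclasses import dataclass

# Hypothetical tuning values, chosen only to make the example concrete.
MAX_WEIGHT = 4.0   # the most aggressive weight a list can reach
STEP = 1.0         # how much the weight changes per review
FP_LIMIT = 0.001   # false-positive rate above which we back off


@dataclass
class ListRollout:
    stage: str = "pass_through"  # pass_through -> time_of_scan -> full, or disabled
    weight: float = 0.0


def advance(rollout, fp_rate):
    """Return the next rollout state given the observed false-positive rate."""
    if rollout.stage == "disabled":
        return rollout
    if fp_rate > FP_LIMIT:
        # Too many false positives: lower the weight, or stop using the list.
        new_weight = rollout.weight - STEP
        if new_weight <= 0:
            return ListRollout("disabled", 0.0)
        return ListRollout("time_of_scan", new_weight)
    if rollout.stage == "full":
        return rollout
    if rollout.weight >= MAX_WEIGHT:
        # Fully ramped: also enable for machine learning and Safe Links.
        return ListRollout("full", rollout.weight)
    return ListRollout("time_of_scan", min(MAX_WEIGHT, rollout.weight + STEP))
```

Calling `advance` once per review period with a clean false-positive rate walks a list from pass-through up to full use; a bad period knocks it back a step.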

The goal of this is to get better protection for our customers while avoiding disruption to legitimate mail flow. That's a balancing act, and it usually takes about four weeks from start to finish.

Anyway, as I was saying earlier, we've included several new lists over the past few weeks; some of them are being used in #1-3 above, some others are only at #1, and a couple more are at stage 0 (pass-through mode). But whereas we revealed what our previous lists are, we don't necessarily plan to identify the new ones. This is for a couple of reasons:

  1. The sources have asked not to be identified.
  2. If we reveal which sources we use, a phisher can try to game the system, and we are trying to prevent that.

We still manage false positives by doing a cost/benefit analysis on each source, and we will stop using any source that does not provide enough benefit relative to the mail flow disruption it might cause.

So there you go; that's what's new in Exchange Online Protection over the past four weeks. We've incrementally started making your experience better, all in an effort to ensure you have the best email protection possible.