How Office 365 does automatic DKIM key rotation
As you can see from one of my other posts, Office 365 now lets you sign your outbound email with DKIM signatures.
The key difference between how we do it and how almost every other service does it is where the public key lives. Most services require the customer to publish the public key in their own DNS while the service signs with the private key. We instead require the customer to publish a CNAME that points to the public key in our DNS, which delegates that part of their namespace to Office 365. Furthermore, we require customers to publish not one, but two CNAMEs.
For example, here is Microsoft’s set of keys:
selector1._domainkey.microsoft.com. 3600 IN CNAME selector1-microsoft-com._domainkey.microsoft.onmicrosoft.com. (Microsoft publishes this in DNS)
selector1-microsoft-com._domainkey.microsoft.onmicrosoft.com. 3600 IN TXT "v=DKIM1\; k=rsa\; p=<public key #1>\; n=1024,1435867504,1" (Office 365 publishes this in its DNS)
selector2._domainkey.microsoft.com. 3600 IN CNAME selector2-microsoft-com._domainkey.microsoft.onmicrosoft.com. (Microsoft publishes this in DNS)
selector2-microsoft-com._domainkey.microsoft.onmicrosoft.com. 3600 IN TXT "v=DKIM1\; k=rsa\; p=<public key #2>\; n=1024,1435867505,1" (Office 365 publishes this in its DNS)
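The naming pattern in those records is regular enough to generate mechanically. Here is a minimal Python sketch of that idea. The function name dkim_cnames is made up for illustration, and the rule "replace the dots in the custom domain with dashes" is an assumption inferred from the microsoft.com example above, not a statement of how Office 365 derives the name for every domain.

```python
def dkim_cnames(custom_domain: str, initial_domain: str, ttl: int = 3600) -> list[str]:
    """Build the two CNAME lines a customer would publish.

    Assumption: the target's first label is "selectorN-" followed by the
    custom domain with dots replaced by dashes, as in the example above.
    """
    custom = custom_domain.rstrip(".")
    initial = initial_domain.rstrip(".")
    dashed = custom.replace(".", "-")
    records = []
    for n in (1, 2):
        owner = f"selector{n}._domainkey.{custom}."
        target = f"selector{n}-{dashed}._domainkey.{initial}."
        records.append(f"{owner} {ttl} IN CNAME {target}")
    return records


for line in dkim_cnames("microsoft.com", "microsoft.onmicrosoft.com"):
    print(line)
```

For the inputs shown, the two printed lines match the two CNAME records listed above.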
Why do we require a CNAME? And why two of them?
The reason for the CNAME is:
It makes it more difficult for anyone to mess it up when copy/pasting the public key. With a standardized CNAME, we can mechanically predict exactly what will work and what will not. We know that when we publish the public key, it will be correct 100% of the time. When a customer sets up the public key, it will be correct less than 100% of the time.
Setting up DKIM can be tricky. When copy/pasting or creating a public key, there are many ways to mess it up: missing semicolons, missing a single character in the key, missing the v=DKIM1 tag (which is optional but SHOULD be in the record), and so forth.
By doing the CNAME, we skip all of this.
With a CNAME, we can rotate the keys whenever we need to. Because we control the public key and private key, when we need to rotate we simply update the private key on the backend and the public key in our DNS servers. The CNAME in the customer’s DNS record still points to us, but what it points to is a new key.
The customer never has to take action. And let’s face it – if you’re an Office 365 customer and we asked you to rotate your DKIM key, you probably wouldn’t do it. With our service, we’ve built key rotation into it.
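To make the copy/paste mistakes from the first reason concrete, here is a rough Python sketch of a checker for a hand-published key record. The helper names parse_tags and check_key_record are hypothetical, and the checks cover only the mistakes mentioned above (missing semicolon, missing v=DKIM1, damaged key). It is not a full DKIM record validator.

```python
import base64
import binascii


def parse_tags(record: str) -> dict[str, str]:
    """Split a DKIM key record into tag=value pairs."""
    tags = {}
    # Zone files often show semicolons escaped as "\;", so undo that first.
    for part in record.replace("\\;", ";").split(";"):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"not a tag=value pair: {part!r}")
        tags[name.strip()] = value.strip()
    return tags


def check_key_record(record: str) -> list[str]:
    """Return the problems found; an empty list means these checks passed."""
    try:
        tags = parse_tags(record)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    for name, value in tags.items():
        # A second "=" inside a non-key value usually means two tags ran together.
        if name != "p" and "=" in value:
            problems.append(f"{name}= contains another '='; a semicolon may be missing")
    if "v" not in tags:
        problems.append("v=DKIM1 is missing")
    key = "".join(tags.get("p", "").split())
    if not key:
        problems.append("p= (public key) is missing or empty")
    else:
        try:
            base64.b64decode(key, validate=True)
        except binascii.Error:
            problems.append("p= is not valid base64; a character may be missing")
    return problems


# "QUJD" is a stand-in for a real base64-encoded public key.
print(check_key_record("v=DKIM1\\; k=rsa\\; p=QUJD"))
print(check_key_record("v=DKIM1 k=rsa; p=QUJ"))
```

A base64 check only catches some truncations, which is part of the point: a record can look fine and still be wrong.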
But why two CNAMEs?
The reason is seamless DKIM key rotation.
Suppose that we want to rotate DKIM keys on Jan 21, 2016, 10 am Pacific. At that time, we rotate the keys.
However, messages sent from our service at 9:59 am will still be signed with the old key, and if they arrive at their destination at 10:22 am, they can’t be validated because the old key is no longer in DNS.
Or, DNS may take time to update and replicate worldwide.
Or, because key distribution within our service is not instantaneous, there may be conditions when some messages are signed with the old key and some with the new key.
The end result is the same – during rotation some messages cannot be DKIM-validated for a period of time.
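The single-selector problem can be shown with a toy model in Python. This is only a sketch: it assumes one selector, an instantaneous DNS update at the rotation time, and the labels "old" and "new" in place of real keys. The names ROTATION, key_at, and verifies are invented for the example.

```python
from datetime import datetime

# Rotation time from the example above (Jan 21, 2016, 10 am).
ROTATION = datetime(2016, 1, 21, 10, 0)


def key_at(t: datetime) -> str:
    """With one selector, the same switch changes the signing key and the DNS record."""
    return "old" if t < ROTATION else "new"


def verifies(sent: datetime, arrived: datetime) -> bool:
    """A message is signed with the key in use when sent and
    checked against the key published in DNS when it arrives."""
    return key_at(sent) == key_at(arrived)


sent = datetime(2016, 1, 21, 9, 59)
arrived = datetime(2016, 1, 21, 10, 22)
print(verifies(sent, arrived))
```

Any message whose send and arrival times straddle the rotation fails in this model, even before DNS propagation delay is taken into account.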
Enter two CNAMEs to the rescue.
On Jan 14, 2016, 10 am, we publish (if it’s not already there) public key #2 at CNAME #2.
On Jan 21, 2016, 10 am, we start signing with key #2 (that is, with the private key that matches public key #2). There will still be messages in transit that are signed with key #1. Because email is a store-and-forward system, this state of affairs could continue for a while.
On Jan 28, 2016, 10 am, we assume that all messages signed with key #1 are out of our system and everyone else’s system, too. After all, we’ve been signing all mail with key #2 for a week. We then update public key #1. Messages sent more than a week ago cannot be verified anymore. But it doesn’t matter because a message doesn’t need to be reverified. And if it does… it’s too late.
The next time we want to update, say Jan 21, 2017, we repeat the process. We flip back to key #1 (which by then is a rotated key) and then update public key #2 a week later. By this time, all the old messages are out of our system and hopefully everyone else’s system.
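The timeline above can be sketched as a small Python model of one rotation cycle. The dates come from the example; the names SWITCH, GRACE, published_key, and verifies, and the key labels, are invented for illustration. It ignores DNS propagation delay and is not a description of the actual Office 365 implementation.

```python
from datetime import datetime, timedelta
from typing import Optional

SWITCH = datetime(2016, 1, 21, 10, 0)   # start signing with key #2
GRACE = timedelta(days=7)               # publish early, retire late


def signing_selector(t: datetime) -> str:
    """Selector used to sign a message sent at time t."""
    return "selector1" if t < SWITCH else "selector2"


def published_key(selector: str, t: datetime) -> Optional[str]:
    """Key visible in DNS behind a selector at time t (None if not yet published)."""
    if selector == "selector1":
        # Key #1 stays published for a week after the switch, then is replaced.
        return "key1" if t < SWITCH + GRACE else "key1-rotated"
    # Key #2 is published a week before anything is signed with it.
    return "key2" if t >= SWITCH - GRACE else None


def verifies(sent: datetime, arrived: datetime) -> bool:
    """True if the key that signed the message is still published on arrival."""
    selector = signing_selector(sent)
    signed_with = published_key(selector, sent)
    return published_key(selector, arrived) == signed_with


print(verifies(datetime(2016, 1, 21, 9, 59), datetime(2016, 1, 21, 10, 22)))
print(verifies(datetime(2016, 1, 21, 9, 59), datetime(2016, 1, 29, 12, 0)))
```

In this model the 9:59 am message that arrives at 10:22 am still verifies, because selector1 keeps serving key #1 for another week. Only a message that takes more than a week to arrive fails.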
In this manner, customers get automatic DKIM key rotation without doing anything. Furthermore, it fails over seamlessly without downtime during key rotation. It can take DNS a couple of days to replicate everywhere, and it takes our own system anywhere from a few minutes to a couple of hours (in the worst case) to update everywhere. But it doesn’t matter because we’ve built replication delay into the system. Our rotation happens outside of those windows.
In the picture below, DKIM key #1 is in red and DKIM key #2 is in blue. The horizontal lines indicate messages signed with the corresponding DKIM key.
From Nov 1, 2015 to Jan 21, 2016, all messages are red.
After Jan 21, 2016, some messages are red but they fall off after a few days, and the rest of the messages are in blue. This is the transition period.
A week later, all messages are in blue.
We rotate the red key when we are confident all messages are now blue.
This flipping back and forth happens behind the scenes as Office 365 alternates the DKIM signing selector between selector1 and selector2, which correspond to CNAME 1 and CNAME 2. Because the customer has pre-populated both records in DNS, they don’t need to be aware of which selector or key is active; Office 365 is in control. And when it comes time to flip back to key #1, the blue messages will taper off during the transition period.
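The reason nobody has to track the active selector is that every signed message names its own selector: the DKIM-Signature header carries the s= (selector) and d= (domain) tags, and the receiver looks up the key at <s>._domainkey.<d>. A minimal Python sketch of that lookup name follows. The function name key_query_name is invented, and the bh= and b= values in the sample are placeholders, not real hashes or signatures.

```python
def key_query_name(dkim_signature: str) -> str:
    """Return the DNS name a verifier queries for this signature's public key."""
    tags = {}
    for part in dkim_signature.split(";"):
        name, sep, value = part.partition("=")
        if sep:
            # Header values may be folded across lines, so drop all whitespace.
            tags[name.strip()] = "".join(value.split())
    return f"{tags['s']}._domainkey.{tags['d']}"


# Sample header value; BODYHASH and SIGNATURE are placeholders.
header = "v=1; a=rsa-sha256; d=microsoft.com; s=selector2; bh=BODYHASH; b=SIGNATURE"
print(key_query_name(header))
```

For a message signed during a selector2 period, the receiver queries selector2._domainkey.microsoft.com and follows the CNAME to whatever key Office 365 currently publishes there.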
This is how DKIM key signing ought to work: keys should be updated automatically so customers (who don’t understand DKIM that well, and are reluctant to make DNS changes) don’t have to do anything beyond the initial DNS record publishing. By operating as a middleman, Office 365 can handle it all automatically.
And I think this is a great improvement in how DKIM key management is done.