AD RMS redundancy and Fault Tolerance – Part deux

Note: I have recently changed roles to become part of the Information Protection Team at Microsoft (the group responsible for building AD RMS and related technologies) where I will be acting as a Sr. Program Manager. Since the team already has a blog on AD RMS I have decided to concentrate my efforts in that blog, which you can find at http://blogs.technet.com/rms. My previous blog posts have already been moved there and in the future you should go to that blog for updates and news (quite a few of them are coming!).

You can find this particular post at http://blogs.technet.com/b/rms/archive/2012/04/16/licenses-and-certificates-and-how-ad-rms-protects-and-consumes-documents.aspx.

 

 

In a previous post I discussed how to deploy the back-end components of the AD RMS infrastructure in a way that’s fault tolerant, or at least fault tolerant enough for the demands of AD RMS.

But of course, all that is not too useful unless the AD RMS servers themselves provide the necessary availability. Let’s discuss now how to deploy AD RMS in a way that provides fault tolerance.

First, we need to discuss what happens when the services provided by AD RMS are not available.

The AD RMS servers perform a few functions, the most salient ones:

  • Machine activation: the process of configuring a computer to enable it to securely process AD RMS-protected content.
  • User activation: the process of issuing certificates to the user so that he or she can author and consume protected content. These are actually two separate functions, since the user is activated in two separate operations: one through which the user acquires identity certificates for authenticating in the AD RMS environment, and another through which the client acquires a certificate that enables the user to protect content. But for the purposes of this discussion we can consider them a single step.
  • Protecting content online. AD RMS provides the capability to perform document protection by talking to a server to create the necessary publishing license. But in practice this functionality is not used, at least not today. All existing applications perform RMS protection in an offline fashion, without contacting RMS for each individual protection action. As long as the user is activated, protection can be performed without contacting the server.
  • Acquiring use licenses. This is necessary to consume protected content. But since most use licenses can be cached, a user only needs to acquire a use license for content that he or she hasn’t accessed before, content that wasn’t pre-licensed by Microsoft Exchange, or content whose use license has expired or is marked as non-cacheable.

So the first conclusion is that when the AD RMS platform is down or unreachable, we can still perform some actions. We can consume content that we accessed previously for which we still have a valid license. We can consume content that was pre-licensed by Microsoft Exchange. We can protect new content, or reply to existing protected email.

But there are some things that can’t be done when the AD RMS infrastructure is unreachable. We can’t activate new users, or existing users on new devices. We can’t renew existing users’ certificates when they expire, which typically happens after one year. Perhaps most importantly, we can’t acquire licenses for new content for which we don’t already have a valid use license.
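The split between what keeps working and what doesn’t can be summarized in a few lines of Python. This is only an illustrative sketch of the rules described above, not anything the RMS client actually exposes: the `ClientState` fields and the `works_while_server_down` function are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    """Hypothetical summary of what a client already holds locally."""
    user_activated: bool           # valid user certificates exist on this device
    has_cached_use_license: bool   # a valid, cacheable use license is stored locally
    prelicensed_by_exchange: bool  # Exchange attached a use license to the message


def works_while_server_down(operation: str, state: ClientState) -> bool:
    """Return True if the operation can complete without reaching AD RMS."""
    if operation == "activate":
        # Machine/user activation and certificate renewal always need the server.
        return False
    if operation == "protect":
        # Offline protection only needs an already activated user.
        return state.user_activated
    if operation == "consume":
        # Consumption needs an activated user plus a license obtained earlier.
        return state.user_activated and (
            state.has_cached_use_license or state.prelicensed_by_exchange
        )
    raise ValueError(f"unknown operation: {operation}")


# An activated user opening a document for the first time during an outage.
state = ClientState(user_activated=True,
                    has_cached_use_license=False,
                    prelicensed_by_exchange=False)
print(works_while_server_down("protect", state))  # True
print(works_while_server_down("consume", state))  # False
```

The last case is the one that hurts most in practice: the user is fully set up, but new content stays locked until a server answers.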

So we need to find ways to make our RMS infrastructure more resilient.

AD RMS provides the capability to deploy Licensing-Only servers. These servers, also called sub-enrolled servers, are somewhat similar to a child Certification Authority in a PKI, since their Server Licensor Certificate is signed by the “parent” RMS certification server (which is, for that reason, also called the Root RMS server). Licensing-Only servers are similar to certification servers with two big exceptions. The first one has already been mentioned: the Server Licensor Certificate for a Licensing-Only server is not self-signed like that of a Certification Cluster, but is signed by the “parent” certification server. The other one, more important, is that a Licensing-Only server, as its name implies, can only perform licensing functions. That is, it cannot perform the Certification functions, which basically consist of issuing machine and user identity certificates (SPC and RAC respectively; see my second post in this blog for more information). So it depends on the identity certificates issued by its parent Certification Server for validating users and for encrypting communications with those users.

There’s always a temptation to deploy Licensing-Only servers to scale up an RMS deployment or to provide fault tolerance, but this doesn’t work, for one big reason. When you protect content with a Licensing-Only server, the content is encrypted (indirectly) with the public key of that server’s Server Licensor Certificate. Thus, only that server (or another server that shares the same SLC) will have the private key needed to decrypt that content. Conversely, content protected with the Root server cannot be consumed by requesting a license from a Licensing-Only server, since you need the private key of the Root server in order to decrypt that content.

So, a Licensing-Only server is no good to consume content protected with the root server, and is thus no good to provide fault tolerance or redundancy to an existing RMS server. What are Licensing-Only servers good for? Well, that’s food for another post.
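To make the key-matching argument concrete, here is a toy model in Python. It does no cryptography and is not how AD RMS is implemented; it only encodes the rule that a server can license content when it holds the SLC key pair the content was protected with. All class, function, and server names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    slc_key_id: str  # stands in for the SLC key pair this server holds


@dataclass(frozen=True)
class ProtectedContent:
    title: str
    protected_with_key_id: str  # SLC whose public key (indirectly) encrypted it


def can_issue_use_license(server: Server, content: ProtectedContent) -> bool:
    """A server can license content only if it holds the matching private key."""
    return server.slc_key_id == content.protected_with_key_id


root = Server("root-node-1", slc_key_id="slc-root")
licensing_only = Server("licensing-only-1", slc_key_id="slc-sub")
same_slc_node = Server("root-node-2", slc_key_id="slc-root")

doc = ProtectedContent("budget.docx", protected_with_key_id="slc-root")

print(can_issue_use_license(root, doc))            # True
print(can_issue_use_license(licensing_only, doc))  # False: different SLC
print(can_issue_use_license(same_slc_node, doc))   # True: shares the root SLC
```

The third case is the interesting one: redundancy comes from a server that shares the same SLC, not from a server whose SLC was merely signed by it.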

So what do you need in order to provide redundancy to an existing RMS server? You need a server that has the same Server Licensor Certificate, or that at least has a copy of it. There are two ways in which that can happen.

The first one is if you install a new server and tell it to be part of the same cluster as the original server. This is an option during installation. You tell Setup to connect to the existing database of an AD RMS server, and it will add this node as a member of the AD RMS cluster. What’s more important, it will not create a new Server Licensor Certificate, but it will share the existing one (assuming you are not using a Hardware Security Module to protect the server keys, you will be prompted for the password used to encrypt the SLC private key in the RMS database of the original server).

After you have deployed a second node in an RMS cluster (or any number of additional nodes) you still don’t have redundancy, since your users will still be contacting the original RMS server for a license when they need it. In order to be able to consume protected content from either of the two (or more) nodes in the cluster, you need two things:

  1. Load Balancing. You need some hardware or software that balances traffic to any active node in the cluster, instead of going always to the same one.
  2. An alias. Since clients are going to be talking to the URL stamped in the protected documents to acquire a license, you need to map that URL’s server FQDN to the load-balanced IP address. This is why it is important never to configure the RMS URL to refer to the physical name of the first RMS server. Always use a DNS alias (typically a manually configured A record with a “fantasy” name such as “Rights” or “RMS”) for your cluster URLs when you install the first server. And use one that’s valid externally, even if you won’t be making the server accessible externally, just in case you change your mind in the future.
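A quick way to sanity-check the alias advice is to compare the host name in the cluster URL against the physical names of your nodes. The following Python sketch does just that; the function name and the host names are invented for illustration, and the URL path is only an example.

```python
from typing import Iterable
from urllib.parse import urlparse


def url_uses_alias(cluster_url: str, node_hostnames: Iterable[str]) -> bool:
    """Return True if the URL's host is not the name of any physical node.

    Both fully qualified and short (first label) names are compared, so
    "rmssrv01" and "rmssrv01.example.com" are treated as the same machine.
    """
    host = (urlparse(cluster_url).hostname or "").lower()
    if not host:
        return False  # no host in the URL, nothing sensible to check
    physical = set()
    for name in node_hostnames:
        name = name.lower()
        physical.add(name)
        physical.add(name.split(".")[0])
    return host not in physical and host.split(".")[0] not in physical


nodes = ["rmssrv01.example.com", "rmssrv02.example.com"]

# Alias that can later be pointed at a load-balanced IP address.
print(url_uses_alias("https://rms.example.com/_wmcs/licensing", nodes))       # True
# Physical server name stamped into every protected document.
print(url_uses_alias("https://rmssrv01.example.com/_wmcs/licensing", nodes))  # False
```

Running a check like this before installing the first server is cheap; fixing a physical name after documents have been protected with it is not.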

For Load Balancing you can use any of the common load balancers. You can use a hardware load balancer, which in many cases provides some niceties such as service failure detection and geographic awareness, or you can use Network Load Balancing, which is a component of Windows Server. While NLB does the trick and it’s free, it can sometimes be difficult to deploy if you don’t have the right network hardware (for instance, in many configurations you will need two network cards on each server), and it is somewhat tricky to use in a virtualized environment, sometimes requiring the virtual switches in the virtualization hosts to be configured in certain ways. So if you are running physical servers and your networking infrastructure is somewhat modern, you should be able to use NLB without problems. But if you are virtualizing servers, or have network hardware that might not support multicast traffic or doesn’t deal well with MAC address changes, you might be better off using a hardware load balancer.
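If “service failure detection” sounds abstract, this small Python sketch shows the idea: requests rotate across the nodes, and any node that fails its health check is skipped. It is a simplified illustration, not how NLB or any particular hardware load balancer works, and `pick_nodes` and the node names are hypothetical.

```python
import itertools
from typing import Callable, List, Sequence


def pick_nodes(nodes: Sequence[str],
               is_healthy: Callable[[str], bool],
               count: int) -> List[str]:
    """Choose a node for each of `count` requests, round-robin, skipping
    nodes that fail the health check. Returns fewer than `count` entries
    (possibly none) if no healthy node can be found."""
    chosen: List[str] = []
    rotation = itertools.cycle(nodes)
    attempts = 0
    max_attempts = count * len(nodes)  # bounds the loop when every node is down
    while len(chosen) < count and attempts < max_attempts:
        node = next(rotation)
        attempts += 1
        if is_healthy(node):
            chosen.append(node)
    return chosen


nodes = ["rms-node-1", "rms-node-2"]

# Both nodes up: traffic alternates.
print(pick_nodes(nodes, lambda n: True, 4))
# ['rms-node-1', 'rms-node-2', 'rms-node-1', 'rms-node-2']

# Second node down: everything goes to the first one.
print(pick_nodes(nodes, lambda n: n != "rms-node-2", 3))
# ['rms-node-1', 'rms-node-1', 'rms-node-1']
```

Without the health check, half of the requests in the second case would land on a dead node, which is exactly what you are trying to avoid.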

So is that all? Just install a second RMS server as part of the cluster, load balance it with the first one and you are all good? Well, actually yes. That’s all that’s needed to have a redundant RMS cluster.

Of course, you can then decide to do some interesting things, such as putting different RMS nodes in different locations and load balancing between those locations (another advantage of using an external load balancer, since NLB is not too easy to configure in a geographically distributed fashion). Since AD RMS is not too DB-intensive and most access to the DB is asynchronous, distributing the RMS servers this way shouldn’t cause too much trouble. Keep in mind, though, that your servers might refuse to boot if they can’t connect to the RMS DB due to excessive latency or network reliability problems, so make sure you have a decent connection between your datacenters (according to my tests, no significant packet loss and a latency of less than 70ms) if you want to do something like this. If you do want to go the geographically distributed way to provide datacenter redundancy, you can put a bunch of RMS servers in each datacenter, put the RMS DB on one side and a stand-by DB server on the other (see my previous post), and load balance with an external load balancer between the two sets of nodes. That way:

  • If one node fails, the other nodes will handle the load.
  • If the active database fails, you can quickly fail over to the second database. This will involve activating the passive DB server (with the latest copy of the data), changing the DNS alias (or setting up a hosts file on each RMS server if changes in DNS are slow in your organization) and then recycling IIS on the RMS servers to force them to see the change. This might all take a few minutes, maybe a few hours, but since the RMS servers will continue to issue use licenses while you do this, most users shouldn’t see the impact.
  • If a whole datacenter fails, you can use the remaining RMS servers and DB server to continue operating. Depending on which side fails, this might be more or less automatic, but again users shouldn’t see much of an impact.
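Before committing to a geographically distributed design, it is worth measuring the link between the datacenters against the guideline mentioned earlier (latency under 70ms and no significant packet loss). Here is a minimal Python sketch of such a check. The 70ms figure comes from my tests; the 1% loss threshold is purely an assumption standing in for “no significant packet loss”, and the function name is made up. Collecting the round-trip samples is left to whatever probing tool you prefer.

```python
from statistics import mean
from typing import Sequence

MAX_LATENCY_MS = 70.0  # guideline from the tests mentioned in this post
MAX_LOSS_RATIO = 0.01  # ASSUMPTION: "no significant packet loss" read as <= 1%


def link_is_adequate(rtt_samples_ms: Sequence[float], probes_sent: int) -> bool:
    """Judge a datacenter link from probe results.

    rtt_samples_ms holds the round-trip time of every probe that was
    answered; probes that got no reply count as lost.
    """
    if probes_sent <= 0:
        raise ValueError("probes_sent must be positive")
    if not rtt_samples_ms:
        return False  # nothing came back at all
    lost = probes_sent - len(rtt_samples_ms)
    loss_ratio = lost / probes_sent
    return mean(rtt_samples_ms) < MAX_LATENCY_MS and loss_ratio <= MAX_LOSS_RATIO


print(link_is_adequate([20.0, 25.0, 30.0], probes_sent=3))   # True
print(link_is_adequate([80.0, 90.0, 100.0], probes_sent=3))  # False: too slow
print(link_is_adequate([20.0, 25.0], probes_sent=4))         # False: half the probes lost
```

Averaging hides spikes, so if your link is bursty you may prefer to check the worst samples rather than the mean.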

So if you are really worried about service availability, this design should provide you with very good availability at a reasonably low cost.

I mentioned before that there’s another way to make the private key of a server available to another server so it can issue licenses for content protected by the first. This is by actually exporting the private key from one server and importing it into another one. Actually, you need a bit more than that, since the second server will also need a copy of the full SLC and of all the RMS templates from the other server in order to be able to license content for it. Fortunately, this is all done automatically when you export/import a Trusted Publishing Domain. By exporting a TPD from one server and importing it into another, you can enable the second server to issue licenses for content protected by the first server. But while this is a valid way to provide some sort of redundancy to your environment, adding nodes to the original cluster is usually much easier and more functional, so this makes sense as a solution only in very specific cases with particular requirements (such as complete physical isolation between environments; imagine you are deploying AD RMS in a submarine).

Trusted Publishing Domains have their uses in other situations, but I will discuss that in a future post.