The Case of the Enormous CA Database
Hello, faithful readers! Jonathan here again. Today I want to talk a little about Certification Authority monitoring and maintenance. This topic was brought to my attention by a recent case that I had where a customer’s CA database had grown to rather elephantine proportions over the course of many months quite unbeknownst to the administrators. In fact, the problem didn’t come to anyone’s attention until the CA database had consumed nearly all of the 55 GB partition on which it resided. How many of you may be in this same situation and be completely unaware of it? Hmmm? Well, in this post, I’ll first go over the details of the issue and the steps we took to resolve the immediate crisis. In the second part, I’ll cover some processes and tools you can put in place to both maintain your CA database and also alert you to possible problems that may increase its size.
Once upon a time, Roger contacted Microsoft Support and reported that he had a problem. His Windows Server 2003 Enterprise CA database, which had been given its own partition, had grown to over 50 GB in size, and was still growing. The partition itself was only 55 GB in size, so Roger asked if there was any way to compact the CA database before the CA failed due to a lack of disk space.
Actually, compacting the CA database is a simple process, and while this isn’t a terribly common request we’re pretty familiar with the steps. What made this case so unusual was the sheer size of the database file. Previously, the largest CA database I’d ever seen was only about 21 GB, and this one was over twice that size! But no matter. The principles are the same regardless, and so we went to it.
Compacting the CA Database
Compacting a CA database is essentially a two-step process. The first step is to delete any unnecessary rows from the CA database. This will leave behind what we call white space in the database file that can be reused by the CA for any new records that it adds. If we just removed the unneeded records the size of the database file would not be reduced, but we could be confident that the database file would grow no larger in size.
If the database file were smaller, this might be an acceptable solution. In this case, the size of the database file relative to the size of the partition on which it resided mandated that we also compact the database file itself.
If you are familiar with compacting the Active Directory database on a domain controller, then you will realize that this process is identical. A new database file is created and all the active records are copied from the old database file to the new database file, thus removing any of the white space. When finished, the old database file is deleted and the new file is renamed in place with the name of the old file. While actually performing the compaction, Certificate Services must be disabled.
At the end of this process, we should have a significantly smaller database file, and with appropriate monitoring and maintenance in the future we can ensure that it never reaches such difficult to manage proportions again.
What to Delete?
What rows can we safely delete from the CA database? First, you need to have a basic understanding of what exactly is stored in the CA database. When a new certificate request is submitted to the CA a new row is created in the database. As the CA processes that request it updates the various fields in that row, and the status of the row at any given moment shows where the request is in that process. What are the possible states for each row?
- Pending - A pending request is basically on hold until an Administrator manually approves the request. When approved, the request is re-submitted to the CA to be processed. On a Standalone CA, all certificate requests are pended by default. On an Enterprise CA, certificate requests are pended if the option to require CA Manager approval is selected in the certificate template.
- Failed - A failed request is one that has been denied by the CA because the request isn’t suitable per the CA’s policy, or there was an error encountered while generating the certificate. One example of such an error is if the certificate template is configured to require key archival, but no Key Recovery Agents are configured on the CA. Such a request will fail.
- Issued - The request has been processed successfully and the certificate has been issued.
- Revoked - The certificate request has been processed and the certificate issued, but the administrator has revoked the certificate.
In addition, issued and revoked certificates can either be time valid or expired.
These states, and whether or not a certificate is expired, need to be taken into account when considering which rows to delete. For example, you do not want to delete the row for a time valid, issued certificate, and in fact, you won’t be able to. You won’t be able to delete the row for a time valid, revoked certificate either because this information is necessary in order for the CA to periodically build its certificate revocation list (CRL).
Once a certificate has expired, however, then Certificate Services will allow you to delete its row. Expired certificates are no longer valid on their face, so there is no need to retain any revocation status. On the other hand, if you’ve enabled key archival then you may have private keys stored in the database row as well, and if you delete the row you’d never be able to recover those private keys.
That leaves failed and pending requests. These rows are just requests; there are no issued certificates associated with them. In addition, while technically a failed request can be resubmitted to the CA by the Administrator, unless the cause of the original failure is addressed there is little purpose in doing so. In practice, you can safely delete failed requests. Any pending requests should probably be examined by an Administrator before you delete them. A pending request means that someone out there has an outstanding certificate request for which they are patiently waiting on an answer. The Administrator should go through and either issue or deny any pending requests to clear that queue, rather than just deleting the records.
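The rules above boil down to a small decision table. Here is a minimal sketch of it in Python. The names RowState and deletion_advice are made up for illustration and are not part of any Certificate Services API; the function only encodes the guidance in this post.

```python
from enum import Enum


class RowState(Enum):
    """The four request states described above (hypothetical names)."""
    PENDING = "pending"
    FAILED = "failed"
    ISSUED = "issued"
    REVOKED = "revoked"


def deletion_advice(state, expired=False, has_archived_key=False):
    """Return 'delete', 'review', or 'keep' for a CA database row."""
    if state is RowState.FAILED:
        # Failed requests have no issued certificate behind them.
        return "delete"
    if state is RowState.PENDING:
        # Someone is waiting on an answer: issue or deny it first.
        return "review"
    # Issued or revoked certificates from here on.
    if not expired:
        # Time valid rows cannot be deleted (revoked ones feed the CRL).
        return "keep"
    if has_archived_key:
        # Deleting the row would also delete the archived private key.
        return "keep"
    return "delete"


print(deletion_advice(RowState.FAILED))
print(deletion_advice(RowState.REVOKED, expired=True))
```

Treating an expired certificate with an archived key as "keep" is the conservative reading of the paragraph above; relax it only if you are sure you will never need to recover that key.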
In this customer’s case, we decided to delete all the failed requests. But first, we had to determine exactly why the database had grown to such huge proportions.
Fix the Root Problems, First
Before you start deleting the failed requests from the database, you should ensure that you have addressed any configuration issues that led to these failures to begin with. Remember, Roger reported that the database was continuing to grow in size. It would make little sense to start deleting failed requests -- a process that requires that the CA be up and running -- if there are new requests being submitted to the CA and subsequently failing. The rows you delete could just be replaced by more failed rows and you’ll have gained nothing.
In this particular case, we found that there were indeed many request failures still being reported by the CA. These had to be addressed before we could actually do anything about the size of the CA database. When we checked the application log, we saw that Certificate Services was recording event ID 53 warnings and event ID 22 errors for multiple users. Let’s look at these events.
Event ID 53
Event ID 53 is a warning event indicating that the submitted request was denied, and containing information about why it was denied. This is a generic event whose detailed message takes the form of:
Certificate Services denied request %1 because %2. The request was for %3. Additional information: %4
%1: Request ID
%2: Reason request was denied
%3: Account from which the request was submitted
%4: Additional information
In this particular case, the actual event looked like this:
Event Type: Warning
Event Source: CertSvc
Event Category: None
Event ID: 53
Computer: <CA server>
Certificate Services denied request 22632 because The EMail name is unavailable and cannot be added to the Subject or Subject Alternate name. 0x80094812 (-2146875374). The request was for CORP02\jackburton. Additional information: Denied by Policy Module
This event means that the certificate template is configured to include the user’s email address in the Subject field, the Subject Alternative Name extension, or both, and that this particular user does not have an email address configured. When we looked at the users for which this event was being recorded, they were all either service accounts or test users. These are accounts for which there would probably be no email address configured under normal circumstances. Contributing to the problem was the fact that user autoenrollment had been enabled at the domain level by policy, and the Domain Users group had permissions to autoenroll for this particular template.
In general, one probably shouldn’t configure autoenrollment for service accounts or test accounts without specific reasons. In this case, simple User certificates intended for “real” users certainly don’t apply to these types of accounts. The suggestion in this case would be to create a separate OU wherein user autoenrollment is disabled by policy, and then place all service and test accounts in that OU. Another option is to create a group for all service and test accounts, and then deny that group Autoenroll permissions on the template. Either way, these particular users won’t attempt to autoenroll for the certificates intended for your users which will eliminate these events.
For information on troubleshooting other possible causes of these warning events, check out this link.
Event ID 22
Event ID 22 is an error event indicating that the CA was unable to process the request due to an internal failure. Fortunately, this event also tells you what the failure was. This is a generic event whose detailed message takes the form of:
Certificate Services could not process request %1 due to an error: %2. The request was for %3. Additional information: %4
%1: Request ID
%2: The internal error
%3: Account from which the request was submitted
%4: Additional information
In this particular case, the actual event looked like this:
Event Type: Error
Event Source: CertSvc
Event Category: None
Event ID: 22
Computer: <CA server>
Certificate Services could not process request 22631 due to an error: Cannot archive private key. The certification authority is not configured for key archival. 0x8009400a (-2146877430). The request was for CORP02\david.lo.pan. Additional information: Error Archiving Private Key
This event means that the certificate template is configured for key archival but the CA is not. A CA will not accept the user’s encrypted private key in the request if no valid Key Recovery Agents (KRAs) are configured. The fix for this is pretty simple for our current purposes: disable key archival in the template. If you actually need to archive keys for this particular template then you should set that up before you start removing failed requests from your database. Here are some links to more information on that topic:
Template, Template, Where’s the Template?
What’s the fastest way to determine which template is actually associated with each of these events? You can find that by looking at the failed request entry in the Certification Authority MMC snap-in (certsrv.msc). If you have more than a couple hundred failed requests, however, finding the one you actually want can be difficult. This is where filtering the view comes in handy.
1. In the Certification Authority MMC snap-in, right-click on Failed Requests, select View, then select Filter… .
2. In the Filter dialog box, click Add… .
3. In the New Restriction dialog box, set the Request ID to the value that you see in the event, and click Ok.
4. In the Filter dialog box, click Ok.
5. Now you should see just the failed request designated in the event. Right-click on it, select All Tasks, and then select View Attributes/Extensions… .
6. In the properties for this request, click on the Extensions tab. In the list of extensions, locate Certificate Template Information. The template name will be shown in the extension details.
This is the name of the template whose settings you should review and correct, if necessary.
Once the root problems causing the failed requests have been resolved, monitor the Application event log to ensure that Certificate Services is not logging any more failed requests. Some failed requests in a large environment are expected. That’s just the CA doing its job. What you’re trying to eliminate are the large bulk of the failures caused by certificate template and CA misconfiguration. Once this is complete, you’re ready to start deleting rows from the database.
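If you export the event descriptions to text, a few lines of script can tell you which accounts are generating the failures. The sketch below is one way to do that in Python, assuming the messages follow the two templates shown earlier for event IDs 53 and 22. The function names are made up for illustration, and how you export the events is up to you.

```python
import re
from collections import Counter

# Patterns built from the event 53 and event 22 message templates above.
DENIED = re.compile(
    r"Certificate Services denied request (?P<request_id>\d+) because "
    r"(?P<reason>.*?)\.\s+The request was for (?P<account>.+?)\.\s+"
    r"Additional information: (?P<info>.*)", re.DOTALL)
ERROR = re.compile(
    r"Certificate Services could not process request (?P<request_id>\d+) "
    r"due to an error: (?P<reason>.*?)\.\s+The request was for "
    r"(?P<account>.+?)\.\s+Additional information: (?P<info>.*)", re.DOTALL)


def parse_certsvc_event(message):
    """Return the fields of an event 53 or 22 message, or None."""
    for event_id, pattern in ((53, DENIED), (22, ERROR)):
        match = pattern.search(message)
        if match:
            return {
                "event_id": event_id,
                "request_id": int(match["request_id"]),
                "reason": match["reason"],
                "account": match["account"],
                "info": match["info"].strip(),
            }
    return None


def failures_by_account(messages):
    """Count parsed failure events per requesting account."""
    parsed = (parse_certsvc_event(m) for m in messages)
    return Counter(p["account"] for p in parsed if p)


# The two sample events from this post.
SAMPLE_53 = (r"Certificate Services denied request 22632 because The EMail "
             r"name is unavailable and cannot be added to the Subject or "
             r"Subject Alternate name. 0x80094812 (-2146875374). The request "
             r"was for CORP02\jackburton. Additional information: Denied by "
             r"Policy Module")
SAMPLE_22 = (r"Certificate Services could not process request 22631 due to "
             r"an error: Cannot archive private key. The certification "
             r"authority is not configured for key archival. 0x8009400a "
             r"(-2146877430). The request was for CORP02\david.lo.pan. "
             r"Additional information: Error Archiving Private Key")

print(failures_by_account([SAMPLE_53, SAMPLE_22]))
```

The Request ID it extracts is the same value you would type into the filter in the snap-in, so the two approaches work well together.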
Deleting the Failed Requests
The next step in this process is to actually delete the rows using our trusty command line utility certutil.exe. The -deleterow verb, introduced in Windows Server 2003, can be used to delete rows from the CA database. You just provide it with the type of records you want deleted and a past date (if you use a date equal to the current date or later, the command will fail). Certutil.exe will then delete the rows of that type where the date the request was submitted to the CA (or the date of expiration, for issued certificates) is earlier than the date you provide. The supported types of records are:
Name      Description                        Type of date
Request   Failed and pending requests        Submission date
Cert      Expired and revoked certificates   Expiration date
For example, if you want to delete all failed and pending requests submitted before January 22, 2001, the command is:
C:\>Certutil -deleterow 1/22/2001 Request
The only problem with this approach is that certutil.exe will only delete about 2,000 - 3,000 records at a time before failing due to exhaustion of the version store. Luckily, we can wrap this command in a simple batch file that runs the command over and over until all the designated records have been removed.
@echo off
:Top
Certutil -deleterow 8/31/2010 Request
If %ERRORLEVEL% EQU -939523027 goto Top
This batch file runs certutil.exe with the -deleterow verb. If the command fails with the specific error code indicating that the version store has been exhausted, the batch file simply loops and the command is executed again. Eventually, the certutil.exe command will exit with an ERRORLEVEL value of 0, indicating success. The script will then exit.
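If you would rather drive the loop from a script, the same retry logic can be sketched in Python as follows. This is only an illustration of the batch file's behavior, not a tested replacement for it. The names delete_until_done and run_certutil are made up, and the safety cap on the number of runs is my own addition.

```python
import subprocess

# ERRORLEVEL value used in the batch file: version store exhausted.
VERSION_STORE_EXHAUSTED = -939523027


def same_exit_code(a, b):
    """Compare exit codes as 32-bit values.

    The runtime may report the code as an unsigned number, while
    %ERRORLEVEL% shows it signed, so compare the low 32 bits.
    """
    return (a & 0xFFFFFFFF) == (b & 0xFFFFFFFF)


def run_certutil(args):
    """Run the command and return its exit code."""
    return subprocess.run(args).returncode


def delete_until_done(args, run=run_certutil, max_runs=100000):
    """Re-run the command while it fails with version store exhaustion.

    Returns (number_of_runs, final_exit_code). Any other exit code,
    success or failure, ends the loop, just like the batch file.
    """
    for runs in range(1, max_runs + 1):
        code = run(args)
        if not same_exit_code(code, VERSION_STORE_EXHAUSTED):
            return runs, code
    raise RuntimeError("gave up after %d runs" % max_runs)


if __name__ == "__main__":
    # Dry run with a stand-in runner: two exhausted runs, then success.
    fake_codes = iter([VERSION_STORE_EXHAUSTED, VERSION_STORE_EXHAUSTED, 0])
    print(delete_until_done(["certutil", "-deleterow", "8/31/2010", "Request"],
                            run=lambda args: next(fake_codes)))
```

Passing the runner in as a parameter lets you rehearse the loop without touching a real CA; drop the run argument to call certutil for real.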
Every time the command executes, it will display how many records were deleted. You may therefore want to pipe the output of the command to a text file from which you can total up these values and determine how many records in total were deleted.
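Totaling those values by hand gets old quickly after a few thousand runs. Here is a minimal sketch for adding them up from the captured output. It assumes each run prints a line of the form "Rows deleted: N"; that wording is an assumption on my part, so check it against what your certutil.exe actually prints and adjust the pattern if needed.

```python
import re

# Assumed output format; adjust to match your certutil output.
ROWS_DELETED = re.compile(r"Rows deleted:\s*(\d+)", re.IGNORECASE)


def total_rows_deleted(log_text):
    """Sum every 'Rows deleted: N' value found in the captured output."""
    return sum(int(n) for n in ROWS_DELETED.findall(log_text))


# Made-up sample of captured output from three runs.
sample = "Rows deleted: 2500\nRows deleted: 2750\nRows deleted: 310\n"
print(total_rows_deleted(sample))
```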
In Roger’s case, the total number of deleted records came to about 7.8 million rows. Yes…that is 7.8 million failed requests. The script above ran for the better part of a week, but the CA was up and running the entire time so there was no outage. Indeed, the CA must be up and running for the certutil.exe command to work as certutil.exe communicates with the ICertAdmin COM interface of Certificate Services.
That is not to say that one should not take precautions ahead of time. We increased the base CRL publication interval to seven days and published a new base CRL immediately before starting to delete the rows. We also disabled delta CRLs temporarily while the script was running. We did this so that even if something unexpected happened, clients would still be able to check the revocation status of certificates issued by the CA for an extended period, giving us the luxury of time to take any necessary remediation steps. As expected, however, none were required.
And Finally, Compaction
The final step in this process is compacting the CA database file to remove all the white space resulting from deleting the failed requests from the database. This process is identical to defragmenting and compacting Active Directory’s ntds.dit file, as Certificate Services uses the same underlying database technology as Active Directory -- the Extensible Storage Engine (ESE).
Just as with AD, you must have free space on the partition equal to or greater than the database file size. As you’ll recall, we certainly didn’t have that in this case what with a database of 50 GB on a 55 GB partition. What do you do in this case? Move the database and log files to a partition with enough free space, of course.
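That free space rule is easy to check before you stop the service. The following Python sketch compares the database file size with the free space on the partition where the compaction will run. The function names are made up for illustration and the paths are whatever applies in your environment.

```python
import os
import shutil

GB = 1024 ** 3


def has_room_to_compact(db_size_bytes, free_bytes):
    """Free space must be equal to or greater than the database size."""
    return free_bytes >= db_size_bytes


def check_partition(db_path, work_dir=None):
    """Check the partition holding db_path (or work_dir, if given)."""
    db_size = os.path.getsize(db_path)
    target = work_dir or os.path.dirname(os.path.abspath(db_path))
    free = shutil.disk_usage(target).free
    return has_room_to_compact(db_size, free)


# Roger's numbers: a 50 GB database with about 5 GB left on the partition.
print(has_room_to_compact(50 * GB, 5 * GB))
# The same database after moving to an otherwise empty 150 GB partition.
print(has_room_to_compact(50 * GB, 100 * GB))
```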
Fortunately, Roger’s backing store was on a Storage Area Network (SAN), so it was trivial to slice off a new 150 GB partition and move the database and log files to the new, larger partition. We didn’t even have to modify the CA configuration as Roger’s storage admins were able to just swap drive letters since the only thing on the original partition was the CertLog folder containing the CA database and log files. Good planning, that.
With enough free space now available, all is ready to compact the database. Well…almost. You should first take the precaution of backing up the CA database prior to starting just in case something goes wrong. The added benefit to backing up the CA database is that you’ll truncate the database log files. In Roger’s case, after deleting 7.8 million records there were several hundred megabytes of log files. To back up just the CA database, run the following command:
C:\>Certutil -backupDB backupDirectory
The backup directory will be created for you if it does not already exist, but if it does exist, it must be empty. Once you have the backup, copy it somewhere safe. And now we’re finally ready to proceed.
To compact the CA database, stop and then disable Certificate Services. The CA cannot be online during this process. Next, run the following command:
C:\>Esentutl /d Path\CaDatabase.edb
Esentutl.exe will take care of the rest. In the background, esentutl.exe will create a temporary database file and copy all the active records from the current database file to the new one. When the process is complete, the original database file will be deleted and the temporary file renamed to match the original. The only difference is that the database file should be much smaller.
How much smaller? Try 2.8 GB. That’s right. By deleting 7.8 million records and compacting the database, we recovered over 47 GB of disk space. Your own mileage may vary, though, as it depends on the number of failed requests in your own database. To finish, we just copied the now much smaller database and log files to the original drive and then re-enabled and restarted Certificate Services.
While very time consuming, simply due to the sheer number of failed requests in the database, overall the operation went off without a hitch. And everyone lived happily ever after.
Preventative Maintenance and Monitoring
Now that the CA database is back down to its fighting weight, how do you make sure you keep it that way? There are actually several things you can do, including regular maintenance and, if you have the capability, closer monitoring of the CA itself.
You’ll remember that it was not necessary to take the CA offline while deleting the failed requests. We did take precautions by modifying the CRL publication interval but fortunately that turned out to be unnecessary. Since no outage is required to remove failed requests from the CA database, it should be pretty simple to get approval to add it to your regular maintenance cycle. (You do have one, right?) Every quarter or so, run the script to delete the failed requests. You can do it more or less often as is appropriate for your own environment.
You don’t have to compact the CA database each time. Remember, the white space will simply be reused by the CA for processing new requests. Over time, you may find that you reach a sort of equilibrium, especially if you also have the freedom to delete expired certificates as well (i.e., no Key Archival), where the CA database just doesn’t get any bigger. Rows are deleted and new rows are created in roughly equal numbers, and the space within the database file is reused over and over -- a state of happy homeostasis.
If you want, you can even use scheduled tasks to automatically perform this maintenance every three months. The batch file above can be modified to run using VBScript or even PowerShell. Simply add some code to email yourself a report when the deletion process is finished; there are plenty of code samples available on the web for sending email using both VBScript and PowerShell. Bing it!
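As a starting point, here is a rough Python sketch of the two pieces a scheduled job would need: building the certutil command with a cutoff date in the past, and composing the report email. The function names, the email addresses, and the mail server mentioned in the comments are all placeholders. The date is formatted the same way as the examples in this post, which may not suit every locale.

```python
from datetime import date, timedelta
from email.message import EmailMessage


def deleterow_args(days_back, record_type="Request", today=None):
    """Build the certutil -deleterow command for a cutoff in the past."""
    if days_back < 1:
        # The command fails if the date is today or later.
        raise ValueError("cutoff date must be earlier than today")
    cutoff = (today or date.today()) - timedelta(days=days_back)
    stamp = "%d/%d/%d" % (cutoff.month, cutoff.day, cutoff.year)
    return ["certutil", "-deleterow", stamp, record_type]


def build_report(total_rows, runs, sender, recipient):
    """Compose (but do not send) a short maintenance report."""
    msg = EmailMessage()
    msg["Subject"] = "CA database maintenance: %d rows deleted" % total_rows
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "Deleted %d rows in %d certutil runs." % (total_rows, runs))
    return msg


print(deleterow_args(30, today=date(2010, 9, 30)))
report = build_report(7800000, 3120, "ca@example.com", "pki@example.com")
print(report["Subject"])
# To send it, something like the following would work, where the
# server name is a placeholder for your own mail relay:
#   import smtplib
#   with smtplib.SMTP("mail.example.com") as smtp:
#       smtp.send_message(report)
```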
In addition to this maintenance, you can also use almost any monitoring or management software to watch for certain key events on the CA. Those key events? I already covered two of them above -- event IDs 53 and 22. For a complete list of events recorded by Certificate Services, look here.
If you have Microsoft Operations Manager (MOM) 2005 or System Center Operations Manager (SCOM) 2007 deployed, and you have Windows Server 2008 or Windows Server 2008 R2 CAs, then you can download the appropriate management pack to assist you with your monitoring.
The management packs encompass event monitoring and prescriptive guidance and troubleshooting steps to make managing your PKI much simpler. These management packs are only supported for CAs running on Windows Server 2008 or higher, so this is yet one more reason to upgrade those CAs.
Like any other infrastructure service in your enterprise environment, the Windows CA does require some maintenance and monitoring to maintain its viability over time. If you don’t pay attention to it, you may find yourself in a situation similar to Roger’s, not noticing the problem until it is almost too late to do anything to prevent an outage. With proper monitoring, you can become aware of any serious problems almost as soon as they begin, and with regular maintenance you prevent such problems from ever occurring. I hope you find the information in this post useful.
Jonathan “Pork Chop Express” Stephens