The PDCe with too much to do

Hi. Mark again. As part of my role in Premier Field Engineering, I’m sometimes called upon to visit customers when a critical issue being worked by CTS needs another set of eyes. For today’s discussion, I’m going to talk you through one such visit.

It was a dark and stormy night …

Well, not really – it was mid-afternoon, but these sorts of things always have that sense of drama.

The Problem

Custom applications were hard-coded to use the PDC Emulator (PDCe) for authentication – a strategy the customer later abandoned to eliminate a single point of failure. The issue was hot because the PDCe was not processing authentication requests after a reboot.

The customer had noticed lsass.exe consuming a lot of CPU and this is where CTS were focusing their efforts.

The Investigation

Starting with the Directory Service event logs, I noticed the following:

Event Type: Information

Event Source: NTDS Replication

Event Category: Replication

Event ID: 1555

Date: <Date>

Time: <Time>

User: NT AUTHORITY\ANONYMOUS LOGON

Computer: <Name of PDCe>

Description:

The local domain controller will not be advertised by the domain controller locator service as an available domain controller until it has completed an initial synchronization of each writeable directory partition that it holds. At this time, these initial synchronizations have not been completed.

 

The synchronizations will continue.

 

also:

Event Type: Warning

Event Source: NTDS Replication

Event Category: Replication

Event ID: 2094

Date: <Date>

Time: <Time>

User: NT AUTHORITY\ANONYMOUS LOGON

Computer: <Name of PDCe>

Description:

Performance warning: replication was delayed while applying changes to the following object. If this message occurs frequently, it indicates that the replication is occurring slowly and that the server may have difficulty keeping up with changes.

Object DN: CN=<ClientName>,OU=Workstations,OU=Machine Accounts,DC=<Domain Name>,DC=com

 

Object GUID: <GUID>

 

Partition DN: DC=<Domain Name>,DC=com

 

Server: <_msdcs DNS record of replication partner>

 

Elapsed Time (secs): 440

 

 

User Action

 

A common reason for seeing this delay is that this object is especially large, either in the size of its values, or in the number of values. You should first consider whether the application can be changed to reduce the amount of data stored on the object, or the number of values. If this is a large group or distribution list, you might consider raising the forest version to Windows Server 2003, since this will enable replication to work more efficiently. You should evaluate whether the server platform provides sufficient performance in terms of memory and processing power. Finally, you may want to consider tuning the Active Directory database by moving the database and logs to separate disk partitions.

 

If you wish to change the warning limit, the registry key is included below. A value of zero will disable the check.

 

Additional Data

 

Warning Limit (secs): 10

 

Limit Registry Key: System\CurrentControlSet\Services\NTDS\Parameters\Replicator maximum wait for update object (secs)

 

 

and:

Event Type: Warning

Event Source: NTDS General

Event Category: Replication

Event ID: 1079

Date: <Date>

Time: <Time>

User: <SID>

Computer: <Name of PDCe>

Description:

Internal event: Active Directory could not allocate enough memory to process replication tasks. Replication might be affected until more memory is available.

 

User Action

Increase the amount of physical memory or virtual memory and restart this domain controller.

 

 

In summary, the PDCe hasn’t completed initial synchronisation after a reboot and it’s having memory allocation problems while it works on sorting it out. Initial synchronisation is discussed in:

Initial synchronization requirements for Windows 2000 Server and Windows Server 2003 operations master role holders
https://support.microsoft.com/kb/305476

With this information in hand, I had a chat with the customer hoping we’d identify a relevant change in the environment leading up to the outage. It became apparent they’d configured a policy for deploying RDP session certificates. Furthermore, they’d noticed clients receiving many of these certificates instead of the expected one.

RDP session certificates are Secure Sockets Layer (SSL) certificates issued to Remote Desktop servers. It is also possible to deploy RDP session certificates to client operating systems such as Windows Vista and Windows 7. More on this later…

The customer and I examined a sample client and found 285 certificates! In addition to this unusual behaviour, the certificates were being published to Active Directory. There were 3700 affected clients – approx. 1 million certificates published to AD!
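As a quick sanity check on that figure, here is the arithmetic as a short Python sketch. It assumes, purely for illustration, that every affected client held about as many certificates as the one sample client we examined, which is something we did not verify.

```python
# Rough estimate of the number of certificates published to AD.
# Assumption: every affected client holds about as many certificates
# as the single sample client examined (285); the real spread is unknown.
affected_clients = 3700
certs_per_client = 285  # taken from the one sample client

estimated_certs = affected_clients * certs_per_client
print(f"Estimated certificates published to AD: {estimated_certs:,}")
```

Under that assumption the estimate lands just over the one million mark.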

The Story So Far

We’ve injected huge amounts of certificate data into the userCertificate attribute of computer objects, we’ve got a replication backlog due to memory allocation issues, and the DC can’t complete the initial sync it needs before it will advertise itself as a DC.

What Happened Next Uncle Mark?!

The CTS engineer back at home base wanted to gather some debug logging of LSASS.exe. While attempting to gather such a log, the PDCe became completely unresponsive and we had to reboot.

While the PDCe rebooted, the customer disabled the policy responsible for deploying RDP session certificates.

After the reboot, the PDCe had stopped logging event 1079 (for memory allocation failures) but in addition to event 1555 and 2094, we were now seeing:

Event Type: Warning

Event Source: NTDS Replication

Event Category: DS RPC Client

Event ID: 1188

Date: <Date>

Time: <Time>

User: NT AUTHORITY\ANONYMOUS LOGON

Computer: <Name of PDCe>

Description:

A thread in Active Directory is waiting for the completion of a RPC made to the following domain controller.

 

Domain controller:

<_msdcs DNS record of replication partner>

Operation:

get changes

Thread ID:

<Thread ID>

Timeout period (minutes):

5

 

Active Directory has attempted to cancel the call and recover this thread.

 

User Action

If this condition continues, restart the domain controller.

 

For more information, see Help and Support Center at https://go.microsoft.com/fwlink/events.asp.

A bit more investigation with:

Repadmin.exe /showreps (or /showrepl for later versions of repadmin)

told us that all partitions were in sync except the domain partition – the partition with a million certificates attached to computer objects.

We decided to execute:

Repadmin.exe /replicate <Name of PDCe> <Closest Replication Partner> <Domain Naming Context> /force
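If you want to script this step rather than type it, a minimal Python sketch might look like the following. It only assembles the argument list for the command shown above; the function name and the example server and domain names are hypothetical, and nothing here was part of what we ran on the day.

```python
def build_replicate_command(destination_dc, source_dc, naming_context, force=True):
    """Return the argument list for the repadmin /replicate call shown above."""
    for value in (destination_dc, source_dc, naming_context):
        if not value or not value.strip():
            raise ValueError("destination, source and naming context are all required")
    command = ["repadmin.exe", "/replicate", destination_dc, source_dc, naming_context]
    if force:
        command.append("/force")
    return command


# Hypothetical names, for illustration only.
example = build_replicate_command("PDC01", "DC02", "DC=contoso,DC=com")
print(" ".join(example))
```

On a machine with repadmin installed, the returned list could be handed to subprocess.run. Keeping the arguments as a list avoids quoting problems with naming contexts that contain commas or spaces.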

Next, we waited … for several hours.

While waiting, we considered:

  • Disabling initial sync with:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters]

Repl Perform Initial Synchronizations = 0

  • Increasing the RPC timeout for NTDS with:

https://support.microsoft.com/default.aspx?scid=kb;EN-US;830746

Both of these changes require a reboot. The customer was hesitant to reboot again and while they thought it over, initial sync completed.
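We never applied the first of those changes, but for reference, here is a small Python sketch that renders the registry setting as the text of a .reg fragment. It assumes the value is a REG_DWORD, as I understand it from KB 305476; the helper name is hypothetical, and the sketch only produces text, it does not touch the registry.

```python
NTDS_PARAMETERS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"


def render_reg_fragment(key_path, value_name, dword_value):
    """Render one DWORD registry value as .reg file text (hypothetical helper)."""
    if not 0 <= dword_value <= 0xFFFFFFFF:
        raise ValueError("DWORD values must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # Assumption: the value type is REG_DWORD.
        f'"{value_name}"=dword:{dword_value:08x}',
        "",
    ]
    return "\n".join(lines)


fragment = render_reg_fragment(
    NTDS_PARAMETERS_KEY, "Repl Perform Initial Synchronizations", 0
)
print(fragment)
```

Generating the fragment as text means it can be reviewed before anyone imports it on a domain controller.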

With the PDCe authenticating clients, I headed home to get some sleep. The customer had disabled the RDP session certificate deployment policy and was busy clearing the certificate data out of computer objects in Active Directory.

Why?

The next day, I went looking for root cause. The customer had followed some guidance to deploy the RDP session certificates. Some of the guidance noted during the investigation is posted here:

https://blogs.msdn.com/b/rds/archive/2010/04/09/configuring-remote-desktop-certificates.aspx

I set up a test environment and walked through the guidance, but I could not reproduce the issue. I got a single certificate no matter how often I rebooted or applied Group Policy. In addition, RDP session certificates were not being published in Active Directory. Publishing in Active Directory is easily explained by this checkbox:

[Image: the certificate template checkbox that controls publishing to Active Directory]

An examination of the certificate template confirmed they had this checked.

So why were clients in the customer environment receiving multiple certificates while clients in my test environment received just one?

The Win

I noticed the following point in the guidance being followed by the customer:

[Image: excerpt from the guidance covering “Template display name” and “Template name”]

A bit of an odd recommendation. Sure enough, the customer’s template had different names for “Template display name” and “Template name”. I changed my test environment to make the same mistake and suddenly I had a repro – a new certificate on every reboot and policy refresh.

Some research revealed that this was a known issue. One of these fields checks whether an RDP session certificate exists while the other field obtains a new certificate. Giving both fields the same name works around the problem.
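To make the mechanism concrete, here is a toy Python model of the behaviour. It is not how Windows implements enrolment; it simply assumes that the existence check searches by one name while each newly issued certificate is recorded under the other, without claiming which template field plays which part. The function and the template names are hypothetical.

```python
def simulate_policy_refreshes(lookup_name, enroll_name, refreshes):
    """Toy model: return how many certificates a client holds after N refreshes."""
    store = []  # each entry is the name recorded on an issued certificate
    for _ in range(refreshes):
        # The existence check searches the store by one name...
        if lookup_name not in store:
            # ...but the new certificate is recorded under the other name.
            store.append(enroll_name)
    return len(store)


# Hypothetical template names, for illustration only.
print(simulate_policy_refreshes("RDPCert", "RDPCert", 285))          # → 1
print(simulate_policy_refreshes("RDPCert", "RDP Certificate", 285))  # → 285
```

With matching names the check succeeds after the first enrolment, so the client keeps one certificate. With mismatched names the check never finds what was stored, so every reboot or policy refresh adds another.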

Conclusion

So, in the aftermath of this incident, here are some general recommendations that anyone can follow to help avoid this kind of situation.

  • Follow our guidance carefully – even the weird stuff
  • Test before you deploy
  • Deploy the same way as you test
  • Avoid making critical servers more critical than they need to be

- Mark “Falkor” Renoden