NTLM Authentication with SharePoint Part 2

In my last post I laid out the basic flow of NTLM authentication with SharePoint when all the accounts (user, service and machine) reside in the same domain. In this post I will discuss the implications of multiple domains in two different scenarios.

Scenario 1:

Active Directory Forest = Fabrikam.com; Domain for users = CHILD.Fabrikam.com; SharePoint WFE, SQL DB have machine accounts in Fabrikam.com; SharePoint Application Pool and SQL Service accounts are in Fabrikam.com.

In this scenario, the secure channel DC servicing SharePoint has to contact its peer DC in the CHILD domain via the trust. By default, MaxConcurrentApi for a domain controller over a trust is one. That’s right: one concurrent request (one user at a time) is processed over the trust. That’s why adjusting MaxConcurrentApi on the DCs servicing SharePoint (or any other high-volume application; ISA comes to mind) is important. Again, profile and test; don’t just jump to ten.
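To get a feel for why a single slot hurts, here is a rough back-of-the-envelope sketch in Python. It assumes a deliberately simplified model that is mine, not Netlogon's actual scheduling: each pass-through authentication holds one MaxConcurrentApi slot for the full round trip, so the ceiling is slots divided by latency. The function name and the 250 ms latency figure are illustrative assumptions, not measurements.

```python
def max_auth_per_second(max_concurrent_api: int, avg_latency_seconds: float) -> float:
    """Upper bound on pass-through authentications per second.

    Simplified model (an assumption): every request occupies one
    MaxConcurrentApi slot for the whole round trip to the trusted DC,
    so throughput cannot exceed slots / latency.
    """
    if max_concurrent_api < 1:
        raise ValueError("max_concurrent_api must be at least 1")
    if avg_latency_seconds <= 0:
        raise ValueError("avg_latency_seconds must be positive")
    return max_concurrent_api / avg_latency_seconds


if __name__ == "__main__":
    # Hypothetical round trip of 250 ms across the trust.
    latency = 0.25
    for slots in (1, 2, 4):
        ceiling = max_auth_per_second(slots, latency)
        print(f"MaxConcurrentApi={slots}: at most {ceiling:.0f} authentications/sec")
```

With the assumed 250 ms round trip, one slot caps out at 4 authentications per second and four slots at 16. Real numbers depend on your DCs and links, which is why profiling comes before raising the value.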

Scenario 2:

Active Directory Forest = Fabrikam.com; Domain for users = CHILD.Fabrikam.com; SharePoint WFE, SQL DB have machine accounts in Fabrikam.com; SharePoint Application Pool and SQL Service accounts are in GrandChild.Fabrikam.com.

In this scenario, you have the same need to walk the trust for users, but you also have a new need to walk the trust for the service accounts.

These two scenarios raise another item to consider under high volumes of authentication: Secure Channel “float”. There are a handful of reasons why a secure channel resets to a different DC. The first is a response time greater than or equal to 45 seconds. This is usually the result of a secure channel established over a slow link or to a DC that is overloaded (high CPU). The second is a network failure on the path to the secure channel DC. This can be caused by a physical network failure; Spanning Tree running on the switch, which is outlined here; a hiccup from auto-negotiation (determining the speed and duplex settings) between the NIC and the switch, outlined here; or the secure channel DC being rebooted.

Once the secure channel is unbound from a DC, it goes through the DC Locator process to find a new DC. If you have multiple geographical sites in your environment, it is important to define Active Directory sites that keep your SharePoint servers using local DCs. Under high load, the last thing you want is a secure channel DC on the far side of a slow WAN link, and this can happen if you don’t architect for it in your design. The same bottleneck can occur if the DC/GCs for the domains you are authenticating sit across slow links. For example, in Scenario 1, if the DC/GC for the CHILD domain is over a slow link, a bottleneck is possible. The better design would have DC/GCs for the Fabrikam.com and CHILD domains close (high-speed links) to the SharePoint servers, and an Active Directory site specified to keep secure channels local if the DC Locator process is called.

To sum up my recommendations for best performance:

  1. Consider creating an Active Directory site just for the SharePoint boxes (if in the same forest) and add GCs for each domain going against SharePoint.
  2. Make certain that the DC/GCs are physically as close (high-speed links) as possible to the SharePoint boxes.
  3. If possible, make all DCs GCs when the domain is in Native Mode.
  4. Hard-set the speed and duplex settings on NICs and switches to avoid loss of connectivity during auto-negotiation.
  5. Check with your switch vendor on the settings for spanning tree to avoid Secure Channel drops. Most vendors have an option to keep this from happening while still benefiting from Spanning Tree.
  6. Increase MaxConcurrentApi and profile the DC/GCs (for the domains in play) with SPA to see if they can handle the load. Make certain to do this on the SharePoint servers and on the DC/GCs for all domains in play.
  7. Monitor Secure Channels with NLTest.exe after patches that cause a reboot to ensure that secure channels don’t float to slow link DC/GCs.
  8. For extreme performance consider the use of x64 DC/GCs. See the impressive results here.
  9. If possible, change to Kerberos authentication.
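The secure channel check in item 7 can be scripted. Below is a minimal Python sketch that pulls the secure channel DC out of the text printed by `nltest /sc_query:<domain>` and flags it if it is not on a list of DCs you consider local. The helper names (`secure_channel_dc`, `has_floated`), the sample output and the DC names are all hypothetical; the sample mirrors the layout nltest typically prints, so verify it against the output on your own servers before relying on the pattern.

```python
import re
from typing import Iterable, Optional

# Illustrative text only (assumed layout); real output may differ by Windows version.
SAMPLE_OUTPUT = r"""Flags: 30 HAS_IP  HAS_TIMESERV
Trusted DC Name \\DC01.fabrikam.com
Trusted DC Connection Status Status = 0 0x0 NERR_Success
The command completed successfully
"""

# Matches the "Trusted DC Name \\<host>" line and captures the host name.
_DC_LINE = re.compile(r"^Trusted DC Name\s+\\\\(\S+)", re.MULTILINE)


def secure_channel_dc(nltest_output: str) -> Optional[str]:
    """Return the lower-cased secure channel DC name, or None if not found."""
    match = _DC_LINE.search(nltest_output)
    return match.group(1).lower() if match else None


def has_floated(nltest_output: str, local_dcs: Iterable[str]) -> bool:
    """True when the secure channel DC is missing or not in the local list."""
    dc = secure_channel_dc(nltest_output)
    return dc is None or dc not in {name.lower() for name in local_dcs}


if __name__ == "__main__":
    local = ["dc01.fabrikam.com", "dc02.fabrikam.com"]  # hypothetical local DCs
    print("Secure channel DC:", secure_channel_dc(SAMPLE_OUTPUT))
    print("Floated away from local DCs:", has_floated(SAMPLE_OUTPUT, local))
```

On a real server you would capture the command output (for example with `subprocess.run`) and pass it in place of `SAMPLE_OUTPUT`, then run the check after every patch reboot.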

For a good explanation of the troubleshooting process, check out SPAT’s blog post on the subject.

Why am I taking the time to point this out with regard to SharePoint specifically? Because slow NTLM authentication is one of the leading causes of the dreaded “Cannot connect to the configuration/site database” error, and it is rarely considered when troubleshooting that error. It is also a factor in slow portal search crawls because of the number of group membership evaluations required for security trimming.