Optimizing NTLM authentication flow in multi-domain environments
I’ll start with the obvious: Kerberos is the way to go. NTLM is less secure and is being de-emphasized in recent versions of the OS. Your first option should always be to try to make your applications work with Kerberos. But things take time, and it will be a long while before we find ourselves in an NTLM-less environment.
We live in the real world and find ourselves dealing with NTLM on a daily basis, so let’s start with some background and look at how basic NTLM authentication works (this is explained here in more detail):
1) (Interactive authentication only) A user accesses a client computer and provides a domain name, user name, and password. The client computes a cryptographic hash of the password and discards the actual password (note that this is not the actual user authentication, which would preferably be done using Kerberos).
2) The client sends the user name to the server (in plaintext).
3) The server generates an 8-byte random number, called a challenge or nonce, and sends it to the client.
4) The client encrypts this challenge with the hash of the user's password and returns the result to the server. This is called the response.
5) The server sends the following three items to the domain controller:
   - User name
   - Challenge sent to the client
   - Response received from the client
6) The domain controller uses the user name to retrieve the hash of the user's password from the Security Account Manager database. It uses this password hash to encrypt the challenge.
7) The domain controller compares the encrypted challenge it computed (in step 6) to the response computed by the client (in step 4). If they are identical, authentication is successful and the domain controller notifies the server.
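The challenge/response exchange above can be sketched in a few lines of Python. This is a simplified illustration, not the NTLM wire protocol: SHA-256 and HMAC stand in for the real NTLM hash and response functions, and the names `ACCOUNTS`, `password_hash`, `server_make_challenge`, `client_response` and `dc_verify` are hypothetical helpers made up for this example.

```python
import hashlib
import hmac
import os


def password_hash(password: str) -> bytes:
    # Stand-in hash (assumption): real NTLM derives its password hash differently.
    return hashlib.sha256(password.encode("utf-16-le")).digest()


# Hypothetical account store playing the role of the DC's account database.
ACCOUNTS = {"EMEA\\guyte": password_hash("correct horse")}


def server_make_challenge() -> bytes:
    # The server generates a random nonce and sends it to the client.
    return os.urandom(8)


def client_response(password: str, challenge: bytes) -> bytes:
    # The client keys the challenge with the hash of the password.
    return hmac.new(password_hash(password), challenge, hashlib.sha256).digest()


def dc_verify(user: str, challenge: bytes, response: bytes) -> bool:
    # The DC looks up the stored hash, repeats the computation and compares.
    stored = ACCOUNTS.get(user)
    if stored is None:
        return False
    expected = hmac.new(stored, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)


challenge = server_make_challenge()
print(dc_verify("EMEA\\guyte", challenge, client_response("correct horse", challenge)))
```

The point to notice is that the server never sees the password or its hash. It only relays the user name, challenge and response, which is why it depends on a DC that holds the user's hash.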
As you can see, NTLM is quite chatty and requires the server hosting the resource to validate the user's credentials by consulting a Domain Controller from the user’s domain.
Nothing new so far, right? So let’s complicate things a bit. Let’s look at what happens when our environment consists of a single forest with 3 domains (a forest root domain and 2 child domains):
The specific scenario I want to discuss is a user from the EMEA domain accessing a resource on a server in the AMERICAS domain using (god forbid) NTLM. If we look at what happens behind the scenes, we will see the following flow:
Not exactly something you would expect, right? Waaaay far from optimal.
- Client (GUYTEWKS) sends username (EMEA\guyte) to FILESRV01.americas.bigcorp.net
- FILESRV01 generates NTLM challenge and sends it back to the client
- The client sends the NTLM response, containing the challenge encrypted with the hash of the user's password, to FILESRV01
- FILESRV01 passes the authentication request to a DC in its own domain with which it has an active secure channel (USDC01.americas.bigcorp.net in our case)
- The USDC01 DC sees that the authentication request is not for its own domain and consults the trust information. The AMERICAS domain does not have a direct trust with the EMEA domain, but the forest root domain does, hence the DC in the AMERICAS domain decides to route the request to a DC in the ROOT domain
- USDC01 passes the authentication request to a DC in ROOT domain to which it has an active secure channel (ROOTDC01.bigcorp.net)
- ROOTDC01 receives the request, but cannot authenticate it locally. Fortunately the ROOT domain has a trust with the EMEA domain, so ROOTDC01 routes the request to a DC in the EMEA domain
- ROOTDC01 passes the authentication request to a DC in EMEA domain to which it has an active secure channel (EUDC01.emea.bigcorp.net)
- EUDC01 sees that it can authenticate the request locally and verifies the credentials.
- We then go all the way back along the same path to FILESRV01 to tell it whether the authentication attempt was successful
Quite a trip, eh? I bet quite a few folks overlooked the fact that we need to visit the forest root DCs to complete the authentication. Now, armed with that knowledge, let’s look at the following scenario:
See the problem? In this scenario, to authenticate a user who is physically located at the EU-UK site and is accessing a resource in the US-NY site over a slow site link, we need to go over the WAN to consult a DC in the EMEA domain, which increases the time it takes to authenticate the user. Taking a closer look at the behavior of the netlogon service reveals the following (Nick, thanks for the info!):
- Netlogon sets up a number of TCP streams (governed by the MaxConcurrentApi setting, which controls the number of threads reserved for simultaneous authentication requests in netlogon). By default MaxConcurrentApi over a trust is limited to 1.
- For each trusted domain, a DC maintains a secure channel with a DC from that domain (the DC is located by means of the DC Locator mechanism)
- Each DC measures the per-secure channel authentication response time:
- If a single authentication request takes more than 15 seconds, a variable called COUNT is increased
- If COUNT >= 2, the secure channel is marked as failed and DC discovery is initiated (the DC we were talking to is considered at that point as slow or unresponsive and another DC is picked)
- If a single authentication request takes more than 45 seconds or a call waits for a slot for more than 45 seconds (remember that the number of simultaneous authentication requests is governed by MaxConcurrentApi), the secure channel is marked as failed and DC discovery is initiated (we pick another DC)
- If 5 fast (< 0.4 seconds) authentication requests are performed in a row on the same secure channel, we decrease the COUNT by 1. This mechanism is in place to prevent re-establishment of a secure channel in the case of intermittent slowness
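The bookkeeping described in the list above can be modelled as a small state machine. This is a sketch built only from the thresholds quoted in this post, not netlogon's actual implementation. The class name `SecureChannelHealth` is hypothetical, and a few details are my own assumptions: COUNT never drops below zero, the fast streak restarts after each decrement, and any request that is not fast breaks the streak.

```python
from dataclasses import dataclass

SLOW_SECONDS = 15.0    # a slower request increases COUNT
FATAL_SECONDS = 45.0   # a slower request fails the channel immediately
FAST_SECONDS = 0.4     # a faster request counts towards the fast streak
FAST_STREAK = 5        # this many fast requests in a row decrease COUNT by 1
COUNT_LIMIT = 2        # COUNT >= 2 marks the channel as failed


@dataclass
class SecureChannelHealth:
    dc_name: str
    count: int = 0
    fast_in_a_row: int = 0
    failed: bool = False

    def record(self, seconds: float) -> bool:
        """Record one authentication request; return True once the channel has failed."""
        if seconds > FATAL_SECONDS:
            self.failed = True
        elif seconds > SLOW_SECONDS:
            self.fast_in_a_row = 0
            self.count += 1
            if self.count >= COUNT_LIMIT:
                self.failed = True
        elif seconds < FAST_SECONDS:
            self.fast_in_a_row += 1
            if self.fast_in_a_row == FAST_STREAK:
                self.fast_in_a_row = 0
                self.count = max(0, self.count - 1)  # assumption: floor at zero
        else:
            self.fast_in_a_row = 0  # assumption: a mid-speed request breaks the streak
        return self.failed


channel = SecureChannelHealth("EUDC01")
for seconds in (20.0, 0.1, 0.1, 0.1, 0.1, 0.1, 20.0, 16.0):
    print(seconds, channel.record(seconds), channel.count)
```

In the sample run the first slow request is forgiven by the five fast ones that follow, so it takes two further slow requests before the channel is marked as failed and a new DC would be picked.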
Enter the secure channel chain effect!
Assume for a moment that the WAN link between UK and NY is saturated or EUDC01 is overloaded for some reason, and the secure channels are as outlined below:
- FILESRV01 has a secure channel to USDC01.
- USDC01 has a secure channel to ROOTDC01 (for ROOT domain)
- ROOTDC01 has a secure channel to EUDC01 (for EMEA domain)
With these secure channels in place, the following happens:
- FILESRV01 sends authentication request for EMEA\guyte to USDC01
- USDC01 passes the request to ROOTDC01
- ROOTDC01 issues an authentication request to EUDC01
- Because the network is slow or EUDC01 is overloaded, the call times out after 45 seconds (or COUNT >= 2 for the secure channel)
- ROOTDC01 marks the secure channel to EUDC01 as failed and initiates rediscovery
- The authentication request from USDC01 to ROOTDC01 times out too.
- USDC01 marks the secure channel to ROOTDC01 as failed and tries to locate a new DC for the ROOT domain
- The authentication request on FILESRV01 times out, the secure channel to USDC01 is marked as failed, and FILESRV01 tries to locate another DC in the AMERICAS domain.
See what happened here? A failure to authenticate a user from the EMEA domain resulted in a member server in the AMERICAS domain considering its local DC unresponsive and switching to another DC. Chain effect in action.
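To make the cascade concrete, here is a toy model of the chain. It assumes that every caller waits at least as long as the call it forwarded, plus its own network round trip, and it only applies the 45-second rule (the COUNT path is ignored). The function `failed_channels` and the link times are invented for illustration.

```python
CALL_TIMEOUT = 45.0  # seconds, the per-request limit quoted above


def failed_channels(chain, dc_processing_seconds, timeout=CALL_TIMEOUT):
    """Return the (caller, callee) channels whose call exceeds the timeout.

    chain lists the hops from the resource server to the authenticating DC as
    (caller, callee, link_round_trip_seconds) tuples.
    """
    failed = []
    wait = dc_processing_seconds
    # Walk from the last hop back to the first: each caller's wait includes
    # everything that happens downstream of it.
    for caller, callee, link_seconds in reversed(chain):
        wait += link_seconds
        if wait > timeout:
            failed.append((caller, callee))
    return failed


chain = [
    ("FILESRV01", "USDC01", 0.01),
    ("USDC01", "ROOTDC01", 0.05),
    ("ROOTDC01", "EUDC01", 0.30),
]
print(failed_channels(chain, dc_processing_seconds=60.0))
print(failed_channels(chain, dc_processing_seconds=0.2))
```

With an overloaded EUDC01 all three channels are reported as failed, in the same order as in the walkthrough above. With a healthy one the list is empty.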
So what can be done to optimize the NTLM authentication flow?
1) Shortcut trusts to the rescue!
After the shortcut trust is established, in the scenario above, the additional hop to the forest root domain is eliminated, as DCs in AMERICAS domain will have a secure channel to DCs in EMEA domain.
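One way to see why the hop disappears is to treat domains as nodes and trusts as edges, then look for the shortest path between the resource domain and the user's domain. The sketch below does that with a breadth-first search. It is a simplification that assumes two-way trusts and shortest-path routing, and `trust_path` is a hypothetical helper, not how netlogon is implemented.

```python
from collections import deque


def trust_path(trusts, source, target):
    """Return the shortest chain of domains from source to target, or None."""
    graph = {}
    for a, b in trusts:
        # Assumption: every trust in the list is two-way.
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


parent_child = [("ROOT", "AMERICAS"), ("ROOT", "EMEA")]
with_shortcut = parent_child + [("AMERICAS", "EMEA")]

print(trust_path(parent_child, "AMERICAS", "EMEA"))   # ['AMERICAS', 'ROOT', 'EMEA']
print(trust_path(with_shortcut, "AMERICAS", "EMEA"))  # ['AMERICAS', 'EMEA']
```

Every domain in the returned path is a DC that has to handle the request over a secure channel, so a shorter path means fewer places for the chain effect to start.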
2) In sites with resources that will be accessed over NTLM by users from remote domains, consider introducing DCs from those remote domains.
3) Monitor the netlogon performance counters. The counters you are interested in are outlined at the bottom of the following blog post: http://blogs.technet.com/b/mikelag/archive/2009/08/04/the-case-of-the-mysterious-exchange-server-hang.aspx
4) Monitor the secure channel on member servers and DCs (in multi-domain scenarios) using nltest.exe.
nltest /SC_QUERY:<domain name> will show you the DC the server has a secure channel with for the domain specified. If you start seeing frequent changes, it’s time to fire up perfmon and use the counters from the previous item.
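If you want to track this over time rather than by hand, a small wrapper can pull the DC name out of the nltest output. The sketch below assumes the output contains a line of the form `Trusted DC Name \\<dc>`, as in the sample text. Check it against what your own servers print. `parse_trusted_dc` and `query_secure_channel` are hypothetical helper names, and the sample output is illustrative, not captured from a real server.

```python
import re
import subprocess

# Illustrative output (assumption), not captured from a real server.
SAMPLE_OUTPUT = r"""Flags: 30 HAS_IP  HAS_TIMESERV
Trusted DC Name \\USDC01.americas.bigcorp.net
Trusted DC Connection Status Status = 0 0x0 NERR_Success
The command completed successfully
"""


def parse_trusted_dc(nltest_output: str):
    """Return the DC name from nltest /SC_QUERY output, or None if not found."""
    match = re.search(r"^Trusted DC Name\s+\\\\(\S+)", nltest_output, re.MULTILINE)
    return match.group(1) if match else None


def query_secure_channel(domain: str):
    """Run nltest for the given domain (Windows only) and return the DC name."""
    result = subprocess.run(
        ["nltest", f"/SC_QUERY:{domain}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_trusted_dc(result.stdout)


print(parse_trusted_dc(SAMPLE_OUTPUT))  # USDC01.americas.bigcorp.net
```

Calling `query_secure_channel` on a schedule and logging whenever the returned name changes gives you a simple record of how often the secure channel is being re-established.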