The Case of the Process Startup Delays
I’ve been extremely busy here at Microsoft and so haven’t had time to blog until now, but plan on getting back to posting regularly. Before I start with a look at a technical problem I ran into recently, I’m pleased to report that the Sysinternals integration is proceeding smoothly and that Bryce and I will unveil an exciting new tool when the site moves to its new home under Microsoft TechNet in late October.
I don’t use my laptop much when I’m not traveling, but I occasionally read email on it in the living room. Like most Windows users, I’m frustrated by occasional unexplained delays when I perform routine tasks like start a program or open a web page. Since joining my laptop to an internal Microsoft domain, I began experiencing regular delays when starting processes. With my Sysinternals tools arsenal in hand, I set out to investigate the root cause, suspecting that recently joining the laptop to the Microsoft domain played a role.
I began my research by first noticing that, after a delay of a few seconds starting a new process, processes I started within the following 30 seconds launched instantly. I therefore started Process Explorer, waited for 30 seconds, and then executed Notepad from Explorer’s Run dialog. Notepad didn’t appear in Process Explorer’s process tree during the expected delay, which implied the Explorer thread starting Notepad was experiencing the pause, not Notepad’s startup.
A look at the stack of the launching Explorer thread might give me a hint about the cause, but I was too impatient to look at each of Explorer’s over a dozen threads and so attached the Windbg debugger from the Microsoft Debugging Tools for Windows to Process Explorer, launched Notepad with Process Explorer’s Run dialog, and broke into the debugger. I opened the thread list in Windbg by selecting Processes and Threads from the View menu, selected the first one displayed, and then revisited the View menu to open the Call Stack dialog:
Stacks display with the most recently called function at the top, so the ZwWaitForSingleObject frame at the top meant that the thread was waiting on some object to become signaled. The stack frames further up the stack are in the RPCRT4 (Remote Procedure Call Runtime Version 4) DLL and the reference to the OpenLpcPort function told me that the thread was trying to initiate a RPC with another process on the same system. It looked like the wait might be due to the GetMachineAccountSid call highlighted in the screenshot. Just as for domain user accounts, computers belonging to a domain have accounts and GetMachineAccountSid’s name implies that the function returns the Security Identifier (SID) of the computer’s domain account.
I set a breakpoint on the return from the call to GetMachineAccountSid in the OpenLpcPort function and after a short pause consistent with the startup delays, the debugger’s command prompt activated. The x86 calling convention is for function return values to be passed in the EAX register so I examined its value:
After translating the value to decimal with the “?” command I searched for 1789 in the global error definitions file, WinError.h, of the Platform SDK:
I scoured MSDN documentation and the Web and found essentially no information on the underlying cause for that error code. However, the term “trusted relationship failure” implies that the domain the computer is connecting to doesn’t trust the domain of the computer. But under the circumstances the error didn’t make sense, because I was disconnected from the network and even if the computer was trying to connect a domain, the only one it would connect is the one it belonged to.
On a hunch, though, I opened a command prompt and ran PsGetSid to see what error it would get when trying to look up the computer’s domain SID (a computer’s domain account name is the computer name with a “$” appended to it):
Sure enough, it experienced the same delay, which must be a network timeout, and got the same error. Then I used remote access to connect to the domain and ran the command again:
Further, after connecting to the domain I no longer experienced the startup delays. I disconnected, but continued to have delay-free process launches, even after 30 seconds. After rebooting and not connecting, though, the delays reoccurred.
At that point I decided to investigate the internals of GetMachineAccountSid. The stack trace showed that it calls into the Netlogon DLL, which performs its own RPC to a function called NetrLogonGetTrustRid. I knew that the Netlogon service runs inside of the Local Security Authority Subsystem (LSASS):
I attached Windbg to LSASS and set a breakpoint on NetrLogonGetTrustRid. After launching a new process I hit the breakpoint and saw that if a particular field in a data structure is NULL, Netlogon tries to connect to a domain controller, but if the connection fails for any reason, it blindly returns error 1789. However, when I connected to the domain the call succeeded and the value in the data structure filled in with the SID of the computer account, which persists even after a disconnect. That explained the change in behavior after I connected to the domain.
Turning my attention back to GetMachineAccountSid, I found that it caches the results of an error for 30 seconds before asking NetLogon to attempt to connect to a domain controller again. That explained the 30 second quick-start periods. A look through the code flow in the debugger also revealed that OpenLpcPort queries the computer SID as part of a check to see if it matches the SID passed as one of its parameters. If so, OpenLpcPort changes the SID to the SID of the Local System account before calling NtSecureConnectPort, mapping the domain SID to a local one. NtSecureConnectPort takes a SID as a parameter and will only connect to the specified Local Procedure Call (LPC) port if the port was created by the account that matches the SID.
I’d answered a number of questions, but the big one remained: why was an RPC happening during a process launch at all? The initial stack trace only went up as far as the NegotiateTransferSyntax frame, but there were obviously other frames that the symbol engine couldn’t determine. The stack display went further when I had hit the breakpoint I set in OpenLpcPort, though:
Near the bottom you can see the call to ShellExecCmdLine that the CRunDlg class, which is responsible for the Run dialog implementation, calls. That eventually results in what looks like the execution of shell execute hook extensions, and the one that makes the RPC call is implemented by the MpShHook DLL. I didn’t know off hand what that DLL was, but Process Explorer’s DLL view showed that it’s part of Windows Defender:
I suspected that the hook is part of Windows Defender’s real-time protection, which the Windows Defender team confirmed. Autoruns reports that Windows Defender registers the shell execute hook:
The mystery was solved! Putting it all together:
- Explorer’s Run dialog calls ShellExecuteCmdLine
- ShellExecuteCmdLine calls out to shell execute hooks
- Windows Defender’s hook for real-time protection, MpShHook.Dll, calls RPC to communicate with the Windows Defender service, passing the SID of the service as an argument
- The RPC library calls GetMachineAccountSid to see if the SID matches the computer’s domain SID, in which case it would map the SID to the local system account SID
- GetMachineAccountSid performs an RPC to the Netlogon service to get the computer account’s SID
- If the computer account’s SID hasn’t been obtained already, Netlogon tries to connect to a domain controller
- If the domain controller connection fails after a timeout (the delay), Netlogon returns a trust-relationship failure error
- The Windows Defender RPC proceeds using the unmapped SID
- Windows Defender’s service performs real-time checks and then process launches
A little more research led me to conclude that the delay only happens under very specific circumstances where:
- The system is running Windows XP 64-bit for x64 or Windows Server 2003 SP1
- Windows Defender Beta 2 is active
- The system is domain joined, but has not connected to the domain in the current boot session
32-bit Windows XP doesn’t perform the SID mapping in OpenLpcPort and Windows Defender doesn’t use a shell execute hook on Windows Vista. The Windows Defender team is looking at workarounds for the next release, but now that I understand the delay I can work around it.
Comments
Anonymous
January 01, 2003
PingBack from http://someguywitha.com/2006/08/31/resistance-is-futile/Anonymous
January 01, 2003
Mark now has a blog at http://blogs.technet.com/markrussinovich/ and starts out with a typically...Anonymous
January 01, 2003
PingBack from http://soci.hu/blog/index.php/2006/09/17/szeretnek-en-igy-debugolni-mint-ez-a-russinovich-gyerek/Anonymous
January 01, 2003
PingBack from http://stuart.amanzi.co.nz/2006/09/04/some-new-feeds-for-feeddemon/Anonymous
January 01, 2003
Mark's Blog: I’ve been extremely busy here at Microsoft and so haven’t had time to blog until now, butAnonymous
January 01, 2003
PingBack from http://www.entelliblog.com/?p=10537Anonymous
January 01, 2003
I was in Barcelona a couple of weeks ago speaking at Microsoft’s TechEd/ITForum conference, where I deliveredAnonymous
January 01, 2003
PingBack from http://www.entelliblog.com/?p=4966Anonymous
January 01, 2003
PingBack from http://www.entelliblog.com/?p=7268Anonymous
January 01, 2003
PingBack from http://sharpreflections.wordpress.com/2006/08/31/the-kernelbreaking-work-of-a-staggering-genius/Anonymous
January 01, 2003
原文 作者 Mark Russinovich 翻译 作者 盆盆 [ITECN站长] ITECN博客 http://blogs.itecn.net/blogs/ 盆盆 导读 本文由 Mark RussinovichAnonymous
January 01, 2003
PingBack from http://www.entelliblog.com/?p=11992Anonymous
January 01, 2003
Я тут ранее упоминал про идентификацию проблем и оптимизацию загрузки Word'а http://ivbeg.livejournal.com/30511.htmlAnonymous
August 31, 2006
As usually great comments, but something's missing, don't know what :(Anonymous
August 31, 2006
Every time I've had these kind of delays, I just shut off Defender and they go away. Not quite as scientific, but there you go!Anonymous
August 31, 2006
Although I understood about 1% of that, it's
good to have you back!!! Microsoft has a 'new
foundation' with you two guys there!!!Anonymous
August 31, 2006
The comment has been removedAnonymous
August 31, 2006
The comment has been removedAnonymous
August 31, 2006
nice article to read :)Anonymous
August 31, 2006
Another nice detective story.
Thanks Mark.Anonymous
August 31, 2006
Great article.
A small comment on the blog itself though.
Why are all the images .aspx pages (rather than .gif, .jpg etc.)? It prevents my RSS/Atom -> Outlook converter from getting them ;-(Anonymous
August 31, 2006
What's that cmd.exe font?Anonymous
August 31, 2006
Mark, I am frequently humbled by your blog posts. I'm fairly good with Windows and I have a basic idea of what makes it work. However, my knowledge doesn't even compare to yours. Keep up the good work. MS is lucky to have you...whatever they are paying you, it isnt enough!
~Aaron~Anonymous
August 31, 2006
More great Windows sleuthing, Mark.Anonymous
August 31, 2006
I am not worthy!Anonymous
September 01, 2006
Interesting research Mark! I've always suspected that Windows Defender does a whole lot of extra procedure calls, but I have more problems with our CM software, that does similar actions.
I think this was one of the most technical blogs I've read so far - and by far the most easy to follow...so keep up the good work.
Will follow this blog with interest from here on!Anonymous
September 01, 2006
Thanks for the great post! It got me to do a bit of investigative work on my own. I've seen this on a few of other blogs on Technet too.
The first think I see when I view this page (from Firefox) is popup with the title "Website Certified by an Unknown Authority". Why would it involve SSL was my first thought, it turns out that all the pictures in the post are linked by https instead of http.
So how come these images are protected by ssl. I have two theories.
a) It's so people who are using Internet Explorer and are viewing the site https://blogs.technet.com/markrussinovich/ isn't bothered by the message "The page contains both secure and nonsecure items, do you want to display the nonsecure items?"
b) Mark was logged in to his blog authoring software through ssl and just copied the link to the image which in that state was an ssl link.
I'm favoring theory b, I don't think that many people are viewing the site through ssl.
But the mystery itself, why do I get this popup? The certificate for blogs.technet.com chains to a Root CA called "GTE CyberTrust Global Root".
I open up the certificates in Firefox and it does indeed have a trusted root certificate for GTE CyberTrust Global Root. Strange indeed, so the chaining process is failing for some reason. The AIA information seems to be in order too.
Before I fire up Wireshark to see what's happening I do a quick google search, and sure enough Larry beat me to it (http://blogs.msdn.com/larryosterman/archive/2004/06/04/148612.aspx) over two years ago. It turns out that Firefox doesn't follow the OID 1.3.6.1.5.5.7.48.2 when doing certificate chaining.
At the end of Larry's post he said "Now all someone has to do is to file bugs against Mozilla and OpenSSL to get them to fix their certificate validation logic" I guess nobody did, or they ignored it.
But I'm hoping Mark can verify if theory a or b was the correct one.Anonymous
September 01, 2006
For looking up error codes, have you checked out DavidChr's err.exe? It's available on the intranet on that one server that I won't name publicly that has all the tools. It's also available on the internet downloadable from Microsoft, albeit probably in a slightly older version than the one posted internally. I use it about as frequently as I use SysInternals tools. That's often. My only gripe is that it doesn't include CLR exception codes yet.Anonymous
September 01, 2006
I thought, they would be using Linux there at Microsoft ;-) So they use Windows themselves...
SCNRAnonymous
September 01, 2006
Good point, well made, Jack!!!
The RMC delays on my PC are at best, annoying.Anonymous
September 01, 2006
But why does the hook need to pass the SID anyway?Anonymous
September 01, 2006
Kudos, Mark with you to track down issues like this maybe Microsoft product quality can only go up :)
As far as the case of the certificate popup appearing in Firefox goes I saw Patricks comment and followed his trail to Larry's blog entry (http://blogs.msdn.com/larryosterman/archive/2004/06/04/148612.aspx). I went through the comments there and sure enough a bug was indeed filed in Bugzilla (https://bugzilla.mozilla.org/show_bug.cgi?id=245609). However, it seems there is some question regarding the RFC involved (http://www.ietf.org/rfc/rfc2459.txt) and its validity.Anonymous
September 01, 2006
Mark, so NetrLogonGetTrustRid on the client side (PE in this case) performs RPC to a function of the same name in LSASS.exe? If so, is this just a convention used by NETLOGON when doing RPC? Or is this how RPC works in general? (eg. client side stub RPC's to real function of the same name on server side).
thanks,
MarcAnonymous
September 01, 2006
Great work, any opinion on this bug as it pertains to the state of windows in general? What I mean is how do you feel about all this code injection surface area that you guys expose in autoruns?
It has been bugging me more and more how windows has so many places code can be placed to startup with the system which seems to have led to the spyware epidemic and now the anti-spyware causing thier own problems.
My point is it seems the solutions are poor bandaids (good autruns fighting bad autoruns causing more running process and recource usage over all), when this stuff should be baked into the design so that 1. There is a consistent well known place for any startup code to be placed (perhaps services and startup folder only). 2. If a program trys to install anything there the user is warned explicity. 3. If the user wants to remove the startup they only need look in 2 places(1 would be ideal) and simply disable or delete startup they don't like.
Now that definetly only helps part of the mainy issues involved in spyware, but its sorta like buffer overuns, autorunners are probably 90% of the problem, and windows defender is like trying to use some C library or code analyzer counter overruns when you should be making a new platform (.net) that inherintly doesnt have those issues.
Would nice to hear your perspective, as autoruns really brings it home when looking at all the obsucre places startup code can be placed and you guys made it.Anonymous
September 01, 2006
The comment has been removedAnonymous
September 01, 2006
Why has the old blog vanished? At least move the posts here...Anonymous
September 01, 2006
Excellent article, truly informative. Thanks for the great tools and information. Your knowledge has helped on countless occasions.Anonymous
September 01, 2006
It can be a help to get notified when you are logged on with cached credentials. It's a reminder that something is different about this session when problems present with consistency, but intermittently.
HKLM...Winlogon
"ReportControllerMissing"=dword:00000001
HKCU...Winlogon
"ReportDC"=dword:00000001
See
http://support.microsoft.com/kb/242536/Anonymous
September 01, 2006
Mark said: "The initial stack trace only went up as far as the NegotiateTransferSyntax frame, but there were obviously other frames that the symbol engine couldn’t determine. The stack display went further when I had hit the breakpoint I set in OpenLpcPort"
On my windbg when the stack frames are all there for the walking (ie, nobody used EBP for their own scratchpad reg) then whacking the windbg stack backtrace More button will walk further w/o having to unwind the nested calls first. Of course this was 64-bit Windows with which I'm far less familiar (what is that, rbp for amd64 / em64t ? or something completely unintelligible for Itanium ...).Anonymous
September 01, 2006
Just wanted add -
The only reason you didn't see MpShHook.dll on the first callstack (before the breakpoint is set) is because WinDBG only shows the first 20 "frames" by default. If you click the "more" button shown in the upper right of your screen shot you'll see more frames and would see the MpShHook. Breakpoints and registers sounds like more fun though! :-P
Power to the debugger!!Anonymous
September 01, 2006
Ah, thanks. I never realized that was what the "more" button was for. I haven't done much debugging where the stack was more than 20 frames deep or I needed to look up a stack that far.Anonymous
September 01, 2006
I had the same delay problems, but i don't have the windows defender installed. I removed the domain integration and everything runs fine now.Anonymous
September 02, 2006
Good article, but I'm afraid that domain issues when the laptop is not connected are far too common with Windows. We have a few sales guys that come in the office complaining about slow login times. And indeed, login times can take 2 to 3 minutes when not on the domain. Tried many suggestions on the Internet, but I just ended up having them hibernate the laptop to get around the slow login time.
I can even get it to the point where all the event logs are clean and pristine and the login time is still really long. Mind you this only happens on a handful of laptops, not all of them.Anonymous
September 02, 2006
Mark,
I love your articles! I'm learning so much! It's nice to see you posting again.
I ran into a similar issue with XP Home locking up on my Dad's pc as soon as he hit his desktop.
Using lessons (and your tools) I tracked the issue down to the wuauclt.exe process overloading his CPU.
I've got the troubleshooting experience blogged over at <a href="http://grandstreamdreams.blogspot.com/2006/09/thawing-xp-system.html">Thawing an XP System</a>
Any chance you could take a look into this issue down the road? Or maybe you've run into it already. I'm not on the elite level of process thread breakdown like you!
Cheers!
--ClausAnonymous
September 02, 2006
I have learned from your web site zalot. A process context switch involves extracting that address requested from the Kernel Environment protection block and dumping it into the cr3 register of the cpu. The 1024 system file directories that form the mazin system file director has to page from the physcial and the linear. Address space laid out and a thread created. Hoevever, with explorer, the extra attention from the cpu, and subsytem environment only slow that loader that has read the Ds:DX header file so the app, or whater the process ( and relational dll) must have the DLL export the functions that the executable imported. Because of OLE 2, macro and stdin recordings, further user input may divert kernel code. Perhaps the symbolic files that represent the process for pdb info could load into the WinDbg. Changes in functionlity may also involve those cpu instensive sniffers, newtork traversal tools. Or it had an updated hotfix and the function signature differs and therefore does not respond to the API calling it. The, perhaps USER API calls the User32.dll, but is rerouted and packaged as an LPC and routed to the csrss.exe for processing ( before the overhead issue).Anonymous
September 03, 2006
I too have noticed this issue when not connected to the domain. I am not using Defender here either and this is on XP Pro 32-bit.
I have never actually tried to pinpoint the problem because there are just too many things running related to A/V, VPN, etc, but as a general rule, at least in my situation, a reboot and login with cached credentials typically fixes it up for me.
This seems to happen more often when I'm logged into the domain at work, then hibernate, then power on and connect to my home network (from hibernate state).
Nice article though..easy to follow and understand. I firmly believe Microsoft will only benefit by having your skills supporting them behind the scenes.Anonymous
September 03, 2006
The comment has been removedAnonymous
September 03, 2006
Thank you Mark for this article, and any other you share with us too. Regards from Poland.
g-n-d.netAnonymous
September 04, 2006
Why does Windows Defender hook into ShellExecute to monitor starting processes, rather than CreateProcess?Anonymous
September 05, 2006
The comment has been removedAnonymous
September 06, 2006
@ nobody
"Why does Windows Defender hook into ShellExecute to monitor starting processes, rather than CreateProcess?"
My guess is because ShellExecute opens files whether they're executables, documents, or folders. Defender sees more by hooking ShellExecute than CreateProcess.Anonymous
September 06, 2006
I have seen this same process start delay on the Tablet PC edition of windows. killing windows defender resolves the performance problem for me.Anonymous
September 10, 2006
i see the same thing on my laptop too
XP PRo 32, NOD32 and SpySweeper loaded.
when not on the DOMAIN, i need to delete any unconnected resources too.
I use batch files for thatAnonymous
September 11, 2006
Anyone got any idea why the RSS for this page isn't recognised by Bloglines. I can't get it to subscribe to this blog.Anonymous
September 12, 2006
Great post!!! The CEO of my company has called me at least once a week for his slow start up for two years! He does not run Defender though.
Any chances of you running with this further and finding other causes in XP?
What about a patch to fix theis at some point?
Thanks,
Dalton Williams
EVP & CIO WestStar BankAnonymous
September 12, 2006
Just thought it was puzzeling that Micrsoft would let you join your personal laptop to their internal network, or maybe they don't know. Nice post! I think I'll be reading your BLOG more often.
Take care,
ZachAnonymous
September 17, 2006
another great post Mark!
hugs from BrazilAnonymous
September 18, 2006
Great job, but :
- I use XP Pro 32 bits
- I'm not connected to any domain (standalone laptop)
- I don't use Defender
and I frequently experience such long and boring delays when I start processes.
So, where is the problem ?Anonymous
September 18, 2006
There is a very interesting task that only Mark could implement: to undercover a delays nature by Windows booting. I am afraid it would need a new early activating tool to bring a light splash on the boot sceneAnonymous
September 19, 2006
The problem exists since a very long time, prior Windows Defender. I experienced it with Win NT 4 and Cygwin utilities few years ago. Some programs were performing pointless inits, which were causing long delay when disconnected from the domain.Anonymous
September 20, 2006
Hmm, I wonder if everyone is comfortable with Mark revealing internal MS's domain SID...Anonymous
September 20, 2006
Drew, are you referring to Microsoft Exchange Server Error Code Look-up?
http://www.microsoft.com/downloads/details.aspx?familyid=be596899-7bb8-4208-b7fc-09e02a13696c&displaylang=en
If yes, it's v06.05.7226 (5/24/2004) so it's probably an old version. Could you by chance ask somebody to update this download with latest version?Anonymous
September 21, 2006
I usually DON'T JOIN any CORPORATE Domain.
I'll just login only to change password...
And everything runs fineAnonymous
September 21, 2006
The workaround I have our laptop users use is, do not plug in your laptop / turn on wireless, until you see the login screen when not at work (connected to the domain).Anonymous
October 17, 2006
Good work, Mark, and great explanation. This accounts for one of the issues I have had recently. However, as quite a few folks have noted here, there are issues that occur in 32-bit XP and 2003, (more so in XP), without the use of Defender, regardless of domain connectivity/status. Looks like I'm going to have to do the work and uncover them, because they are annoying...Anonymous
October 30, 2006
we need and appreciate people like youAnonymous
November 14, 2006
It's no wonder we've got more cycles then ever before and we're still waiting for our computers. Blooooaaaaat. What ever happened to K.I.S.S.? Don't forget this all started to run a text editor! Geewiz.Anonymous
November 16, 2006
I have a Thinkpad laptop with Windows 2000. When I am connected to my employer's network, then standby and go home, when I start working again I notice a long (30-60 second) delay for applications to start. Recently I discovered that if I unmap all network drives when I am at home, then my applications start almost immediately. Not sure if this is related to your problem.Anonymous
February 14, 2007
FYI, the reason this only affects Windows 2003 and not Windows XP is to do with the RPC_SECURITY_QOS_V3 structure introduced with Windows 2003 that adds a Sid field: http://msdn2.microsoft.com/en-us/library/aa378649.aspxAnonymous
February 15, 2007
The comment has been removedAnonymous
March 31, 2007
Way to go, Mark! This was a great read.Anonymous
April 24, 2007
Thanks for the great article! Too bad they appear very rarely though..Anonymous
May 15, 2007
It's extremely interesting, educational & impressive for me (without programming background) to read how Mark gets to the bottom of the inner workings and shortcomings of Windows. I think Mark explains some complex points very effectively (even I understand a lot of it). A beginners guide to the basics, if there are any basics!, of the Windows kernel would be great from Marks perspective. I hope working at Microsoft won't stop Mark from discussing the bugs in these OS's to the general public. Great articles, very dedicated.Anonymous
July 23, 2007
The comment has been removedAnonymous
June 06, 2008
Hi Mark! I tried to reproduce your debugging steps on my own laptop, but I stumbled on a problem early on. You said that you started by attaching Windbg to PROCESS EXPLORER and then you launched Notepad with PROCESS EXPLORER's Run dialog. If I attach Windbg to PROCESS EXPLORER, the PE window is frozen and I have no access to its file/run dialog. On the other hand if I attach Windbg to WINDOWS EXPLORER, PE window is not frozen and I can open notepad from its file/run dialog. Also the PID number (3460) which appears in you first Windbg's Calls window picture makes me think that Windbg was attached to WINDOWS EXPLORER rather than to PROCESS EXPLORER. Could you please confirm which process Windbg was attached to in your first debugging step? ThanksAnonymous
November 23, 2008
Hi, I faced the same delays in Windows XP. It takes 5-6 seconds to open any app from run dialog. Does this explanation fits there too? Thanks