

Started means started, not completely ready.

Windows is extensible. Extensible means it has places built for external software writers to plug in and provide functionality that was not included by default. This makes Windows very powerful as a platform, but it comes at a price. In any plug-in model, you have the option to document the contract, or to document the contract and ENFORCE it. For the most part, Windows does not enforce these contracts.

Let’s walk through an example. Windows has the concept of services. Services are applications that run in the background automatically, even if no user is logged on. Services can have a few different startup options, but a common one is Automatic, which means the service is started early during system boot. Each of these services is asked to start, and as soon as it has started, it is supposed to report back that it has started. The contract for a service, including how long it has to go from “Asked to start” to “Started”, is described in Writing a ServiceMain Function (MSDN):

Excerpt from this page:

The sample initialization function, SvcInit, is a very simple example; it does not perform more complex initialization tasks such as creating additional threads. It creates an event that the service control handler can signal to indicate that the service should stop, then calls ReportSvcStatus to indicate that the service has entered the SERVICE_RUNNING state. At this point, the service has completed its initialization and is ready to accept controls. For best system performance, your application should enter the running state within 25-100 milliseconds

The key line is the last sentence of the excerpt. The contract, despite being delivered softly, is “Dear service, you have 25 to 100 milliseconds to get to the started state”. The problem is the service writer’s perception of what “Started” means. Our intent was that “Started” means you have crossed the start line. Many developers believe it means their service has reached a point where they feel it is FULLY operational. Why does this matter? Automatic services are started in a serialized manner. First, the Service Control Manager (SCM) reads out the list of services. It orders the list by each service’s LoadOrderGroup, then reorders each group based on dependencies. To see the outcome of this sorting, you can use LoadOrder from Sysinternals. So we have a list of services, divided into groups, that are reshuffled to take care of dependencies. The list ends up being, for the most part, sorted alphabetically by service name.
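To make the ordering concrete, here is a rough sketch in Python. It is only a toy model of the description above, not the SCM's actual algorithm: the service names, group names, and the `start_order` helper are all made up for illustration, and ties are broken alphabetically as an assumption.

```python
# Hypothetical service records: (name, load-order group, dependencies).
SERVICES = [
    ("DDD Service", "GroupTwo", ["CCC Service"]),
    ("CCC Service", "GroupTwo", []),
    ("BBB Service", "GroupOne", []),
    ("AAA Service", "GroupOne", ["BBB Service"]),
]
GROUP_ORDER = ["GroupOne", "GroupTwo"]  # assumed sequence of groups


def start_order(services, group_order):
    """Order services by group, then by dependencies inside each group."""
    ordered = []
    for group in group_order:
        # Services of this group that still need a slot in the list.
        pending = {name: set(deps) for name, grp, deps in services if grp == group}
        while pending:
            # A service is ready once everything it depends on is already placed.
            ready = sorted(n for n, deps in pending.items() if deps <= set(ordered))
            if not ready:
                raise ValueError(f"unsatisfiable dependency in group {group!r}")
            ordered.append(ready[0])  # alphabetical tie-break (assumption)
            del pending[ready[0]]
    return ordered


print(start_order(SERVICES, GROUP_ORDER))
```

In this toy data, "AAA Service" sorts first by name but depends on "BBB Service", so it is placed second. When no dependencies get in the way, the result is alphabetical within each group, which matches the "for the most part" above.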

If a service that is marked as automatic decides to take a long time to start, and while doing so continuously reports pending back to the Service Control Manager, then all the services that have not started yet have to wait. In addition, any request by anything already running to start a service is blocked too. So a service named “AAA Service” that is marked as automatic, begins its startup, and decides to do a lot of work, connect to something over the network, or make any other bad decision, all while still reporting pending back to SCM, has blocked the box from giving you a quick boot.
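A back-of-the-envelope model shows how much one slow service costs everyone behind it. This is a simplified sketch: the `boot_timeline` helper and the timings are invented for illustration, and the only thing it models is that services start one at a time.

```python
def boot_timeline(services):
    """Return {name: milliseconds after boot at which it reports started}.

    `services` is an ordered list of (name, milliseconds spent pending).
    Services start one at a time, so each waits for all those before it.
    """
    clock_ms = 0
    timeline = {}
    for name, pending_ms in services:
        clock_ms += pending_ms  # nothing else starts while this one is pending
        timeline[name] = clock_ms
    return timeline


# Hypothetical timings: every service honors the contract...
well_behaved = [("AAA Service", 50), ("BBB Service", 50), ("NLA", 50)]
# ...versus the first service doing two minutes of work before reporting started.
slow_first = [("AAA Service", 120_000), ("BBB Service", 50), ("NLA", 50)]

print(boot_timeline(well_behaved)["NLA"])  # 150
print(boot_timeline(slow_first)["NLA"])    # 120100
```

In the second run, the services behind "AAA Service" are perfectly well behaved, yet they report started two minutes late because they are stuck in line behind it.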

 

I have seen this as a frequent cause of hangs during boot. Here is a diagram from a case I worked where a customer hit this issue:

Issue Diagram

 

The bad behavior is aaaService’s violation of the contract. As the end user, you are stuck staring at a logon screen waiting to get to your desktop, because Winlogon, the application that collects your credentials, called a function that required the NLA service to start. Since SCM is blocked waiting for the “bad” service to report started, Winlogon’s request is blocked too.
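What does “crossing the start line” look like in practice? Here is a minimal sketch of the idea in Python, not real service code: the `Service` class, its state strings, and the `ready` event are all hypothetical. In an actual Windows service, the equivalent of flipping the state to running would be reporting SERVICE_RUNNING to the SCM, with the slow work handed to another thread.

```python
import threading
import time


class Service:
    """Toy service: reports running right away, finishes slow setup later."""

    def __init__(self, slow_init_seconds):
        self.state = "stopped"
        self.ready = threading.Event()  # set once the slow work is done
        self._slow_init_seconds = slow_init_seconds

    def start(self):
        self.state = "start_pending"
        # Hand the expensive work to a background thread instead of
        # doing it before reporting started.
        worker = threading.Thread(target=self._slow_init, daemon=True)
        worker.start()
        self.state = "running"  # crossed the start line

    def _slow_init(self):
        time.sleep(self._slow_init_seconds)  # stand-in for network or disk work
        self.ready.set()


svc = Service(slow_init_seconds=0.5)
svc.start()
print(svc.state)           # running, even though setup is still in progress
svc.ready.wait(timeout=5)  # callers that need full readiness wait here
```

The design choice is to keep two separate signals: “started” tells whoever launched the service that it can move on to the next one, while a separate readiness signal is for clients that truly need the service to be fully operational.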