Modeling servers for resource consumption

In the past few years, I have helped various groups deal with server problems related to resource usage. A typical scenario goes like this: “We have a server we are about to release. It does great on our internal stress tests (generally a server put under artificial load by multiple client machines), but when we put it online, it runs great for a while and then crashes or stops working because it runs out of resource X,” where X is memory, database connections, or whatever finite resource you can imagine. In general these folks have been very good at making their server CPU efficient: at the specified maximum load, the CPU usage on the machine stays under, say, 80%. Indeed, when things start to go bad, the CPU is far from pegged.

Let’s take the specific example of a CLR-based POP3 mail server. The team did a great job of making it efficient and simple. On a 4-CPU machine, all CPUs showed the same 75% load at the specified maximum artificial load, and the memory usage was about 100MB (the machine had 512MB of memory and no swap file). Yet when deployed online, this server would last anywhere between a few days and two weeks before crashing with an out-of-memory exception.

Looking at the full memory dump collected at the time of the exception, we could see that the heap usage was very high, with a lot of very large byte arrays. We traced them back to the actual contents of the mail messages retrieved from the email store and on their way to the user’s mail client. These messages were a lot larger than the average mail payload (less than 20KB), and there were a surprising number of them, enough to drive the server process out of memory. Actual request measurements told us that very few mails would be large, yet when the server crashed, we found a lot of them. This was the clue to understanding the problem: small mail messages were sent quickly, and all of the memory allocated by those requests was quickly reclaimed by the garbage collector. In-house stress tests showed this worked well. The requests for large pieces of mail were the problem: a few of them were sent quickly, but some of them went to dial-up users, so the time for the whole request to complete was very large compared to the average request. Soon these lingering requests accumulated and the memory got exhausted.

I asked the server team to show me their load specifications. They counted an average request size of 20KB (this is for all of the objects in use during a request) and an average of 4000 requests in flight, so the heap size should be around 80MB. They also told me that the maximum payload size was 4MB (I think I remember that it was limited by the server, which wouldn’t accept single email pieces larger than that). A quick calculation shows that if the server was unlucky enough to serve only the largest email pieces, the heap would have to be 16GB to accommodate 4000 requests in flight. Clearly that’s not possible on the CLR platform for Win32. Therefore it was possible to crash the server simply by requesting enough large pieces of mail: by accepting and starting to process all of the incoming requests, the server had overcommitted and died. This workload is hard to replicate in house because labs are generally not equipped with slow communication lines such as dial-ups.
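Written out, the back-of-envelope arithmetic from that load specification is simply the following (a quick sketch using only the figures quoted above; the variable names are mine):

```csharp
using System;

// Back-of-envelope check of the load specification, using the figures above.
const long requestsInFlight   = 4000;                // average requests in flight
const long averageRequestSize = 20L * 1024;          // ~20KB of objects per average request
const long maximumRequestSize = 4L * 1024 * 1024;    // 4MB cap on a single mail piece

long expectedHeap  = requestsInFlight * averageRequestSize;  // ~80MB: fits easily in 512MB
long worstCaseHeap = requestsInFlight * maximumRequestSize;  // ~16GB: hopeless in a 32-bit process

Console.WriteLine($"expected ~{expectedHeap / (1024.0 * 1024):F0} MB, " +
                  $"worst case ~{worstCaseHeap / (1024.0 * 1024 * 1024):F1} GB");
```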

Writing a CLR-based simulator for servers

It turns out we can model these resource usage variations and simulate them fairly simply:

At heart, a server takes requests, allocates some memory, ties up some resources, sends a response back, and flushes the request. The lifetime of the requests, their resource usage, and the incoming rate are what drive the server into an overcommitted situation.
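To make that concrete, here is one possible shape for the simulated request object; the type and field names are purely illustrative, not taken from any real server code:

```csharp
// Illustrative only: one possible shape for a simulated request.
// The payload stands in for all of the memory rooted by the request until its
// response has been flushed; other finite resources (database connections,
// handles, ...) could be modeled as additional fields.
class SimulatedRequest
{
    public byte[] Payload;       // memory the request keeps alive while it is in flight
    public int TicksRemaining;   // ticks left before the response is fully written out
}
```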

We can model this simply by allocating objects that simulate the requests and storing them in a queue at every “tick” of the simulator. This simulates the incoming requests, and the size of the queue represents the number of requests in flight in the server. Now the requests need to be processed and a response needs to be written back. To simulate the processing, you allocate some unrooted (temporary) objects and some objects rooted in the request object; they represent the memory consumption of the request. You can randomly vary their sizes according to the profile of the server load you are simulating. To start simply, you can do this allocation in the same step as when you queue up the request. You can also allocate whatever other resources you need to model (database connections, etc.). A more sophisticated simulation would assign steps or states to the request objects and have them go through a series of steps or state transitions, each with some processing and resource consumption, until they reach the final state: the response has been written out. To simulate the final state, you remove the requests in semi-random order. To simulate the POP3 server, the requests with the smallest simulated responses would be removed first, and the ones with the largest responses would stay in the queue for more ticks. This super simple model will reproduce the instability of servers without a throttling mechanism, and it is helpful in evaluating real throttling mechanisms in a very simple environment. In another post I will talk about some server throttling mechanisms.
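Here is a minimal sketch of such a simulator, building on the SimulatedRequest type above. The arrival rate, the 1% share of large responses, and the 64KB-per-tick drain rate for slow clients are assumptions I made up to mimic the POP3 story; a real simulation would plug in the measured profile of the server being modeled:

```csharp
using System;
using System.Collections.Generic;

class ServerSimulator
{
    static readonly Random Rng = new Random();
    static readonly Queue<SimulatedRequest> InFlight = new Queue<SimulatedRequest>();

    static void Main()
    {
        for (int tick = 0; tick < 10_000; tick++)
        {
            // Incoming requests: a handful of new requests arrive at every tick.
            for (int i = 0; i < 10; i++)
            {
                int responseSize = NextResponseSize();

                // Unrooted (temporary) allocation: garbage produced while processing
                // the request; it becomes collectible immediately.
                _ = new byte[responseSize / 4];

                // Rooted allocation: memory held by the request until its response
                // has been flushed. Large responses linger, as if drained by a slow
                // (dial-up) client at roughly 64KB per tick.
                InFlight.Enqueue(new SimulatedRequest
                {
                    Payload = new byte[responseSize],
                    TicksRemaining = 1 + responseSize / (64 * 1024)
                });
            }

            // Retire requests whose response has been fully written out; the small
            // ones leave first, the large ones stay in the queue for more ticks.
            int count = InFlight.Count;
            for (int i = 0; i < count; i++)
            {
                var request = InFlight.Dequeue();
                if (--request.TicksRemaining > 0)
                    InFlight.Enqueue(request);   // still in flight: its memory stays rooted
            }

            if (tick % 100 == 0)
                Console.WriteLine($"tick {tick}: {InFlight.Count} in flight, " +
                                  $"{GC.GetTotalMemory(false) / (1024 * 1024)} MB on the heap");
        }
    }

    // Mostly small responses (a few KB to a few tens of KB), with an occasional
    // large one of up to 4MB, loosely mirroring the POP3 workload above.
    static int NextResponseSize()
    {
        return Rng.NextDouble() < 0.01
            ? Rng.Next(1 * 1024 * 1024, 4 * 1024 * 1024)   // rare large mail piece
            : Rng.Next(4 * 1024, 40 * 1024);               // typical small mail
    }
}
```

With the numbers I picked, the simulation settles into a modest steady state; raising the fraction of large responses or slowing the simulated drain rate pushes the number of lingering requests, and the rooted memory they hold, up until a 32-bit process would run out, which is the overcommit behavior described above.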