Overlapped Recycling And SharePoint: Why SharePoint Requires It

This is one of the topics I wanted to inject into the discussion because it is not covered in the generic overlapped recycling documentation. There are at least two very good reasons why I believe SharePoint requires the safety net provided by overlapped recycling.

Reason One: SharePoint Has No Throttle

This is causing us some major headaches because under the right circumstances it makes it look like SharePoint is leaking memory or caught in a 100% CPU loop. Really, all it’s doing is what you asked it to do: handle all incoming user requests. Every user request to a web application, SharePoint or otherwise, has a cost associated with it in terms of memory and CPU. As user load and features increase, the bill gets higher and higher, eventually causing the server to run at a deficit in one of these categories and fail.

To illustrate why administrators don't know when this is happening in SharePoint, let me ask a question: what is the error message that SharePoint throws to indicate user load has reached the maximum for the server? That’s right, there isn’t one. SharePoint will happily continue accepting user requests long after it has entered the danger zone with respect to resources. Without overlapped recycling configured, the server will eventually run out of resources and start to fail in odd ways. Here are some examples:

Problem: Running out of virtual address space

Common symptoms: OutOfMemory exceptions in the ULS logs or browser; operations that would normally complete fail with weird or nonsense-sounding errors. On 32-bit servers, this problem usually manifests when the Private Bytes counter is greater than 1000MB and the Virtual Bytes counter is over 1700MB.

Comments: This is why we recommend that you set the Maximum Used Memory value for overlapped recycling to 1000MB and the Maximum Virtual Bytes value to 1700MB (see the configuration sketch after this list). More conservative numbers that you will sometimes find mentioned are 800MB and 1500MB respectively. The numbers mentioned in what I write here and in our recent whitepaper are targeted at customers trying to get the most out of their 32-bit hardware. Your mileage may vary.

Likely remedy: Move to 64-bit. If you can’t do that, you will probably need to add more Web Front Ends, or possibly do both. There is a point of diminishing returns on all of these, and at some point you will simply have too much load and will need to consider splitting your users into multiple farms. Our current documentation states that you will get your best performance with 4-5 Web Front Ends per SQL server.

Problem: Running out of physical memory

Common symptoms: Users complain of poor performance; administrators notice high CPU utilization and disk activity.

Comments: The poor performance is caused by the fact that the server is spending all of its CPU time paging memory to disk rather than serving user requests. This is also what drives the high disk utilization.

Likely remedy: Add more physical memory. Once you have enough, you may start running into some of the other problems on this list, but at least this one is easily identifiable and resolvable.

Problem: Fragmentation of the virtual address space

Common symptoms: Very similar to running out of virtual address space: OutOfMemory exceptions in the ULS logs or browser; operations that would normally complete fail with weird or nonsense-sounding errors. This problem is easily identifiable by the fact that the Private Bytes and Virtual Bytes values for your worker process will be significantly less than the maximum for your architecture at the time of the error.

Comments: If you are experiencing these symptoms, you can usually validate that your problem is fragmentation with a very simple test: recycle the worker process. If all the symptoms go away and then reappear later with similarly low values in Private Bytes and Virtual Bytes, you are probably suffering from fragmentation and should open a case with Microsoft Support to discover the root cause.

Likely remedy: Because you usually will not hit the Maximum Used Memory limit or the Maximum Virtual Bytes limit before failing due to fragmentation, you should be able to get immediate relief by implementing scheduled, time-based, or request-based recycles of your worker process. This gives the new worker process a fresh address space and relieves the fragmentation temporarily. Long term, you will have to discover the cause of the fragmentation and fix it.
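If you prefer to script these limits rather than set them in IIS Manager, here is a minimal C# sketch that writes them to the IIS 6 metabase. I am assuming the usual mapping of the UI fields to the PeriodicRestartPrivateMemory and PeriodicRestartMemory properties (both expressed in kilobytes) and PeriodicRestartTime (in minutes); the application pool name is a placeholder you will need to replace.

```
using System;
using System.DirectoryServices;

class ConfigureRecycling
{
    static void Main()
    {
        // "SharePointAppPool" is a placeholder -- substitute the name of
        // the application pool that hosts your SharePoint web application.
        using (DirectoryEntry pool = new DirectoryEntry(
            "IIS://localhost/W3SVC/AppPools/SharePointAppPool"))
        {
            // Both metabase memory limits are expressed in kilobytes.
            // Maximum Used Memory (private bytes): 1000MB.
            pool.Properties["PeriodicRestartPrivateMemory"].Value = 1000 * 1024;

            // Maximum Virtual Bytes (virtual memory): 1700MB.
            pool.Properties["PeriodicRestartMemory"].Value = 1700 * 1024;

            // Optional: a scheduled recycle (in minutes) as temporary relief
            // for address-space fragmentation, as described above.
            pool.Properties["PeriodicRestartTime"].Value = 1440; // once a day

            pool.CommitChanges();
        }
    }
}
```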

Unfortunately, people not familiar with how a SharePoint Web Front End behaves when it is overloaded tend to think it’s leaking memory. This is not unreasonable when you think about how a memory leak in IIS typically manifests itself: a steady increase in consumption of committed memory over time. Committed memory in a process is represented in Performance Monitor by the Process\Private Bytes counter.
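If you want to keep an eye on these counters outside of Performance Monitor, here is a minimal C# sketch that samples Private Bytes and Virtual Bytes for a worker process. The instance name and the thresholds are assumptions based on the 32-bit guidance above.

```
using System;
using System.Diagnostics;
using System.Threading;

class WatchWorkerProcess
{
    static void Main()
    {
        // "w3wp" assumes a single worker process; with multiple application
        // pools the instances are named w3wp, w3wp#1, w3wp#2, and so on.
        using (var privateBytes = new PerformanceCounter("Process", "Private Bytes", "w3wp"))
        using (var virtualBytes = new PerformanceCounter("Process", "Virtual Bytes", "w3wp"))
        {
            while (true)
            {
                double priv = privateBytes.NextValue() / (1024.0 * 1024.0);
                double virt = virtualBytes.NextValue() / (1024.0 * 1024.0);
                Console.WriteLine("Private Bytes: {0:F0}MB  Virtual Bytes: {1:F0}MB", priv, virt);

                // Warn when the counters approach the 32-bit limits discussed above.
                if (priv > 1000 || virt > 1700)
                    Console.WriteLine("Worker process is in the danger zone.");

                Thread.Sleep(15000); // 15-second sample interval
            }
        }
    }
}
```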

Many people demonstrate that SharePoint is leaking by disabling all of the overlapped recycling settings and then using Performance Monitor to track the Private Bytes counter for the worker process until it crashes. They will then show a steady upward trend in Private Bytes up to the point where the server started to fail, declare that SharePoint has a memory leak, and demand that we fix it immediately.

A more complete analysis would have included a look at user load and SQL performance over that same time. In most cases, if they did that, they would have seen that the increase in Private Bytes was tightly coupled to the increase in user requests, or that the SQL server was at 100% CPU utilization or experiencing severe query blocking.

In some scenarios there is no evidence of increased user load because the administrator only discovered the problem after the load was already too high. The typical behavior in this case is that the worker process recycles or crashes, and as soon as the new worker process comes online, its memory footprint rockets up again and the cycle repeats until the server is taken offline or the users give up and go away. In these scenarios, the server will be OK until sometime the next day when the load hits a critical level again, and then it will start behaving this same way.

In the overloaded-server scenario, people also get confused when they disable the overlapped recycling settings and the process runs for another hour before it starts crashing in bizarre and often random ways. Their reaction is to assume that SharePoint is broken, that the crash is inevitable, and that overlapped recycling is just getting in the way of their troubleshooting. This could not be further from the truth.

In reality, ASP.NET uses the overlapped recycling settings to do some of its own performance tuning. By disabling these settings, they are actually making the problem worse. I’ll cover exactly what that ASP.NET tuning does in a later post.


Reason Two: All Applications Have Flaws, and SharePoint Is No Exception

Virtually all applications under a constant update cycle will at some point suffer from problems like memory leaks, fragmentation, and memory corruption. As developers, you generally guard against these things by implementing very strict change control processes, peer code review before check-in, and stringent patch testing. We do all of those things and more in SharePoint. So why does SharePoint suffer from some of these problems from time to time? Because we don't own all of the code that runs in our worker process.

To begin with, we leverage code from a variety of sources just to produce SharePoint. Here are a few of the bits of code we get from other sources:

·         Shared Office components

·         MDAC – Microsoft Data Access Components

·         .NET Framework

·         ASP.NET

·         SQL Server

·         Windows

·         Active Directory

I’m sure there are many more, but I think that makes my point. These components are written by hundreds of developers, each with their own design specs and expectations about how that code will be implemented. If they make a single mistake in that code, or we make a mistake in how we implemented it, it will eventually show up as a problem in our worker process. There is also one huge hole that allows us to suffer from these problems even when we don’t introduce them ourselves: extensibility.

SharePoint is designed as an extensible platform. The most common extensibility point that causes us to fall victim to these problems is custom web parts. You only need to read the content we have developed to help developers write better code against the SharePoint object model to know that it’s really easy to get it wrong… and they still get it wrong… a lot! Here is some light reading on the subject for your reference:

Best Practices: Using Disposable Windows SharePoint Services Objects

Best Practices: Common Coding Issues When Using the SharePoint Object Model

One of the common custom web part patterns that is prone to high memory consumption and memory leaks when using the SharePoint object model is navigation controls. Because these controls must iterate through the entire site structure to generate the navigation, one tiny coding mistake can be amplified hundreds or thousands of times, often resulting in huge numbers of leaked objects. Each SPSite or SPWeb object consumes roughly 2MB of memory, so you can see how this could very quickly take down your server.
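To make the pattern concrete, here is a minimal C# sketch of the disposal discipline the best-practices articles above describe for code that walks a site hierarchy. The site URL and the navigation logic are placeholders; the point is the Dispose calls.

```
using System;
using Microsoft.SharePoint;

class NavigationSketch
{
    static void WalkSiteHierarchy(string siteUrl)
    {
        using (SPSite site = new SPSite(siteUrl))   // we created it, we dispose it
        using (SPWeb rootWeb = site.OpenWeb())
        {
            foreach (SPWeb subWeb in rootWeb.Webs)
            {
                try
                {
                    Console.WriteLine(subWeb.Title); // build a navigation node here
                }
                finally
                {
                    // Each SPWeb returned by the Webs collection must be
                    // disposed explicitly, or roughly 2MB leaks per web --
                    // multiplied by every page request that renders the control.
                    subWeb.Dispose();
                }
            }
        }
        // Never dispose objects you did not create, such as
        // SPContext.Current.Web -- that object belongs to SharePoint.
    }
}
```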

Let’s not forget anti-virus and IFilter vendors. Their code can also contribute to bad behavior in our product, and there is almost nothing we can do to prevent it.

The point of all of this is that no matter how good a job the SharePoint development team does to avoid introducing these types of problems, they will probably creep into our code at some point. Even if we don't add them, someone else will.

A good analogy for my request that you run with overlapped recycling all the time is homeowners insurance. Do you go buy homeowners insurance when it looks like a storm is coming? No. You carry homeowners insurance all the time because you don’t want to be left holding the bag if a storm should unexpectedly come along and destroy your house. You should view overlapped recycling as a form of insurance against other people's coding mistakes and run it all the time, not just when it looks like there is bad code on the horizon.

Since we’re on the subject of bad web parts: if you suspect that you have a web part that is not following the best practices referenced above, a simple way to test that theory is to uninstall the web part and see whether your memory consumption goes down or the memory allocation errors go away. If this is not possible, you should open a service request with Microsoft Support for help troubleshooting the problem further. As I’ve said, these are not uncommon problems, and for the most part, if you tell the engineer you suspect this is the problem, they can usually confirm it with a memory dump of your process. This can also be accomplished with a code review of the web part, assuming you have the source code.

In closing, if you are attempting to troubleshoot a problem like those described in this post and you intend to engage Microsoft Support, you will be doing yourself a huge favor if you gather the data we will need before making that call. Here is a brief description of what you need to do:

1. Gather Performance Monitor data for one full day, starting before users begin to use the system. Recycle the worker process right after you start the Performance Monitor trace so we can see things from a clean start. Include all objects and all counters, using a 15-second interval, and capture data from all of the Web Front Ends and the SQL server simultaneously. You do not need to restart the SQL server.

2. Capture a user dump of the worker process when its Private Bytes consumption is above 900MB.

If you do not know how to accomplish these things, the support engineer will be able to assist you when you call Microsoft Support.