SharePoint 2007 and TokenAndPermUserStore Issues

Article
07/12/2008

In 2007, I was working on a large SharePoint 2007 project and we discovered that after the system ran for 3-4 days (or 24 hours of simulated stress), performance would suddenly degrade significantly (i.e. minutes for pages to load, if they would load at all). However, when we looked at all of the traditional performance indicators (memory, CPU, network throughput) on the web front-ends, application servers and database server, everything appeared healthy.

We ultimately tracked the problem down to a SQL Trace Flag 4618 needed to regularly purge the TokenAndPermUserStore cache.

This post discusses the symptoms and ultimately simple solution to the problem.

Configuration

Hardware

4 web-front ends
Windows Server 2003 SP1 (32-bit)
2 process web garden
2 clustered database servers
SQL Server 2005 SP2 servers (64-bit)
"lots of memory"
Search was handled outside of SharePoint so there were no dedicated application servers

Usage Pattern

3 Site Collections
2,000 Publishing Sites
~18,000 Publishing Pages

Symptoms

Significant performance degradation after 3-4 days of normal use or 24 hours of simulated stress
Pages would take minutes to load or would simply time-out
Web front-ends
- Low CPU usage
- Normal memory usage
- Normal network throughput
Database servers
- Low CPU usage
- Normal memory usage
- Normal network throughput

In short, there were no obvious indications of a problem other than the slow response time...

Diagnostic Approach

On this project, I had the pleasure to work with a Microsoft Architect named Jay Gore. When Jay executed the following query against the SQL Server:

SELECT scheduler_id, current_tasks_count, runnable_tasks_count FROM sys.dm_os_schedulers WHERE scheduler_id < 255

The query returned results such as:

 scheduler_id     current_tasks_count       runnable_tasks_count
0                      19                                 16
1                      35                                 31
2                      36                                 31
3                      33                                 29
4                      33                                 29
5                      35                                 31
6                      33                                 30
7                      35                                 31

A runnable task count of greater than zero for any substantial length of time is a problem. The observed behavior was that the database was only at 25% CPU usage but had 10 to 25 tasks in runnable state per scheduler and it was spending all it’s time waiting on SOS_SCHEDULER_YIELDS with greater than 90% signal waits vs. resource waits.

On a hunch, Jay cleared the caches (dbcc freeproccache) and suddenly performance returned to normal.

After further diagnosis, Jay traced the problem to the TokenAndPermUserStore and he was able to consistently resolve the problem by issuing the following command:

dbcc freesystemcache ('TokenAndPermUserStore')

Root Cause

By design, if SQL Server 2005 has "plenty of memory", the cache of dynamically compiled queries will continue to grow and "old items" will not be automatically purged. Eventually, the cache can grow so large that it takes a substantial amount of time to locate the appropriate compiled query.

This behavior is documented in the following Microsoft Knowledge Base articles: KB933564 and KB927396.

Resolution

To resolve this issue, we applied the SQL Server Trace Flag 4618 to the startup parameters for SQL Server 2005. This flag forces "old items" to be purged from the cache immediately.

Partager via