Follow-up and solution to the serious performance issue with multiple audiences

A little while ago, I worked on a performance problem for a WCM portal.  We reviewed a lot of parameters, page output caching, object caching, SQL queries, web parts used and their parameters, code review for the custom controls and web parts, etc.  I described part of the issue in my previous post at https://blogs.msdn.com/maximeb/archive/2009/02/03/multiple-audiences-on-a-targeted-content-may-lead-to-page-output-cache-flush.aspx.

The context was a WCM Intranet where 3 WFEs handle requests for roughly 30,000 users.  The WFEs were either idle or at 100% CPU; there was no in-between.  The content is personalized, targeted, and profiled, and users are authenticated using NTLM.

We found several small things: a SQL index was missing on the user profiles table after installing SP1; we tuned Object Caching to keep objects for longer and increased its RAM allowance; and we tuned Page Output Caching by removing “check for changes” and setting a 2-hour caching period.

Unfortunately, it “felt” as if Page Output Caching stopped working at random intervals.  Note that debugging Page Output Caching is tedious, especially if you have multiple WFEs responding.  The only built-in tool is the Debugging Information option, whose output appears at the end of the page when you view the HTML source.  It states which caching profile was used and at what time the page was cached, or, if caching wasn’t used, why not.

With multiple load-balanced WFEs (and no affinity), you will see different cache times because successive requests can hit different servers.  To tell them apart, we added the server name in an HTML comment through the master page.

With these two pieces of information available, I created a Windows tool (too customer-specific to share yet) that issued HttpRequests as multiple users against various pages in a site.  The tool read the server name and caching information from each response.  At first, in a testing environment, we could see that the cache was working for 2 hours, as it should.  However, when we copied content from production to run the same test, we saw a pattern: the cache was sometimes invalidated for all users.
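To make the idea concrete, here is a minimal sketch of the parsing side of such a tool. It is not the tool we used. Both comment formats in the patterns below, as well as the names parse_cache_info and count_cache_renders, are assumptions for illustration, so adjust them to whatever your master page and the cache debugging output actually emit. Issuing the authenticated requests is left out.

```python
import re
from collections import defaultdict

# ASSUMPTION: both comment formats are illustrative, not the documented output.
SERVER_RE = re.compile(r"<!--\s*Server:\s*(?P<server>[\w.-]+)\s*-->")
CACHE_RE = re.compile(
    r"<!--\s*Rendered using cache profile:\s*(?P<profile>.*?)"
    r"\s+at:\s*(?P<stamp>\S+)\s*-->"
)


def parse_cache_info(html):
    """Return (server, profile, cache timestamp); missing parts are None."""
    server = SERVER_RE.search(html)
    cache = CACHE_RE.search(html)
    return (
        server.group("server") if server else None,
        cache.group("profile") if cache else None,
        cache.group("stamp") if cache else None,
    )


def count_cache_renders(samples):
    """Given (server, stamp) pairs in request order, count per server how
    many distinct consecutive cache timestamps were seen, i.e. how often
    the page was rendered into the cache again."""
    last_seen = {}
    renders = defaultdict(int)
    for server, stamp in samples:
        if last_seen.get(server) != stamp:
            renders[server] += 1
            last_seen[server] = stamp
    return dict(renders)
```

Within a single 2-hour window, a healthy cache should show one render per server and display result; a count that keeps climbing on the same server is the pattern we were chasing.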

It took us a while, but we got a lucky hit: the main difference was that one of the Top Navigation items was targeted to 20+ audiences (while our testing environment only had a single audience for that link).  Let’s say you have the following scenario, where the cache profile used does have the ‘Vary by user rights’ parameter checked:

A navigation link (NOTE: you can target a navigation link, Web Part, or document/page;  the concept applies to all of these) is targeted to Audience1 and Audience2.

  • Audience1
    • User1
  • Audience2
    • User2
  • User3 is not in any audience

 

The expected display results are the following:

  • User1 and User2 will see the targeted navigation link and the non-targeted navigation links
  • User3 will only see the non-targeted navigation

 

This means that the page should only have 2 page output cache instances, one for each of the two display results.  (Note: I simplified the scenario; with 2 audiences, it’s possible that the same page also contains content targeted to only one of them, which would in turn create a 3rd display set.)
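The expected grouping can be sketched in a few lines. This only models the scenario above; the function name display_bucket and the bucket labels are made up for illustration and say nothing about how SharePoint computes its cache keys.

```python
def display_bucket(user_audiences, item_audiences):
    """A targeted item is shown if the user is in ANY of its audiences."""
    if user_audiences & item_audiences:
        return "targeted link shown"
    return "targeted link hidden"


# The scenario from the text: one link targeted to two audiences.
link_audiences = {"Audience1", "Audience2"}
users = {
    "User1": {"Audience1"},
    "User2": {"Audience2"},
    "User3": set(),  # not in any audience
}

# Group users by what they should see; each group needs one cached copy.
buckets = {}
for user, audiences in users.items():
    bucket = display_bucket(audiences, link_audiences)
    buckets.setdefault(bucket, []).append(user)
```

User1 and User2 end up in the same group even though they belong to different audiences, which is why two cached copies should be enough.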

Unfortunately, the following was happening (before the cache expiration of course):

  • User1 hits the page, it will create the page in cache (ok)
  • User2 hits the page, it will reuse the page in cache (ok)
  • User3 hits the page, it will create the page in cache (ok)
  • User1 goes back to the page, the cache is recreated again (wrong)

This means that once a page is accessed and cached, only users in the same “display result set” (or bucket, as coined by the support team) will use the cache.  Any access to the page from a user in another bucket will flush ALL caches for that page.
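A toy model makes the difference between the expected and the observed behavior easier to see. This is only a simulation of the symptom described above, not SharePoint’s actual implementation; the PageCache class, its flush_on_bucket_change flag, and the replay helper are all hypothetical.

```python
class PageCache:
    """Toy output cache for ONE page, keyed by display bucket."""

    def __init__(self, flush_on_bucket_change):
        # True models the observed (wrong) behavior, False the expected one.
        self.flush_on_bucket_change = flush_on_bucket_change
        self.entries = set()  # buckets that currently have a cached copy
        self.renders = 0      # how many times the page had to be rendered

    def request(self, bucket):
        if bucket in self.entries:
            return "hit"
        if self.flush_on_bucket_change:
            # A miss from another bucket throws away every cached copy.
            self.entries.clear()
        self.entries.add(bucket)
        self.renders += 1
        return "render"


def replay(cache, buckets):
    """Send the requests in order and return the outcome of each one."""
    return [cache.request(bucket) for bucket in buckets]


# User1, User2, User3, then User1 again, as in the list above.
sequence = ["shown", "shown", "hidden", "shown"]
expected = replay(PageCache(flush_on_bucket_change=False), sequence)
observed = replay(PageCache(flush_on_bucket_change=True), sequence)
```

In the expected model the last request is a hit; in the observed model it is a render, and with two busy buckets alternating, nearly every request becomes a render.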

 

If you only have 2 display buckets, and one of them contains only a few users, the impact may go unnoticed.  Once your buckets contain several hundred or several thousand users, the problem becomes more apparent.  Similarly, if the item with multiple audiences is on a single page in a sub-section, the problem may not be seen.

In our case, it was in the Top Navigation bar used by all pages in the site, making the Page Output Cache useless and in fact a burden.

In this scenario, it was a single link that was being made available to groups of users in blocks.  An application was being rolled out one geographical location at a time, and each time one of those locations received the application, the link was updated to also target the Active Directory users of that location.  Over time, the display buckets contained several thousand users.

What we noticed was that the issue with “display buckets and audiencing” only occurs when you have multiple audiences set to a single item.  If you only have a single audience, it works correctly. 

Our workaround was to create a MOSS group that contained the list of Active Directory groups and to target the content to that MOSS group.  Mind you, if you plan to have multiple targeted items with the same list of people, you should do this anyway, but in our case, it was the content team’s choice and it was only used in one place.

 

Fix coming in!

I’m also very happy to report that a hotfix was approved by the product group and is being implemented for an upcoming Cumulative Update (CU).