Service Application Load Balancing Discoveries

07/15/2012

Most SharePoint administrators know that Service Applications, for example Managed Metadata Service and Business Connectivity Services, have built-in load balancing. Numerous blogs explain how service application load balancing works. An excellent, in-depth post is “How I Learned to Stop Worrying and Love the SharePoint Topology Service,” followed by “What is the Service Application Load Balancer?”

Built-in load balancing simplifies deployment and administration, but it has user impacts that might not be obvious. We discovered some of these impacts recently when a customer asked us to test Service Application load balancing in their environment. We had to test both shared services (services published from a publishing farm to a trusted consuming farm) and local services.

Discovery 1—Service Application Status Doesn’t Change

The Status column in Central Administration > Manage Service Applications doesn’t change when all instances of a service application are stopped. We tested stopping all instances of Managed Metadata, Business Connectivity, and PowerPoint. In each case, Central Administration continued to show the status as Started even though no service instances were running.

Discovery 2—Consuming Farm May Not Know of Failures for Up to 15 Minutes

The blog post referenced above, “What is the Service Application Load Balancer?”, explains that a publishing farm checks the availability of a published service once every 15 minutes by running the Application Addresses Refresh job. This leaves a window of up to 15 minutes during which users in the consuming farm might see errors, for example on the Managed Metadata Service management page. This can confuse the farm administrator because, as Discovery 1 shows, Central Administration still reports the status as Started.

More insidious is that other user-facing pages depend on the availability of the Managed Metadata Service to render correctly. An example is the User Profile update page. In our testing, the user was unable to update Ask Me About, and the error text made no direct mention of a Managed Metadata Service problem, even though unavailability of the Managed Metadata Service was the root cause of the error.

The point to take away is this: if a service in a publishing farm fails (that is, all of its service instances have stopped), start one or more instances, then immediately go to Central Administration > Monitoring > Review Timer Job Definitions in the publishing farm, click the Application Addresses Refresh job, and click the Run Now button to make consuming farms aware of the newly available service endpoints.
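The 15-minute window can be made concrete with a little arithmetic. The following is a minimal Python sketch, not SharePoint code. It assumes the refresh job fires on a fixed schedule at minutes 0, 15, 30, and so on, and the function name `minutes_until_noticed` is hypothetical.

```python
REFRESH_INTERVAL_MIN = 15  # interval of the Application Addresses Refresh job, per the text


def minutes_until_noticed(failure_minute: float,
                          interval: float = REFRESH_INTERVAL_MIN) -> float:
    """Return the minutes between a failure and the next scheduled refresh.

    Assumption for illustration: refreshes fire at minute 0, interval,
    2 * interval, and so on. A failure that coincides with a refresh is
    treated as picked up by that refresh.
    """
    elapsed_in_cycle = failure_minute % interval
    if elapsed_in_cycle == 0:
        return 0.0
    return float(interval - elapsed_in_cycle)
```

Under these assumptions, a failure one minute after a refresh goes unnoticed for 14 minutes, which is why running the job manually after a recovery shortens the time consuming farms spend with stale endpoint information.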

Discovery 3—Excel Calculation Service May Not Use Available Service Instances

So far as I know, Excel Calculation Service is the only service application with multiple load balancing algorithms. The load balancing choices are:

Workbook URL
Round Robin with Health Check
Local (only if the service is running on the WFE)

The default is Workbook URL, and this default has an interesting implication. Our testing showed that a browser with a spreadsheet open before the service instance failed continues to try to use the failed instance, even though functioning service instances are available on other application servers. Apparently the Workbook URL option creates a “sticky session” with the non-functioning application server. Users with the workbook already open get errors, but subsequent users of the same workbook are connected to a functioning service instance.

This raises one of those consulting questions: what is the best load balancing algorithm? It all depends. Workbook URL load balancing improves caching, responsiveness, and resource usage, but in this admittedly rare case it also creates an unintentional single point of failure.
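The general difference between URL affinity and round robin with a health check can be sketched in Python. This is an illustration of the two ideas only, not how Excel Calculation Service is actually implemented. The server names, the CRC32 hash of the URL, and the `healthy` set are all assumptions made for the example.

```python
import itertools
import zlib

SERVERS = ["APP1", "APP2", "APP3"]  # hypothetical application servers


def pick_by_workbook_url(url: str, servers=SERVERS) -> str:
    """Affinity routing: the same workbook URL always maps to the same server.

    Good for caching, but this simple version ignores server health.
    """
    return servers[zlib.crc32(url.encode("utf-8")) % len(servers)]


def make_round_robin_with_health_check(servers=SERVERS):
    """Return a picker that rotates through servers and skips unhealthy ones."""
    rotation = itertools.cycle(servers)

    def pick(healthy: set) -> str:
        # Try each server at most once per call before giving up.
        for _ in range(len(servers)):
            candidate = next(rotation)
            if candidate in healthy:
                return candidate
        raise RuntimeError("no healthy service instance available")

    return pick
```

In the affinity function, nothing redirects a request away from a stopped server, which mirrors the sticky behavior we saw for workbooks that were already open. In our testing, new users of the same workbook did reach a functioning instance, so the real algorithm evidently does more than this plain hash.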

Discovery 4—Cached Word and PowerPoint Documents Can Mask Service Failure

While this discovery is not a problem itself, it can mask a service application failure.

In our testing, we “viewed in browser” several PowerPoint files. As you may know, Office Web Apps caches rendered documents to improve performance on subsequent requests. These documents continued to render in the browser even after all service instances were stopped; however, uncached documents returned an error.

This inconsistent behavior of some documents rendering and some not could lead to end user confusion, and even farm administrator confusion, until the root cause (all service instances stopped) is identified.
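The masking effect is a general property of putting a cache in front of a service. Here is a minimal Python sketch of it. The class names `RenderService` and `CachedViewer` are hypothetical stand-ins and do not correspond to any Office Web Apps component.

```python
class RenderService:
    """Stand-in for a rendering service whose instances may all be stopped."""

    def __init__(self) -> None:
        self.running = True

    def render(self, document: str) -> str:
        if not self.running:
            raise RuntimeError("no service instance available")
        return f"rendered:{document}"


class CachedViewer:
    """Serves cached renderings first and calls the service only on a miss."""

    def __init__(self, service: RenderService) -> None:
        self.service = service
        self.cache: dict = {}

    def view(self, document: str) -> str:
        if document in self.cache:
            # Cache hit: the service is never contacted, so an outage is invisible.
            return self.cache[document]
        rendered = self.service.render(document)  # raises if the service is down
        self.cache[document] = rendered
        return rendered
```

If a document is viewed once and the service is then stopped, `view` still returns the cached rendering for that document and raises an error only for documents that were never cached, which matches the mixed results described above.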

Discovery 5—User Profile Service Has Various Impacts

As most administrators know, the user profile service application allows consuming farms to make direct SQL calls to the user profile database in the publishing farm. As a consequence, only parts of the profile service application depend on functioning service instances in the publishing farm. The user experience is therefore inconsistent: some features continue to work, some features fail silently, and some features display errors. Here is a quick list of what we saw when going to the My Site of a user who had previously created their My Site, added colleagues, and had site memberships.

My Site home page renders (https://my.contoso.com/default.aspx), with these consequences:
- My Colleagues list appears, but clicking a colleague name results in an error
- My Interests and Newsfeed Settings also result in errors
- Organization Browser fails to render but doesn’t display an error either; that is, it fails silently
My Content displays the personal site home page (https://my.contoso.com/personal/user-name/default.aspx)
My Profile (https://my.contoso.com/person.aspx) results in an error
The enterprise search center page (https://search.contoso.com/pages/default.aspx) results in an error
A team site page (https://teams.contoso.com/default.aspx) renders, but any attempt to click on a user name results in an error
