Plan for availability (Search Server 2008)
Applies To: Microsoft Search Server 2008
Topic Last Modified: 2009-03-11
This article describes availability in general, the costs and challenges of providing availability in a SharePoint Products and Technologies environment, and the strategies and solutions that you can use in that environment. Read this article if your farm is running Microsoft Search Server 2008. You may want to download and print the Office SharePoint Server 2007 Availability model (https://go.microsoft.com/fwlink/?LinkId=122369) that accompanies this article. It provides a poster-sized summary of the content in this article.
What is availability?
Availability is the degree to which a SharePoint Products and Technologies environment is perceived by users to be available. To ensure availability means to ensure that a system is resilient — that is, that service-affecting incidents occur infrequently, and that timely and effective action is taken when they do. Availability strategies minimize the user perception of planned and unplanned downtime.
One of the most common measures of availability is percentage of uptime, that is, the percentage of time that a given system is active and working. It is often expressed as a number of nines. For example, a system with 99.999 percent uptime is said to have five nines of availability.
The following table correlates the number of nines to calendar time equivalents.
Acceptable uptime percentage | Downtime per day | Downtime per month | Downtime per year |
---|---|---|---|
95 | 72.00 minutes | 36 hours | 18.26 days |
99 | 14.40 minutes | 7 hours | 3.65 days |
99.9 | 86.40 seconds | 43 minutes | 8.77 hours |
99.99 | 8.64 seconds | 4 minutes | 52.60 minutes |
99.999 | 0.86 seconds | 26 seconds | 5.26 minutes |
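The values in the table follow directly from the uptime percentage. As a minimal sketch in Python, the allowed downtime for any target can be computed as shown here. The function name `downtime_budget` is hypothetical, and the 30-day month and 365.25-day year are assumptions, chosen because they reproduce the figures in the table.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget(uptime_percent: float) -> dict:
    """Return the allowed downtime, in seconds, per day, month, and year.

    Assumes a 30-day month and a 365.25-day year (illustrative choices).
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # Fraction of the time the system is allowed to be down.
    down_fraction = (100.0 - uptime_percent) / 100.0
    return {
        "per_day": down_fraction * SECONDS_PER_DAY,
        "per_month": down_fraction * 30 * SECONDS_PER_DAY,
        "per_year": down_fraction * 365.25 * SECONDS_PER_DAY,
    }


budget = downtime_budget(99.9)
print(f"{budget['per_day']:.2f} seconds per day")
print(f"{budget['per_month'] / 60:.1f} minutes per month")
print(f"{budget['per_year'] / 3600:.2f} hours per year")
```

Returning seconds for every period keeps the function simple. The caller decides whether minutes, hours, or days is the most readable unit.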
If you can make an educated guess as to the total number of hours of downtime you are likely to have, you can use the following formulas to calculate the uptime percentage for a year, a month, or a week:
% Uptime/year = 100 × (8760 - number of total hours down per year)/8760
% Uptime/month = 100 × ((24 × number of days in the month) - number of total hours down in that calendar month)/(24 × number of days in the month)
% Uptime/week = 100 × (168 - number of total hours down in that week)/168
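The same calculation can be sketched in Python. This example computes uptime as the share of hours in the period during which the system was up, multiplied by 100. The function names `uptime_percent` and `hours_in_month` are hypothetical, and the 8760-hour year assumes a non-leap year.

```python
import calendar

HOURS_PER_YEAR = 8760  # 365 days; a leap year has 8784 hours
HOURS_PER_WEEK = 168


def uptime_percent(hours_down: float, hours_in_period: float) -> float:
    """Return uptime as a percentage of the hours in the period."""
    if hours_in_period <= 0:
        raise ValueError("hours_in_period must be positive")
    if not 0 <= hours_down <= hours_in_period:
        raise ValueError("hours_down must be between 0 and hours_in_period")
    return 100.0 * (hours_in_period - hours_down) / hours_in_period


def hours_in_month(year: int, month: int) -> int:
    """Return the number of hours in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return 24 * days


# Example: four hours of downtime in March 2009, and two hours in one week.
print(round(uptime_percent(4, hours_in_month(2009, 3)), 3))
print(round(uptime_percent(2, HOURS_PER_WEEK), 3))
print(round(uptime_percent(8.76, HOURS_PER_YEAR), 3))
```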
What availability is not
Availability is not data protection and recovery, nor is it disaster recovery, although these concepts are related, and you should have data protection and disaster recovery plans in any highly available system. Protecting and recovering data is the general business need that underlies the following specific business needs:
Keeping and being able to review more than one version of an item or site.
Recovering accidentally deleted items or sites.
Archiving data for legal, regulatory, or business reasons.
Restoring systems in the event of unexpected hardware or software failure.
Moreover, availability is not business continuation management (BCM). BCM consists of the business decisions, processes and tools you put in place in advance to handle crises. A crisis can be a local, regional, or national event, or a crisis can relate to only your business.
SharePoint Products and Technologies availability and data protection management strategies may be part of your technical BCM plan, but your overall BCM plan should be much more comprehensive, including the following elements:
Clearly documented procedures.
Offsite storage of key business records.
Clearly designated contacts.
Ongoing staff training.
Offsite recovery mechanisms.
Costs of availability
Availability is one of the more expensive requirements for a system. The higher the level of availability and the more systems you protect, the more complex and costly an availability solution is likely to be. When you invest in availability, costs include:
Additional hardware and software, often involving complex operations between software, such as custom scripts for failover and recovery.
Additional operational complexity.
Evaluate the costs of attaining availability against your business needs, because not all solutions within an organization are likely to require the same level of availability. You can offer different levels of availability for different sites, different services (for example, search and business intelligence), or different farms.
Availability is a key area in which information technology (IT) groups offer service level agreements (SLAs) to set expectations with customer groups. Many IT organizations offer a variety of SLAs that are associated with different chargeback levels.
Note
When calculating availability, most organizations specifically exempt or add hours for planned maintenance activities.
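As an illustration of exempting planned maintenance, the following Python sketch removes the maintenance window from the measured period before computing uptime. This is only one possible reading of an SLA; the function name `sla_uptime_percent` and the choice to subtract maintenance hours from the period are assumptions, not a rule stated in this article.

```python
def sla_uptime_percent(period_hours: float,
                       unplanned_down_hours: float,
                       planned_maintenance_hours: float = 0.0) -> float:
    """Return uptime as a percentage of scheduled service hours.

    Planned maintenance is exempted: it is removed from the period and
    is not counted as downtime (an assumed SLA convention).
    """
    scheduled_hours = period_hours - planned_maintenance_hours
    if scheduled_hours <= 0:
        raise ValueError("planned maintenance must be shorter than the period")
    if not 0 <= unplanned_down_hours <= scheduled_hours:
        raise ValueError("unplanned downtime must fit within scheduled hours")
    return 100.0 * (scheduled_hours - unplanned_down_hours) / scheduled_hours


# One week with an 8-hour maintenance window and 1 hour of unplanned downtime.
print(sla_uptime_percent(168, 1, 8))
```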
Challenges for availability in SharePoint Products and Technologies
A SharePoint Products and Technologies deployment poses the following challenges for providing availability:
While you are applying patches or upgrading the farm, the farm is unavailable.
Index server redundancy cannot be achieved by installing the index role on multiple servers. To overcome the loss of an index server, you will need to reinstall the server and either restore from a backup, or rely on slightly stale results while search recrawls the content. Alternatively, you can use one of the techniques described in the section Availability of Search after failover to reduce the time it takes to recover search.
SharePoint Products and Technologies is not aware of SQL Server mirroring. Although we recommend that you consider using SQL Server mirroring as an availability technique, doing so requires additional automation.
When to consider availability
We recommend that you consider availability requirements as part of the core design of the SharePoint solution. You can also provide enhanced availability after the solution is deployed. Operationally, we recommend that you deploy and tune the core solution within a farm, and then test the availability solutions.
Determining availability requirements
To gauge the organization's tolerance of downtime for a site, service, or farm, answer the following questions:
If the site, service, or farm becomes unavailable, will employees of the organization be unable to perform their expected job responsibilities?
If the site, service, or farm becomes unavailable, will business and customer transactions be halted, leading to loss of business and customers?
If you answered yes to either of these questions, you should invest in an availability solution.
Choose an availability strategy
You can choose among many different approaches to enhance availability, including:
Fault tolerance of components.
Redundancy and failover between server roles within a farm.
Redundancy and failover between server farms.
System requirements for availability
In an ideal scenario, the failover components and systems match the primary components and systems in all ways: platform, hardware, and number of servers. At a minimum, the failover environment must be able to handle the expected traffic during a failover. Keep in mind that only a subset of users may be served by the failover site. The systems must match in at least the following:
Operating system version and all updates
SQL Server versions and all updates
SharePoint Products and Technologies versions and all updates
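A parity check of this kind is easy to automate once you have an inventory of each environment. The following Python sketch compares two inventories and reports any component whose version differs. The function name, the dictionary keys, and the version strings are all hypothetical placeholders; how you collect the real values is outside the scope of this sketch.

```python
def find_version_mismatches(primary: dict, failover: dict) -> dict:
    """Return {component: (primary_version, failover_version)} for each
    component whose version differs or is missing on one side."""
    mismatches = {}
    for component in sorted(set(primary) | set(failover)):
        primary_version = primary.get(component)
        failover_version = failover.get(component)
        if primary_version != failover_version:
            mismatches[component] = (primary_version, failover_version)
    return mismatches


# Placeholder inventories; the version strings are not real build numbers.
primary_farm = {
    "operating_system": "os-build-1",
    "sql_server": "sql-build-1",
    "sharepoint": "sp-build-1",
}
failover_farm = {
    "operating_system": "os-build-1",
    "sql_server": "sql-build-1",
    "sharepoint": "sp-build-0",
}

for name, (expected, actual) in find_version_mismatches(
        primary_farm, failover_farm).items():
    print(f"{name}: primary={expected}, failover={actual}")
```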
Although this article primarily discusses the availability of SharePoint Products and Technologies, the system uptime will also be affected by the other components in the system. In particular, consider the following:
You should ensure that infrastructure dependencies such as power, cooling, network, directory, and SMTP are fully redundant.
Choose a switching mechanism for the system, whether DNS or hardware load balancing, that meets your needs. Best practices for load-balancing Web servers can be found in the following articles:
Component fault tolerance
In any system, we recommend that you work with hardware vendors to procure fault-tolerant hardware that is appropriate for the system, including Redundant Array of Independent Disks (RAID) arrays. For recommendations, see Plan for performance and capacity (Windows SharePoint Services).
When planning for component fault tolerance, consider the following:
Complete redundancy of every component within a server may not be possible or may be impractical. Use additional servers for additional redundancy.
Consider component-level redundancy for the server that hosts the index server role, because that role cannot be made redundant by deploying it to multiple servers.
Ensure that servers have multiple power supplies connected to different power sources for maximum redundancy.
Redundancy and failover between server roles within a farm
SharePoint Products and Technologies supports running server roles on redundant computers (that is, scaling out) within a farm to increase capacity and improve performance and to provide basic availability. Capacity and performance determine both the number of servers and the size of the servers in a farm. After you have met base requirements, you may want to add more servers to increase the overall availability of the service.
Figure: Availability within a single server farm
The following table describes the servers and server roles in a SharePoint Products and Technologies environment as listed on the Services on Server page on the SharePoint Central Administration Web site, and the basic redundancy strategies that can be used for each within a farm.
Services on server | Preferred basic redundancy strategy within a farm |
---|---|
SQL Server | Clustering or synchronous mirroring. Clustering is easier to implement but can be more expensive. For more information about using synchronous mirroring, see Using database mirroring (Office SharePoint Server) (white paper). |
Web servers | Deploy to multiple servers and load balance by using software or hardware load balancing. |
Web server for medium server farms (Web application and search query services) | Deploy to multiple servers. |
Search indexing | Cannot be deployed to multiple servers and be redundant. You must use a different availability strategy. For more information, see Availability of Search after failover. |
Excel Calculation | Deploy to multiple servers. |
Project Application | Deploy to multiple servers. |
For more information, see Plan for redundancy (Office SharePoint Server).
Comparison of database availability strategies for a single farm: SQL Server failover clustering vs. SQL Server high availability mirroring
The following table compares failover clustering to synchronous SQL Server high availability mirroring.
SQL Server failover clustering | SQL Server high availability mirroring |
---|---|
Failover node takes over immediately upon failure. | Mirror takes over immediately upon failure. |
Transactionally consistent. | Transactionally consistent. |
Transactionally concurrent. | Transactionally concurrent. |
Shortest time to recovery (seconds to minutes). | Slightly longer time to recovery (seconds to minutes). |
Failure is automatically detected by database nodes; SharePoint Products and Technologies references the cluster, so failover from a SharePoint Products and Technologies perspective is seamless and automatic. | Requires scripting to achieve SharePoint Products and Technologies failover. |
Does not protect against failed storage, because storage is shared between nodes in the cluster. | Protects against failed storage because both the principal and mirror database servers write to local disks. |
Requires more expensive shared storage. | Can use less-expensive direct-attached storage (DAS). |
Same subnet. | Up to 1 millisecond (ms) in latency between SQL Server and Web servers. |
Can use SQL Server simple recovery model, although the only available recovery point if the cluster is lost will be the last full backup. | Requires SQL Server full recovery model. |
No performance overhead. | Introduces transactional latency. Adds memory and processor overhead. |
Minimal operational burden. | Additional operational burden, including scripting and setting up SQL Server aliases. |
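To give a sense of the scripting that mirroring requires, the following Python sketch shows only the decision step of a failover script: it probes the principal server and falls back to the mirror if the principal does not respond. The function names and host names are hypothetical. A plain TCP probe on the default SQL Server port (1433) is a crude health signal used here as an assumption; the sketch does not inspect mirroring state, repoint SharePoint Products and Technologies, or update a SQL Server alias.

```python
import socket
from typing import Callable, Optional

SQL_SERVER_DEFAULT_PORT = 1433


def is_reachable(host: str,
                 port: int = SQL_SERVER_DEFAULT_PORT,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def choose_database_server(principal: str,
                           mirror: str,
                           probe: Callable[[str], bool] = is_reachable
                           ) -> Optional[str]:
    """Prefer the principal; fall back to the mirror; None if both fail."""
    if probe(principal):
        return principal
    if probe(mirror):
        return mirror
    return None


# Simulated probe so the example runs without a network: only the
# hypothetical mirror host answers.
target = choose_database_server(
    "sql-principal.example", "sql-mirror.example",
    probe=lambda host: host == "sql-mirror.example")
print(target)
```

Passing the probe in as a parameter keeps the decision logic testable without real servers. A production script would replace it with a check that matches how the mirroring session is monitored in your environment.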
Redundancy and failover between closely located data centers configured as a single farm (“stretched” farm)
Some enterprises have data centers that are closely located with high-bandwidth connections, so that they can be configured as a single farm. This is called a “stretched” farm. For a stretched farm to work, there must be less than 1 millisecond latency between SQL Server and the Web servers in one direction, and at least 1 gigabit per second bandwidth.
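These two thresholds can be captured in a simple pre-deployment check. The following Python sketch is illustrative only: the function and constant names are hypothetical, and the measurements are assumed to come from your own network testing.

```python
MAX_ONE_WAY_LATENCY_MS = 1.0   # latency must be below this value
MIN_BANDWIDTH_GBPS = 1.0       # bandwidth must be at least this value


def supports_stretched_farm(one_way_latency_ms: float,
                            bandwidth_gbps: float) -> bool:
    """Return True if the link between data centers meets both thresholds."""
    if one_way_latency_ms < 0 or bandwidth_gbps < 0:
        raise ValueError("measurements must be non-negative")
    return (one_way_latency_ms < MAX_ONE_WAY_LATENCY_MS
            and bandwidth_gbps >= MIN_BANDWIDTH_GBPS)


print(supports_stretched_farm(0.6, 10.0))
print(supports_stretched_farm(2.5, 10.0))
```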
In this scenario, you can provide redundancy for databases by using synchronous mirroring. Within a stretched farm, you can mirror the configuration database and content databases. For a case study of how one company used a stretched farm, see Case Study of High Availability for SharePoint using Database Mirroring (white paper).
Consult with your SAN vendor to determine whether you can use SAN replication or another supported mechanism to provide availability across data centers, such as SQL Server running on geographically dispersed server clusters. Ensure that the SAN replication solution offers a sufficient level of concurrency and transactional consistency.
Within a stretched farm, you can provide fault tolerance for application servers that are running SSPs by having:
Multiple query servers
Multiple servers running Excel Calculation Services
The index server is a single point of failure within this scenario. You can either back up and restore search, or, if search currency upon recovery is vital, use a failover SSP farm. For more information, see Availability of Search after failover.
Figure: “Stretched” farm
Redundancy and failover between data centers with multiple farms
You can set up a failover farm to provide availability in a separate data center from the primary farm. An environment with a separate failover farm has the following characteristics:
A separate configuration database and Central Administration content database must be maintained on the failover farm.
Note
If you have configured alternate access mapping for the primary farm, it is especially important to configure it identically on the failover farm.
All customizations must be deployed on both farms.
Patches must be applied to both farms, individually.
Only content databases can be successfully asynchronously mirrored or log shipped to the failover farm.
Mirrored or log-shipped databases must be set to use the full recovery model.
SSP databases can be backed up and restored to the failover farm.
Consult with your SAN vendor to determine whether you can use SAN replication or another supported mechanism to provide availability across data centers.
This topology can be repeated across many data centers, if you configure SQL Server log shipping to one or more additional data centers.
Note
SQL Server mirroring can only be used with one mirror server, but you can log ship to multiple secondary servers.
Figure: Primary and failover farms before failover
The following table describes the servers and server roles in a SharePoint Products and Technologies environment, and the basic redundancy strategies that can be used for each between server farms.
Server or server role | Preferred basic redundancy strategy between farms |
---|---|
SQL Server | SQL Server asynchronous database mirroring, SQL Server log-shipping, or other asynchronous replication mechanism. Note: Cannot be used for the SSP databases that host search information. |
Front-end Web servers | Deploy on both farms, including customizations. |
Web server for medium server farms (Web application and Search Query services) | Deploy on both farms. |
Search indexing | Deploy on both farms. Recover backup from original farm to move to failover farm. |
Excel Calculation | Deploy on both farms. If SSP does not host search, can use asynchronous database mirroring, SQL Server log shipping, or other asynchronous replication mechanism to move data to failover farm. If SSP also hosts Search, must recover backup from original farm to move. |
Project Application | Deploy on both farms. Recover backup from original farm to move to failover farm. |
Asynchronous replication and search
Search requires complete synchronization between the Search database, SSP database, and index. Because of this requirement, search cannot be replicated between farms by using an asynchronous replication mechanism (asynchronous database mirroring, log shipping, or asynchronous SAN replication). To provide search on a failover farm, you must recover the Search SSP.
Note
If you are running an SSP without search or Project, you can use an asynchronous replication mechanism to move data.
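The rules for moving each kind of database to a failover farm can be summarized as a small decision function. The following Python sketch restates the guidance in this section; the function name, the database kind labels, and the strategy strings are hypothetical labels introduced for illustration.

```python
ASYNC_REPLICATION = "asynchronous mirroring, log shipping, or SAN replication"
BACKUP_AND_RESTORE = "back up on the primary farm and restore on the failover farm"
MAINTAIN_SEPARATELY = "maintain a separate copy on the failover farm"


def between_farm_strategy(database_kind: str,
                          hosts_search: bool = False,
                          hosts_project: bool = False) -> str:
    """Return the strategy for moving a database to a failover farm.

    database_kind is one of: "configuration", "central_admin_content",
    "content", or "ssp" (labels assumed for this sketch).
    """
    if database_kind in ("configuration", "central_admin_content"):
        # Each farm keeps its own configuration and Central Administration
        # content databases.
        return MAINTAIN_SEPARATELY
    if database_kind == "content":
        return ASYNC_REPLICATION
    if database_kind == "ssp":
        # Search (and Project) data cannot be moved asynchronously.
        if hosts_search or hosts_project:
            return BACKUP_AND_RESTORE
        return ASYNC_REPLICATION
    raise ValueError(f"unknown database kind: {database_kind!r}")


print(between_farm_strategy("content"))
print(between_farm_strategy("ssp", hosts_search=True))
print(between_farm_strategy("ssp"))
```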
Availability of search after failover
The index server role cannot be redundant within a farm. The needs of the business for search currency after failover may determine the logical architecture of the solution.
If the business does not require immediate search currency and availability after failover, you can back up and restore the Search SSP to the failover site.
If the business requires rapid search currency and availability, you can use one of the following:
A single farm architecture with two identical SSPs.
A centralized parent farm that hosts search and other SSPs. The search service at the central farm crawls content at all other farms. This architecture can be used to support one or more farms.
Important
If the business requires rapid search currency and availability, and you are using profiles, the profiles on the failover SSPs are not synchronized to the profiles on the primary SSPs. Instead, they will be in the state they were in when initially imported. To keep the profiles on all SSPs synchronized, you must use the User Profile Replication Engine that is included in the 32-bit SharePoint Administration Toolkit (https://go.microsoft.com/fwlink/?LinkId=119535) or the 64-bit SharePoint Administration Toolkit (https://go.microsoft.com/fwlink/?LinkId=119536). For more information, see User Profile Replication Engine (Office SharePoint Server).
Single farm with two SSPs
The following architecture protects against failure of an index server. In this topology, both SSPs crawl the same content, using the same rules. The failover SSP is not attached to the primary Web site unless a failover occurs.
Figure: Single farm with two Shared Services Providers
This topology has the following limitations:
Requires twice the space for indexes on each query server.
Requires manually switching a Web application to use the failover SSP (can be scripted).
Reduces the size of the corpus you can crawl by half.
By default, if profiles are enabled, the profiles on the failover SSP are not synchronized to the profiles on the primary SSP. Instead, they will be in the state they were in when initially imported. To keep the profiles on all SSPs synchronized, you must use the User Profile Replication Engine that is included in the 32-bit SharePoint Administration Toolkit (https://go.microsoft.com/fwlink/?LinkId=119535) or the 64-bit SharePoint Administration Toolkit (https://go.microsoft.com/fwlink/?LinkId=119536). For more information, see User Profile Replication Engine (Office SharePoint Server).
The ability to crawl large data sets in a timely fashion is affected by several factors including the latency and bandwidth between index servers and Web servers.
In an environment with limited bandwidth, this topology can significantly reduce performance. Crawling content twice places additional load on the content repositories being crawled, which can affect repository performance. The ability of search to keep the index fresh may also be negatively affected.
Centralized SSP farms
In the following architecture, the use of a parent SSP farm protects against failure of an index server. Although this may appear to be a hardware-intensive solution, separate SSP farms can share some hardware such as a clustered or mirrored database server, as long as the index servers reside on separate servers. For more information about planning and configuring SSP farms, see Plan SSP architecture.
This topology has the following benefits:
SSP management is centralized.
Failure of a farm does not require a recrawl.
Figure: Centralized SSP farms
This topology has the following limitations:
Content crawling over a wide-area network (WAN) uses bandwidth.
Keeping indexes current can be difficult in environments with high volumes of data and high rates of change.
Query performance may be affected by the performance of WAN links.
By default, if profiles are enabled, the profiles on the failover SSP farm are not synchronized to the profiles on the primary SSP. Instead, they will be in the state they were in when initially imported. To keep the profiles on all SSPs synchronized, you must use the User Profile Replication Engine that is included in the 32-bit SharePoint Administration Toolkit (https://go.microsoft.com/fwlink/?LinkId=119535) or the 64-bit SharePoint Administration Toolkit (https://go.microsoft.com/fwlink/?LinkId=119536). For more information, see User Profile Replication Engine (Office SharePoint Server).
Summary
Carefully review your availability requirements. The higher the level of availability and the more systems you protect, the more complex and costly an availability solution is likely to be.
The costs of attaining availability should be evaluated based on business needs. Not all solutions within an organization are likely to require the same level of availability. You can offer differing levels of availability for different sites, different services (for example, search and business intelligence), or different farms.
Acknowledgements
The Microsoft Office SharePoint Server Content Publishing team thanks the following technical reviewers on this paper:
Bill Baer, Microsoft Online Services, Hosted SharePoint, Technology Architect
James Petrosky, Microsoft Consulting Services, Senior Consultant
Steve Peschka, Microsoft Consulting Services, IW Senior Architect
Dan Winter, Microsoft Customer Support Services, Escalation Engineer
Sean Livingston, Microsoft SharePoint Products and Technologies, Program Manager
Mike Watson, Technology Architect
Todd Carter, Microsoft Premier Field Engineering, Principal Premier Field Engineer
Mike Plumley, Microsoft Office Project Server, Writer
Christophe Fiessinger, Microsoft Office Project, Senior Technical Product Manager
Sid Shah, Microsoft Search, Program Manager
Luca Bandinelli, Microsoft SharePoint Products and Technologies, Program Manager