SharePoint 2010 Search 'Dogfood' Part 1 - Hardware

 

Hello again, Dan Blood here. To lay out the lessons I've learned hosting Search 2010, I will start with a full picture of the hardware behind SearchBeta.  Be aware that this hardware is a little underpowered for the number of items in the index.  As such, you should not take the hardware listed below verbatim and build your solution on top of it; rather, use this hardware and these lessons as a starting point.  Coupled with the capacity planning document, you can then tailor your hardware to your needs and your business-specific SLA requirements.

 

The starting point for the hardware behind SearchBeta was to provide a search experience over roughly 60 million items and keep those items freshly crawled within a 24-hour window.  Over time the corpus has grown to ~72 million items.  With 3 crawl databases, the system is almost able to meet a 4-hour freshness target for the majority of the 72 million items.
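
For a sense of the crawl rate those freshness targets imply, here is a quick back-of-the-envelope calculation. This is my own arithmetic, not an official capacity-planning formula: keeping N items fresh within a window means revisiting items at an average rate of roughly N divided by the window length.

```python
# Rough crawl-rate arithmetic for the freshness targets above. This is a
# back-of-the-envelope average, not an official capacity-planning formula.

ITEMS = 72_000_000       # current corpus size
CRAWL_DATABASES = 3

def required_rate(items: int, window_hours: float) -> float:
    """Average items/sec needed to revisit every item within the window."""
    return items / (window_hours * 3600)

for window in (24, 4):
    rate = required_rate(ITEMS, window)
    print(f"{window:>2}h freshness: {rate:,.0f} items/sec "
          f"(~{rate / CRAWL_DATABASES:,.0f} per crawl database)")
```

Tightening the window from 24 hours to 4 hours raises the required average crawl rate from roughly 833 items/sec to 5,000 items/sec, which is why the crawl databases and their I/O matter so much.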

 

The ~72 million items are broken out across the following content sources:

  • Large enterprise collaboration portal (~24 million) -- SharePoint 2010 content
  • Early adopters (~15 million) -- Portals running the most recent SharePoint codebase
  • My Sites (~5 million) -- SharePoint 2010 My site content
  • Europe (~12 million) -- SharePoint 2007 content hosted across a WAN (170ms ping)
  • Asia (~9 million) -- SharePoint 2007 content hosted across a WAN (200ms ping)
  • IT Hosted Portals (~3 million) -- Mixed SharePoint 2007/2010 content
  • Non IT Hosted Portals (~3 million) -- Mixed HTTP and SharePoint 2007 content
  • BDC (~1.5 million) -- SQL Server content crawled via the BDC

 

The query load on the system is not extreme, peaking at 120 queries per minute.  Given the query load and the desire to reduce costs, I implemented query redundancy with an active/passive scheme on each of the 6 query servers, meaning each server hosts 2 active partitions and 2 passive partitions, providing full redundancy.  This configuration is typical for 60 million items; however, at ~72 million items we are out of capacity.  We recommend having enough memory to fit 33% of the index in RAM, but with a combined active index size of 106GB we are only able to fit 30% of the index in memory.  Because of this, our 95th-percentile query latency over a 24-hour period is 1.1 seconds; to reach sub-second latencies we would need to meet the 33% guideline.
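
To make the 33% guideline concrete, here is a minimal sketch of the check. The amount of RAM actually left for caching the index after the OS and the search processes take their share is an assumption on my part; measure it on your own farm and substitute your own number.

```python
# Sketch of the 33%-of-index-in-RAM guideline described above. The cache
# RAM figure is an ASSUMED leftover after the OS and search processes;
# it is not a measured value from this farm.

ACTIVE_INDEX_GB = 106          # combined size of all active partitions
QUERY_SERVERS = 6
CACHE_RAM_PER_SERVER_GB = 5.3  # assumption: RAM free for index caching per box

total_cache_gb = QUERY_SERVERS * CACHE_RAM_PER_SERVER_GB
fraction_cached = total_cache_gb / ACTIVE_INDEX_GB

print(f"Index cached in RAM: {fraction_cached:.0%} (guideline: >= 33%)")
if fraction_cached < 0.33:
    shortfall = 0.33 * ACTIVE_INDEX_GB - total_cache_gb
    print(f"Short by ~{shortfall:.0f}GB of cache across the farm")
```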

 

Hardware & Topology

The hardware for SearchBeta is a 10-box services farm with the following machines:

  • 6 Query Servers
  • 6 instances of the Search Query and Site Settings Service (QP), one running on each Query Server
  • 2 Indexers
  • 1 SQL server for the Property & Admin databases
  • 1 SQL server for all 3 Crawl databases

There are a few main areas I would change about the SearchBeta hardware if I were purchasing it all again from scratch:

  • Add 2 additional Query servers and/or machines with 48GB of RAM to correct our capacity problem. This would allow us to comfortably provide a search experience over 80 million items. The basic guideline is to have approximately 10 million items in active partitions per machine. It is OK to be a little above this, but you must fit 33% or more of the index into memory to keep query latency adequate (see the sizing sketch after this list).
  • Add more memory to the Property SQL machine. However, I would need to weigh the cost of 128GB of RAM against the cost of two machines with 64GB each. At current memory prices, two machines would be only slightly more expensive up front, and they would provide a little more room for growth. The basic guideline is to fit 33% of the Property store into memory.
  • Add redundancy to the SQL servers. Currently there is no way to patch or cycle the SQL boxes without taking downtime for the service.
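
The sizing sketch below pulls those guidelines together: roughly 10 million items in active partitions per query server, 33% of the index in memory, and 33% of the Property store in memory. The index-size-per-item ratio is an assumption I derived from this farm (~106GB across ~72 million items); your ratio will vary with content type.

```python
# Minimal sizing sketch for the guidelines above. The GB-per-million-items
# ratio is an assumption derived from this farm; adjust for your content.

import math

TARGET_ITEMS = 80_000_000
ITEMS_PER_QUERY_SERVER = 10_000_000   # guideline: ~10M active items per box
GB_INDEX_PER_M_ITEMS = 106 / 72       # ~1.5GB per million items (assumed)
PROPERTY_STORE_GB = 232               # Property store size on this farm

query_servers = math.ceil(TARGET_ITEMS / ITEMS_PER_QUERY_SERVER)
index_gb = TARGET_ITEMS / 1_000_000 * GB_INDEX_PER_M_ITEMS
ram_for_index_gb = 0.33 * index_gb

print(f"Query servers needed: {query_servers}")
print(f"Projected active index: {index_gb:.0f}GB; RAM to cache 33%: "
      f"{ram_for_index_gb:.0f}GB farm-wide "
      f"(~{ram_for_index_gb / query_servers:.1f}GB per server)")
print(f"Property store RAM target (33%): {0.33 * PROPERTY_STORE_GB:.0f}GB")
```

With the Property store at 232GB, the 33% guideline works out to roughly 77GB, which is why the current 64GB of RAM on that SQL machine no longer suffices.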

Query Server machine specs

We initially started with 4 query servers and grew to 6.  As a result, we have query servers with different clock speeds.  This is discouraged, because the slower machines degrade overall query latency: every query must be executed against each unique index partition, so end-to-end latency is gated by the slowest server.
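
To illustrate why a couple of slower boxes hurt every query, here is a small simulation of the scatter-gather pattern with 12 active partitions (as on this farm). The latency numbers are made up purely for illustration: the point is that a query fans out to all partitions and can only return once the slowest one answers.

```python
# Toy scatter-gather latency model with made-up numbers: end-to-end query
# latency is the max over all partitions, not the average, so slower
# machines drag every query down.

import random

random.seed(1)

def query_latency_ms(partition_latencies):
    """A query must wait for every partition; the slowest one gates it."""
    return max(partition_latencies)

fast_farm = [random.gauss(400, 50) for _ in range(12)]   # 12 matched partitions
mixed_farm = fast_farm[:-2] + [random.gauss(700, 50) for _ in range(2)]

print(f"Matched hardware : {query_latency_ms(fast_farm):.0f}ms")
print(f"Two slower boxes : {query_latency_ms(mixed_farm):.0f}ms")
```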

6 Machines

  • 4 x Dell 2950 - 2 quad-core 3.00GHz (8 cores total)
  • 2 x Dell 2950 - 2 quad-core 2.33GHz (8 cores total)
  • 32GB RAM
  • Drives: 6 x 15k SAS 450GB spindles
    • OS - RAID 1
    • Index - RAID 0, 4 spindles

 

Property Database SQL machine specs

1 Machine

  • Dell 6850 - 4 dual-core 3.2GHz (8 cores total, hyper-threading disabled)
  • 64GB RAM
  • Internal drives: 2 x 15k 148GB; RAID 1 for the OS and SQL
  • 1 x PERC 6/E RAID controller connected to an MD1120

 

MD1120 - 24 x 10k 148GB spindles

| Data                               | RAID Type | Spindles | Space Used (Reserved / Used) |
|------------------------------------|-----------|----------|------------------------------|
| Property database                  | RAID 1+0  | 12       | 405GB / 232GB                |
| Property database log              | RAID 1    | 2        | 9GB                          |
| Admin database                     | RAID 1+0  | 4        | 12GB / 12GB                  |
| Admin log                          | RAID 1    | 2        | 6GB                          |
| Temp database & log (8 data files) | RAID 1+0  | 4        | 114MB / 52MB                 |

 

Crawler machine specs

2 Machines

  • Dell 2950 - 2 quad-core 2.33GHz (8 cores total)
  • 8GB RAM
  • Drives: 2 x 15k SAS 148GB
    • OS & Product - RAID 1

 

Crawl Database SQL machine specs

1 Machine

  • Dell 6850 - 4 dual-core 3.2GHz (8 cores total, hyper-threading disabled)
  • 64GB RAM
  • Internal drives: 2 x 15k 148GB; RAID 1 for the OS and SQL
  • 3 x PERC 6/E RAID controllers, each connected to an MD1000

 

3 x MD1000 - 45 x 15k 450GB spindles

| Data                               | RAID Controller | RAID Type | Spindles | Space Used (Reserved / Used) |
|------------------------------------|-----------------|-----------|----------|------------------------------|
| Crawl database 1                   | 2               | RAID 1+0  | 6        | 116GB / 80GB                 |
| Crawl database 2                   | 1               | RAID 1+0  | 6        | 157GB / 56GB                 |
| Crawl database 3                   | 1               | RAID 1+0  | 6        | 70GB / 56GB                  |
| Crawl database log 1               | 3               | RAID 1    | 2        | 54GB                         |
| Crawl database log 2               | 3               | RAID 1    | 2        | 95GB                         |
| Crawl database log 3               | 3               | RAID 1    | 2        | 33GB                         |
| Temp database & log (8 data files) | 3               | RAID 1+0  | 4        | 6.7GB / 5.7GB                |

 

In the coming posts I will dig further into the crawl and query sides of the system, as well as how SQL is utilized, providing further detail about how to monitor the running system and which areas to watch to see whether the system is reaching capacity.

 

Dan Blood

Senior Test Engineer

Microsoft Corp