Lecture Notes - Evolution of Google Search Engine
Jeff Dean gave a keynote Building Large Scale Information Retrieval Systems at WSDM 2009. It's actually a presentation on how Google search engine evolves during the past 10 years. Here are my notes for this lecture.
Part I - Overview of Search Engine Evolution: 1999 VS 2009
Factors to Consider when Designing a Information Retrieval System:
1. Corpus Size(# docs to be indexed)
2. QPS(Query Per Second)
3. Freshness/Update Rate
4. Query Latency
5. Complexity/Cost of Scoring/Retrieval Algorithm
Parameter Change 1999 -> 2009:
1. Corpus Size: 70M -> *B ~100X
2. QPS: ~1000X
3. Refresh: Months -> Minutes ~10000X
4. Latency: <1s> <02.s style="font-weight: bold; font-style: italic;">~5X
5. Machine Scale: ~1000X
Consider 10x Growth when designing, Rewrite for 100x Growth!
Part II - Evolution of Google Search Engine
~1997 - Circa, Research Prototype
- Simple Architecture and Focus on System Distributing/Partitioning
- Term vs Doc based Partition: Doc based Win
- Disk Based Index, DocID+Posting List with Position Attributes, Byte Aligned Encoding
- Introduced Cache
-- hit rate is low 30~60% due to index refresh and long tail query
-- very beneficial, reduce large disk i/o
-- hot term first priority to cache, hot and costy request
- Replica Index Data
-- better performance
-- better availability
Some Summary in late 1990's:
Crawler is simple batch system
- start with very few urls
- queue it when found new urls
- stop when have enough docs
Index Serving using cheap machine
- no failure handling
- added record/chunk checksum
Index Update
- once/month
- wait traffic to low -> take replica offline -> do update -> start serving
Situation:
- doc size:50m -> 1000m
- ~20% query traffic increase/month
- Yahoo! deal
Solution:
add machines constantly
- more index shards for larger index size
- more index replica for bigger query capacity
And improve software constantly
- better disk scheduling
- better index encoding
~2001 - Adding In-Memory Index
- enough machine memory: holding all index in mem
- machine function: replica -> micro shard holding
- balancer: cordinator
- availability: replicate important docs
- Generalize tree structured query flow
- Generalize balancer concept
- New index encoding: group varint encoding
- GFS appear in production
- Universal search: combine results from multiple vertical corpus
- Realtime search: fast url finding -> crawling -> indexing -> serving cycle
-
Experiment supporting: have idea -> try it on real data offline and
tune -> live experiment on small piece -> roll out and launch
Part III - Future Trends
- Cross Language Information Retrieval
- ACL in large IR system with huge amount of user and dynamic requirement
-
Automatic Construction of Efficient IR system (one bin for realtime and
regular web index with different parameter configuration)
- Info extraction from semi-structured data
[Reference]
1. Challenges in Building Large Scale Information Retrieval Systems[PDF, Video]
2. Notes by Another Blogger - https://www.searchenginecaffe.com/2009/02/jeffrey-dean-wsdm-keynote-building.html