Site reliability engineering documentation
Site reliability engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in its systems, services, and products.
Improving Reliability through Modern Operations Practices
SRE resources
SRE on Azure
Documentation for SREs
Architecture
Provisioning and Delivery
SRE talks from Microsoft
Culture
- The Evolution of Site Reliability Engineering
- Building SRE: Culture from the Outside In
- Cultural Nuance and Effective Collaboration for Multicultural Teams
- Evolution of SRE and Rising Need of SRE Catalyzers
- Feedback Loops: How SREs Benefit and What Is Needed to Realize Their Potential
- Understanding Business Metrics Can Make You a Better SRE
- The Never-Ending Story of Site Reliability
- Every Day Is Monday in Operations
Incident Response and Post-Incident Reviews
Monitoring and Observability
- Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up
- Off the Beaten Path: Moving Observability Focus from Your Service to Your Customer
- You Get What You Measure—Why Metrics Are Important
- Weathering the Storm: How Early Warnings Save the Farm
- Capturing and Analyzing Millions of Queries without Any Overhead
- Event Correlation: A Fresh Approach towards Reducing MTTR
- How Robust Monitoring Powers High Availability for LinkedIn Feed
- Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Practices and Principles
- Availability—Thinking beyond 9s
- Mental Models for SREs
- Prioritizing Trust While Creating Applications
- Java Hates Linux. Deal with It.
- Characterizing and Understanding Phases of SRE Practices
- Security and SRE: Natural Force Multipliers
- Production Improvement Review: Taking a Bite Out of Repair Debt
- Ensuring Reliability of High-Performance Applications
- The Service Score Card—Gamifying Operational Excellence
- How to Improve a Service by Roasting It
Teams and Management
- Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way
- Leading without Managing: Becoming an SRE Technical Leader
- Differences in SRE Implementations across Companies
- 100 Teams, 100 Ways to Fail
- The Why, What, and How of Starting an SRE Engagement
- Building and Running SRE Teams
- College Student to SRE: Onboarding Your Entry Level Talent
- LinkedIn SRE: From Inception to Global Scale
- Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
- Transforming Tier 1 Caterpillars to Butterflies
Tools and Technologies
- Azure SREBot: More than a Chatbot—an Intelligent Bot to Crush Mitigation Time
- TrafficShift: Avoiding Disasters at Scale
- Let's Build a Distributed File System
- TCP—Architecture, Enhancements, and Tuning
- BGP—The Backbone of the Internet
- The Ops in Serverless
- How We Used Kafka to Scale Database Infrastructure
- Networks for SREs: What Do I Need to Know for Troubleshooting Applications
- Ambry—LinkedIn’s Distributed Immutable Object Store
- BPerf—Bing.com Cloud Profiling on Production
- DNS: Old Solution for Modern Problems
- Traffic Steering using RUM DNS @ LinkedIn
Scaling
- Traffic Forecasting and Stress Testing Infrastructure
- Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data
- Scaling a Distributed Stateful System: A LinkedIn Case Study
- Debugging at Scale—Going from Single Box to Production
- Building Centralized Caching Infrastructure at Scale
- Scalable Coding—Find the Error
- Managing Capacity @ LinkedIn
- InStream: Large Scale Distribution using BitTorrent, Python, Salt, and Kafka
- Avoiding and Breaking Out of Capacity Prison
- The Evolution of Global Traffic Routing and Failover