Site Reliability Engineering documentation
Site Reliability Engineering is an engineering discipline devoted to helping organizations sustainably achieve the appropriate level of reliability in their systems, services, and products.
Improve reliability through modern operations practices
SRE in Azure
Documentation for SREs
Architecture
Provisioning and deployment
SRE talks from Microsoft
Culture
- The Evolution of Site Reliability Engineering
- Building SRE: Culture from the Outside In
- Cultural Nuance and Effective Collaboration for Multicultural Teams
- Evolution of SRE and Rising Need of SRE Catalyzers
- Feedback Loops: How SREs Benefit and What Is Needed to Realize Their Potential
- Understanding Business Metrics Can Make You a Better SRE
- The Never-Ending Story of Site Reliability
- Every Day Is Monday in Operations
Incident Response and Post-Incident Reviews
Monitoring and Observability
- Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up
- Off the Beaten Path: Moving Observability Focus from Your Service to Your Customer
- You Get What You Measure—Why Metrics Are Important
- Weathering the Storm: How Early Warnings Save the Farm
- Capturing and Analyzing Millions of Queries without Any Overhead
- Event Correlation: A Fresh Approach towards Reducing MTTR
- How Robust Monitoring Powers High Availability for LinkedIn Feed
- Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Practices and Principles
- Availability—Thinking beyond 9s
- Mental Models for SREs
- Prioritizing Trust While Creating Applications
- Java Hates Linux. Deal with It.
- Characterizing and Understanding Phases of SRE Practices
- Security and SRE: Natural Force Multipliers
- Production Improvement Review: Taking a Bite Out of Repair Debt
- Ensuring Reliability of High-Performance Applications
- The Service Score Card—Gamifying Operational Excellence
- How to Improve a Service by Roasting It
Teams and Management
- Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way
- Leading without Managing: Becoming an SRE Technical Leader
- Differences in SRE Implementations across Companies
- 100 Teams, 100 Ways to Fail
- The Why, What, and How of Starting an SRE Engagement
- Building and Running SRE Teams
- College Student to SRE: Onboarding Your Entry Level Talent
- LinkedIn SRE: From Inception to Global Scale
- Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
- Transforming Tier 1 Caterpillars to Butterflies
Tools and Technologies
- Azure SREBot: More than a Chatbot—an Intelligent Bot to Crush Mitigation Time
- TrafficShift: Avoiding Disasters at Scale
- Let's Build a Distributed File System
- TCP—Architecture, Enhancements, and Tuning
- BGP—The Backbone of the Internet
- The Ops in Serverless
- How We Used Kafka to Scale Database Infrastructure
- Networks for SREs: What Do I Need to Know for Troubleshooting Applications
- Ambry—LinkedIn’s Distributed Immutable Object Store
- BPerf—Bing.com Cloud Profiling on Production
- DNS: Old Solution for Modern Problems
- Traffic Steering using Rum DNS @ LinkedIn
Scalability
- Traffic Forecasting and Stress Testing Infrastructure
- Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data
- Scaling a Distributed Stateful System: A LinkedIn Case Study
- Debugging at Scale—Going from Single Box to Production
- Building Centralized Caching Infrastructure at Scale
- Scalable Coding—Find the Error
- Managing Capacity @ LinkedIn
- InStream: Large Scale Distribution using BitTorrent, Python, Salt, and Kafka
- Avoiding and Breaking Out of Capacity Prison
- The Evolution of Global Traffic Routing and Failover