Site reliability engineering documentation

Site reliability engineering (SRE) is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in its systems, services, and products.

Video: Introduction to SRE: What is SRE? (1/3)

Video: Introduction to SRE: Core Principles and Practices (2/3)

Video: Introduction to SRE: How to Get Started (3/3)

Improving Reliability through Modern Operations Practices

SRE online courses

SRE resources

SRE on Azure

Documentation for SREs

Architecture

Monitoring

Provisioning and Delivery

Scaling

SRE talks from Microsoft

Culture

Incident Response and Post-Incident Reviews

Monitoring and Observability

Practices and Principles

Teams and Management

Tools and Technologies

Scaling