Question 1

Scenario 1: Removing toil through automation

Accepted Answer

Site reliability engineers and others who use SRE practices attempt wherever possible to remove toil. "Toil" means a specific thing in SRE. Toil refers to operations work being done by a human that has certain characteristics. Toil has no long term redeeming value. It doesn't move the service forward in any meaningful way. It is often repetitive and largely manual (even though it could be automated). As the service or systems gets bigger over time, the number of requests for that system will also probably increase in quantity at a proportional rate and require even more manual labor.

For example, if a service requires the SRE team to reset something every week, or to provision new accounts and disk space by hand, or repeatedly restart it by hand--this is operational load that is toil. Completing those actions hasn’t made the service better in any long-term, persistent way. These actions will likely have to be repeated over and over.

SREs hate toil. They work to eliminate it whenever possible and appropriate. This is one of the places automation comes into play in SRE. If these requests can be handled automatically, that frees up the team to work on more rewarding and impactful things.

Coding expertise: automation requires some coding expertise, but it does not have to require full software engineering skills. If you can write small scripts (perhaps in PowerShell or Bourne shell) or even if you create an Azure logic app with barely any code, this app can still help eliminate toil.

Question 2

Scenario 2: Control through APIs/domain specific languages (DSLs)/templates

Accepted Answer

Though not strictly necessary for SRE work, being able to control environments through APIs, DSLs, and Templates (especially cloud environments) allow SREs to scale up their work. Provisioning/De-provisioning infrastructure, configuring monitoring, and integrating several services all becomes much more efficient via coding.

Coding expertise: like the previous scenario, this requires some coding expertise, but it does not have to require full software engineering skills. In addition to the scripts and logic apps mentioned before, Azure Resource Manager templates can also be used with minimal coding experience.

Question 3

Scenario 3: Fixing the code

Accepted Answer

Site reliability engineers look to improving the reliability of a system. This goal sometimes requires digging into the source code of a system, determining the issue, and often contributing a fix back to the code base. While the level of sophistication of this work can vary widely based on the situation, coding expertise is a definite requirement in these cases.

Coding expertise: full software engineering expertise is often required in this scenario.

Frequently asked questions: Do I need to know how to code to get involved with SRE?

Scenario 1: Removing toil through automation

Scenario 2: Control through APIs/domain specific languages (DSLs)/templates

Scenario 3: Fixing the code

Next steps

Feedback

Additional resources