What is site reliability engineering?

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that needs to ship continuously and the operations team that's responsible for the reliability of the production environment. Site reliability engineering shifts responsibility for production reliability to the site reliability engineers (SREs) on the development team.

Site reliability engineers typically spend up to 50% of their time on the daily tasks that keep the application reliable and the rest of their time developing software.

A key skill of a site reliability engineer is a deep understanding of the application. This includes knowledge of the code, how the application runs, how it's configured, and how it scales.

Some of the typical responsibilities of a site reliability engineer are to:

  • Proactively monitor and review application performance.
  • Handle on-call and emergency support.
  • Ensure that the software has good logging and diagnostics.
  • Create and maintain operational runbooks.
  • Help triage escalated support tickets.
  • Work on feature requests, defects, and other development tasks.
  • Contribute to the overall product roadmap.
  • Perform live site reviews and capture feedback for system outages.
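
As an illustration of the first responsibility in the list above, the following is a minimal sketch of a periodic performance review, not a definitive implementation. It assumes that request latencies and an error count for one time window are already available in memory. The names `HealthReport`, `percentile`, and `review_window`, and the default thresholds (1% error rate, 500 ms at the 95th percentile), are hypothetical and chosen only for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class HealthReport:
    error_rate: float
    p95_latency_ms: float
    healthy: bool


def percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list (pct is 1-100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def review_window(latencies_ms, error_count, max_error_rate=0.01, max_p95_ms=500.0):
    """Summarize one window of requests and compare it to the thresholds.

    The threshold defaults are assumptions for this sketch, not recommendations.
    """
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no requests in window")
    error_rate = error_count / total
    p95 = percentile(latencies_ms, 95)
    healthy = error_rate <= max_error_rate and p95 <= max_p95_ms
    return HealthReport(error_rate, p95, healthy)
```

In practice, a report like this would feed an alerting rule or a dashboard rather than being called by hand. The point of the sketch is that "healthy" is defined by explicit, reviewable thresholds.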

Site reliability engineering versus DevOps

DevOps builds a healthy working relationship between the operations staff and the development team. By breaking down the silos between the two, DevOps produces a more robust, reliable product.

Both SRE and DevOps are methodologies that address an organization's need to manage the production environment. As you've learned in the previous modules, DevOps feedback systems can identify problems and alert the developers, who then solve the issue. With SRE, a person on the development team looks for site reliability issues every day and is probably also the person who solves them. While DevOps teams would usually choose to leave the production environment untouched unless absolutely necessary, SREs will likely make changes.

Site reliability engineering skills

The skills that are needed vary depending on the application, how and where it's deployed, and how it's monitored. For example, organizations that use serverless technologies won't need someone with in-depth knowledge of Windows or Linux systems management. However, these skills are critical to teams that use servers for deployments.

Other key skills for a good SRE focus on application monitoring and diagnostics. An SRE should have experience with application performance management tools like Application Insights. They should also understand application logging best practices and exception handling.
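
To make the logging and exception handling point concrete, here is a minimal sketch that uses only Python's standard `logging` module. The order-handling scenario and the names `parse_quantity` and `handle_order` are hypothetical, invented for this example. Sending the log records to a tool such as Application Insights is outside the scope of the sketch.

```python
import logging

logger = logging.getLogger("orders")


def parse_quantity(raw):
    """Convert raw input to a positive integer quantity."""
    quantity = int(raw)  # raises ValueError for non-numeric input
    if quantity <= 0:
        raise ValueError(f"quantity must be positive, got {quantity}")
    return quantity


def handle_order(order_id, raw_quantity):
    """Process one order, logging context on both success and failure."""
    try:
        quantity = parse_quantity(raw_quantity)
    except ValueError:
        # logger.exception logs at ERROR level and attaches the traceback,
        # so whoever is on call can see what failed and for which order.
        logger.exception(
            "order rejected: order_id=%s raw_quantity=%r", order_id, raw_quantity
        )
        return None
    logger.info("order accepted: order_id=%s quantity=%d", order_id, quantity)
    return quantity


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    handle_order("A-1", "3")
    handle_order("A-2", "abc")
```

The sketch catches only the exception it expects (`ValueError`) and includes identifying context in every message. Both habits tend to make diagnostics more useful during an incident.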