Remediation
- 7 minutes
Dividing the incident response lifecycle into five phases, as you've seen in this module, helps you understand the process, but the phases aren't always as distinct as they appear in the diagram. In particular, the line between the response and remediation phases often blurs. This is especially true when actions intended to mitigate or improve the situation have the opposite effect. In that case, response and remediation overlap, or you go back and forth between the two.
In this unit, you'll learn more about remediation and the steps that make up this phase, as well as some helpful tips and tools. One important thing to note: you shouldn't take the measures outlined here as a prescriptive checklist.
If you do indeed have a checklist for remediation already in hand, that's often an indicator that it's time to bring automation into the picture. When you can describe exactly what needs to be done and in what order to remediate a problem, it's the perfect time to teach these steps to a machine so the system can do it for you.
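As a rough illustration of turning a checklist into something a machine can run, here's a minimal Python sketch. It assumes each checklist item can be expressed as a function that returns True on success. The `Step` class, the `run_runbook` function, and the three example steps are hypothetical names invented for this sketch, not part of any Azure tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist item: a label plus a function that returns True on success."""
    name: str
    action: Callable[[], bool]


def run_runbook(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first failure so a person can take over."""
    log = []
    for step in steps:
        ok = step.action()
        log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            log.append("stopping: escalate to on-call")
            break
    return log


# Hypothetical placeholder steps. Real ones would call your own tooling.
def drain_traffic() -> bool:
    return True


def restart_web_server() -> bool:
    return True


def verify_health() -> bool:
    return True


if __name__ == "__main__":
    runbook = [
        Step("drain traffic", drain_traffic),
        Step("restart web server", restart_web_server),
        Step("verify health", verify_health),
    ]
    for line in run_runbook(runbook):
        print(line)
```

Stopping at the first failed step is a deliberate choice in this sketch: an automated runbook that keeps going after something unexpected happens can make the situation worse, which is exactly the blurring of response and remediation described earlier.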
Where to start
You learned about the importance of reducing the time it takes to respond to an incident. Now let's look at a few things that can help speed up the process of remediating, or fixing, the problem.
Different team members might have different mental models of how things work and different ideas as to what the first step should be. One might first look at the logs, while another might first run queries and look at the metrics. There's no one single correct path to success.
However, it helps to provide people with context and guidance as to where they should go and what they should look at.
How and to whom to escalate
An important question to answer in formulating your remediation starting point is: when you get stuck, who can you call to escalate the problem? You should be trying to offload more of the responsibilities of on-call to the team in general, not just Operations or Site Reliability Engineering. It should be the responsibility of all team members to have the systems up and running to meet your reliability objectives.
What resources are helpful to the first responders?
The next consideration is to determine what the first responders can use to get started. This could include relevant metrics, logs, queries, and so forth. Provide these in an Azure Monitor workbook if possible. We'll talk about workbooks in just a moment.
It's also useful to provide simple links to resources (often in a workbook). If your goal is to respond and remediate the issue as quickly as possible, helping people find the answers to questions without having to search for the right document or URL will speed up the process.
Update stakeholders
You can become so focused on fixing the problem that you might forget that there are many people who aren't directly involved in the response to the incident, but who want and need to know what's going on.
It's important to communicate with other internal teams and keep them apprised of what's happening when an incident occurs. If you don't provide them with consistent updates, they're likely to come around asking for a status update. They have every right to this information, but you need a better way to make them aware of the issue and what's being done about it.
Start by clearly acknowledging the incident to your internal teams. Present what you know and what's being done, and set expectations for when they'll hear from you next.
The formula for your communications to stakeholders is simple:
- This is what we know.
- This is what we're doing.
- We'll get back to you in X amount of time.
This will help prevent stakeholders from coming to you and interrupting you when you're in the midst of trying to fix the problems.
One way to distribute this information is through the use of an easily editable status web page like the one we mentioned in the last unit. In many cases, you might wish to have a separate, more detailed status page for internal stakeholders and an external one for your customers. The preceding formula works for both cases.
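The three-part formula above is simple enough to template. Here's a minimal Python sketch, assuming you post updates as plain text to a status page or chat channel. The function name `format_status_update` and the message layout are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def format_status_update(
    known: str,
    doing: str,
    next_update_minutes: int,
    now: Optional[datetime] = None,
) -> str:
    """Build a stakeholder update: what we know, what we're doing, when we'll report back."""
    # Normalize to UTC so every reader sees the same clock time.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    next_update = now + timedelta(minutes=next_update_minutes)
    return "\n".join([
        f"Status update ({now:%Y-%m-%d %H:%M} UTC)",
        f"What we know: {known}",
        f"What we're doing: {doing}",
        f"Next update: by {next_update:%H:%M} UTC (in {next_update_minutes} minutes)",
    ])


if __name__ == "__main__":
    print(format_status_update(
        known="Checkout requests are failing for some customers.",
        doing="Rolling back the most recent deployment.",
        next_update_minutes=30,
    ))
```

Stating the next update as a clock time, not just a duration, gives stakeholders a concrete commitment and makes a missed update easy to notice.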
Use Azure Monitor Workbooks
Azure Monitor Workbooks are a tremendously helpful feature for a team in the remediation phase. You can find Azure Monitor Workbooks in the Azure portal under Azure Monitor.
Note
Application Insights Troubleshooting Guides are deprecated, and the Troubleshooting guides menu item is no longer available. If you have existing Troubleshooting Guides, you can still access them from Azure Workbooks and convert them to the standard workbook type in the Azure portal. For all new content, use Azure Monitor Workbooks.
You can think of workbooks as "live documents" you can create using a page-creation interface. When you create a new one, you can add to the page:
- Arbitrary text, like a bulleted list of items to do or other helpful information for someone consulting the page
- Links to other systems, for example, links to other dashboards or documentation
- Kusto Query Language (KQL) queries
It's that last item that makes the document "live." In a previous module in this learning path, we explored KQL, the query language built into Log Analytics and other parts of Azure Monitor. Using this language, we could write our own queries to return and display diagnostic information from our application and Azure infrastructure. When a KQL query is inserted into a workbook, the current results of that query are displayed live to the document's readers. This means that your workbook can say not only "Be sure to check the error rate on the web server" but can also show a current graph for that error rate right there next to the instructions. It can have a link like "here is the web server restart documentation" that takes the first responder right to the documentation they need.
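To make the idea of a "live document" concrete, here's a rough Python sketch that assembles workbook content with an instruction, a documentation link, and a KQL query. Treat every detail as an assumption: the item type numbers and field names are modeled on the JSON that workbooks commonly show in the Advanced editor, the query assumes the classic Application Insights `requests` table, and the documentation URL is a placeholder. Compare the result with JSON exported from one of your own workbooks before relying on it.

```python
import json

# Assumed KQL: percentage of failed requests in 5-minute bins.
ERROR_RATE_QUERY = """
requests
| summarize failed = countif(success == false), total = count() by bin(timestamp, 5m)
| extend error_rate = 100.0 * failed / total
| project timestamp, error_rate
""".strip()


def text_item(markdown: str) -> dict:
    """A text step (assumed item type 1) holding markdown, including links."""
    return {"type": 1, "content": {"json": markdown}}


def query_item(kql: str) -> dict:
    """A query step (assumed item type 3) whose results render when the workbook is opened."""
    return {
        "type": 3,
        "content": {
            "version": "KqlItem/1.0",
            "query": kql,
            "size": 0,
            "queryType": 0,
            "resourceType": "microsoft.insights/components",
            "visualization": "timechart",
        },
    }


def build_workbook(items: list) -> dict:
    """Wrap the steps in the top-level workbook document."""
    return {"version": "Notebook/1.0", "items": items}


if __name__ == "__main__":
    workbook = build_workbook([
        text_item(
            "Be sure to check the error rate on the web server.\n\n"
            "[Web server restart documentation](https://example.com/runbooks/web-server-restart)"
        ),
        query_item(ERROR_RATE_QUERY),
    ])
    print(json.dumps(workbook, indent=2))
```

The text step carries the instruction and the link, and the query step supplies the current graph beside it, which mirrors the three kinds of content listed earlier.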
Azure also provides existing templates to help you get started authoring your own workbooks.
[Screenshot: some of the premade workbook templates you may be offered.]
There's an Advanced editor feature for workbooks that lets you inspect workbook content as JSON and export an Azure Resource Manager (ARM) template representation of the document. This means that it's possible to track and distribute these documents using the source-control system of your choice. It also allows you to automate the provisioning of workbooks, which is useful when you're provisioning other infrastructure. If you prefer Bicep, you can incorporate the workbook resource into a Bicep deployment outside the editor. This approach makes it easy to create a set of custom workbooks for a new service at the time the service is provisioned.
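If you want to generate the template in a build script instead of exporting it by hand, a sketch along these lines might work. It assumes the `Microsoft.Insights/workbooks` resource type with an API version of `2022-04-01`, and property names modeled on a typical exported template. The function name `workbook_arm_template` is hypothetical. The template you export from the Advanced editor is the authoritative reference, so check these values against it.

```python
import json
import uuid


def workbook_arm_template(display_name: str, workbook_content: dict) -> dict:
    """Wrap workbook JSON in a minimal ARM template (assumed resource shape)."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.Insights/workbooks",
                "apiVersion": "2022-04-01",  # assumption: verify against your export
                # Workbook resource names are GUIDs. Deriving one from the display
                # name keeps redeployments pointed at the same workbook.
                "name": str(uuid.uuid5(uuid.NAMESPACE_URL, display_name)),
                "location": "[resourceGroup().location]",
                "kind": "shared",
                "properties": {
                    "displayName": display_name,
                    # The workbook body is stored as a JSON string, not a nested object.
                    "serializedData": json.dumps(workbook_content),
                    "version": "1.0",
                    "sourceId": "Azure Monitor",
                    "category": "workbook",
                },
            }
        ],
    }


if __name__ == "__main__":
    content = {"version": "Notebook/1.0", "items": []}
    template = workbook_arm_template("Web server triage", content)
    print(json.dumps(template, indent=2))
```

Because the output is plain JSON, you can commit it to source control and deploy it alongside the rest of a service's infrastructure.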
Other helpful tips and tools
Throughout this module, you've learned about various tools and shortcuts you can use to increase efficiency and reduce your incident response time. As we wrap up this last unit, we'll give a brief overview of some tools and techniques that are helpful in diagnosing problems within your systems.
- You can use the Application Dashboard link in Application Insights as a customizable starting point for application health and performance. Check Azure Service Health separately so you can tell whether the problem is with your systems or with an Azure service.
- You can use the Application Map in Application Insights to trace dependencies, spot failure hotspots, and narrow down where the issue is occurring. Following those relationships can help you find the likely source of an error, such as a malformed URL.
- You can use Log Analytics to query the telemetry and log data that your monitoring pipeline collects from across the system.
All of the preceding tools are invaluable in remediating problems.