Monitor application to detect failures

Monitoring is crucial for resiliency. If part of the application stack fails, you need to know that it failed, and you need insights into the cause of the failure.

Monitoring a large-scale distributed system poses a significant challenge. For example, imagine an application that runs on a few dozen VMs. It's not practical to log in to each VM, one at a time, and look through log files to troubleshoot a problem. Moreover, the number of VM instances may not be static. In some applications, VMs get added and removed as the application scales in and out. Occasionally an instance may fail and need to be reprovisioned. In addition to this complexity, a typical cloud application might use multiple data stores, such as Azure Storage, SQL Database, Cosmos DB, and Redis cache, and a single user action can span multiple subsystems.

You can think of the monitoring process as a pipeline with several distinct stages:

  • Instrumentation. The raw data for monitoring comes from a variety of sources, including application logs, operating system performance metrics, Azure monitoring resources, Azure Service Health, Azure subscriptions, and Azure tenants. Most Azure services expose metrics that you can analyze to determine the cause of problems.
  • Collection and storage. Raw instrumentation data can be held in various locations and in various formats (for example, application trace logs, IIS logs, and performance counters). These disparate sources are collected, consolidated, and put into reliable data stores such as Application Insights, Azure Monitor metrics, Service Health, storage accounts, and Log Analytics.
  • Analysis and diagnosis. After the data is consolidated in these data stores, it can be analyzed to troubleshoot issues and provide an overall view of application health. Generally, you can query the data in Application Insights and Log Analytics by using Kusto queries. Azure Advisor provides recommendations with a focus on resiliency.
  • Visualization and alerts. In this stage, telemetry data is presented so that an operator can quickly notice problems or trends. Examples include dashboards and email alerts. With Azure dashboards, you can build a single-pane-of-glass view of monitoring graphs originating from Application Insights, Log Analytics, Azure Monitor metrics, and Service Health. With Azure Monitor alerts, you can create alerts on service health and resource health.
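
The analysis and alerting stages can be illustrated with a small, self-contained Python sketch. It is not an Azure API: the record shape (`service`, `success`), the function names, and the 5 percent threshold are all assumptions made for illustration. In a real deployment, the consolidated data would live in a store such as Log Analytics and the alert would be an Azure Monitor alert rule.

```python
from collections import Counter


def error_rates(records):
    """Analysis stage: compute the failure rate per service.

    `records` is a hypothetical list of consolidated telemetry entries,
    each a dict with a "service" name and a boolean "success" flag.
    """
    totals, failures = Counter(), Counter()
    for record in records:
        totals[record["service"]] += 1
        if not record["success"]:
            failures[record["service"]] += 1
    return {service: failures[service] / totals[service] for service in totals}


def services_to_alert(rates, threshold=0.05):
    """Alert stage: return the services whose error rate exceeds the threshold.

    The 5 percent default is an arbitrary, illustrative value.
    """
    return sorted(service for service, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    sample = [
        {"service": "orders", "success": True},
        {"service": "orders", "success": False},
        {"service": "catalog", "success": True},
        {"service": "catalog", "success": True},
    ]
    rates = error_rates(sample)
    print(rates)
    print(services_to_alert(rates))
```

Keeping analysis and alerting as separate steps mirrors the pipeline: the same computed rates can feed both a dashboard and an alert rule.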

Monitoring is not the same as failure detection. For example, your application might detect a transient error and retry, resulting in no downtime. But it should also log the retry operation, so that you can monitor the error rate to get an overall picture of application health.
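
As a minimal sketch of this idea, the following Python function retries an operation after a transient error and logs every retry, so the retries remain visible to monitoring even when the caller never sees a failure. The `TransientError` type, the `call_with_retry` name, and the backoff values are hypothetical and chosen only for illustration; a real application would catch whatever transient exceptions its client library raises.

```python
import logging
import time

logger = logging.getLogger("retry_demo")


class TransientError(Exception):
    """Hypothetical error type standing in for a retryable failure."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on TransientError with exponential backoff.

    Each retry is logged as a warning so that the error rate can be
    monitored even when the retry eventually succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_attempts:
                # Out of attempts: log the failure and surface it to the caller.
                logger.error("operation failed after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "transient error on attempt %d, retrying in %.2fs: %s",
                attempt, delay, exc,
            )
            sleep(delay)
```

The `sleep` parameter exists only so that the delay can be replaced in tests. The warnings are what a monitoring system would count to track the transient error rate.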

Application logs are an important source of diagnostics data. Best practices for application logging include:

  • Log in production. Otherwise, you lose insight where you need it most.
  • Log events at service boundaries. Include a correlation ID that flows across service boundaries. If a transaction flows through multiple services and one of them fails, the correlation ID will help you pinpoint why the transaction failed.
  • Use semantic logging, also known as structured logging. Unstructured logs make it hard to automate the consumption and analysis of the log data, which is needed at cloud scale.
  • Use asynchronous logging. Otherwise, the logging system itself can cause the application to fail: requests block while they wait to write a logging event, and then they back up.
  • Keep application logging separate from auditing. Auditing may be done for compliance or regulatory reasons. As such, audit records must be complete, and it's not acceptable to drop any while processing transactions. If an application requires auditing, keep it separate from diagnostics logging.
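
Several of these practices can be combined in a short sketch that uses only the Python standard library. It shows structured (JSON) log records, a correlation ID attached to each record, and asynchronous logging through a queue. The `JsonFormatter` and `build_async_logger` names, the field names, and the service names are assumptions made for illustration; they aren't part of any Azure SDK.

```python
import json
import logging
import logging.handlers
import queue
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so that tools can parse it."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical fields supplied by the caller through `extra`.
            "service": getattr(record, "service", None),
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_async_logger(name, target_handler):
    """Return a logger that enqueues records and a listener that writes them.

    The application thread only puts records on an in-memory queue. A
    background thread owned by the QueueListener does the actual write.
    """
    log_queue = queue.Queue(-1)  # unbounded queue
    target_handler.setFormatter(JsonFormatter())
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    return logger, listener


if __name__ == "__main__":
    logger, listener = build_async_logger("orders", logging.StreamHandler())
    # One correlation ID follows the transaction across service boundaries.
    correlation_id = str(uuid.uuid4())
    logger.info("order received",
                extra={"service": "frontend", "correlation_id": correlation_id})
    logger.info("payment authorized",
                extra={"service": "payments", "correlation_id": correlation_id})
    listener.stop()  # flush the queued records before exiting
```

Because every record carries the same `correlation_id`, a query over the consolidated logs can reassemble the path of a single transaction. An unbounded queue keeps the sketch simple. A production system would have to decide what happens when the queue grows too large, which is one reason audit records are handled separately from diagnostics logs.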

When you implement monitoring and diagnostics for a critical application, it is also vital to monitor the health of the periodic backup jobs. Azure Backup offers Backup Center, a single unified management experience in Azure for enterprises to govern, monitor, operate, and analyze backups at scale. If you're a backup admin, Backup Center gives you a single pane of glass to monitor your jobs and backup inventory daily. You can also use Backup Center to perform your regular operations, such as responding to on-demand backup requests, restoring backups, and creating backup policies. For analyzing historical trends and gaining deeper insights into your backups, Backup Center provides an interface to Backup Reports, which uses Azure Monitor Logs and Azure Workbooks.