Monitoring for reliability

Monitoring and diagnostics are crucial for reliability. If something fails, you need to know that it failed, when it failed, and why.

Checklist

How do you monitor and measure application health?

  • The application is instrumented with semantic logs and metrics.
  • Application logs are correlated across components.
  • All components are monitored and correlated with application telemetry.
  • Key metrics, thresholds, and indicators are defined and captured.
  • A health model has been defined based on performance, availability, and recovery targets.
  • Azure Service Health events are used to alert on applicable service level events.
  • Azure Resource Health events are used to alert on resource health events.
  • Long-running workflows are monitored for failures.
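
Two of the checklist items above, semantic logs correlated across components and a health model built from defined thresholds, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not part of any Azure SDK: the field names (component, event, correlation_id), the metric names (error_rate, p95_latency_ms), and the threshold values. Real thresholds would come from your own performance, availability, and recovery targets.

```python
import json
from datetime import datetime, timezone


def semantic_event(component, event, correlation_id, **fields):
    """Build one structured (semantic) log record as a JSON string.

    The correlation_id is passed along by every component that handles
    the same request, so records can be joined later.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "component": component,
        "event": event,
        "correlation_id": correlation_id,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical thresholds, chosen only for this example.
THRESHOLDS = {
    "error_rate": {"degraded": 0.01, "unhealthy": 0.05},
    "p95_latency_ms": {"degraded": 500, "unhealthy": 2000},
}

# States ordered from best to worst.
STATES = ["healthy", "degraded", "unhealthy"]


def evaluate_health(metrics, thresholds=THRESHOLDS):
    """Return the worst health state reached by any captured metric."""
    worst = "healthy"
    for name, limits in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # Metric not captured; a fuller model might report "unknown".
            continue
        if value >= limits["unhealthy"]:
            state = "unhealthy"
        elif value >= limits["degraded"]:
            state = "degraded"
        else:
            state = "healthy"
        if STATES.index(state) > STATES.index(worst):
            worst = state
    return worst
```

In use, a front end and a back end would each call semantic_event with the same correlation_id for a given request, which records that something failed, when, and why in a form that can be queried. evaluate_health reduces the captured metrics to a single state that an alert rule could act on. Reporting the worst state across metrics is one simple design choice; a real health model may weight components differently.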

Azure services for monitoring

Reference architecture

Next steps