Testing for reliability

Test as part of each major change and, where possible, on a regular schedule to validate existing thresholds, targets, and assumptions. Testing should also confirm that the health model, capacity model, and operational procedures remain valid.

Checklist

Have you tested your applications with reliability in mind?


  • Test regularly to validate existing thresholds, targets and assumptions.
  • Automate testing as much as possible.
  • Perform testing in both key test environments and the production environment.
  • Perform chaos testing by injecting faults.
  • Create and test a disaster recovery plan on a regular basis using key failure scenarios.
  • Design a disaster recovery strategy to run most applications with reduced functionality.
  • Design a backup strategy that is tailored to business requirements and circumstances of the application.
  • Successfully test and validate the failover and failback approach at least once.
  • Configure request timeouts to manage inter-component calls.
  • Implement retry logic to handle transient failures in the application and in its internal or external dependencies.
  • Configure and test health probes for your load balancers and traffic managers.
  • Apply chaos principles continuously.
  • Create and organize a central chaos engineering team.
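
Two of the checklist items, retry logic for transient failures and chaos testing by injecting faults, can be exercised together in a small automated test. The following is a minimal sketch in Python, not a prescribed implementation. The names TransientError, call_with_retry, FlakyDependency, and no_sleep are hypothetical and exist only for this illustration, as are the attempt count and delay values. A real application would more likely rely on the retry policy built into its client SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a timeout)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); retry on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure to the caller
            # The delay doubles on each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


class FlakyDependency:
    """Test double that injects a fixed number of faults before succeeding."""

    def __init__(self, faults_to_inject):
        self.faults_remaining = faults_to_inject
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.faults_remaining > 0:
            self.faults_remaining -= 1
            raise TransientError("injected fault")
        return "ok"


def no_sleep(_seconds):
    """Stand-in for time.sleep so the test runs instantly."""


# Two injected faults are absorbed by the retry budget of four attempts.
recovering = FlakyDependency(faults_to_inject=2)
assert call_with_retry(recovering, sleep=no_sleep) == "ok"
assert recovering.calls == 3
```

Passing the sleep function as a parameter keeps the test fast and deterministic. The same pattern extends to a dependency that never recovers, which checks that the caller sees the failure once the retry budget is spent instead of retrying indefinitely.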
