Introduction

2 minutes

Cloud computing has resulted in software solutions that are more reliable and more available than ever before. But reliability and availability don't happen by accident. They are the product of careful design, and they must be ensured by constant monitoring that alerts administrators when problems occur (or preferably before they occur). Monitoring is as important in mission-critical solutions deployed to the cloud as the solutions themselves. Without it, you don't know whether a solution is serving its users' needs.

Monitoring takes on many forms. It can be as simple as a software component that periodically pings a running application to make sure it's still running. It can also be more complex, keeping running track of key metrics such as request wait times (the length of time a request waits in a queue before it is processed by the system) and CPU utilization. Monitoring can trigger actions such as increasing the number of virtual machines to handle increased workloads or notifying an administrator of a condition that warrants attention. Because it's not reasonable to expect human operators to monitor systems 24x7, monitoring is automated through software. That software can come from third parties, or it can come from the cloud platform itself.

This module deals with the subject of monitoring the behavior of IT systems, which concerns both the infrastructure of these systems and the applications they host. We begin by making the case for automated monitoring, and then examine some of the more common application performance monitoring (APM) platforms. Next, we address the topic of metrics, which are the key measurements a monitoring platform should track, and discuss how these metrics are put to work ensuring system integrity and improving efficiency and robustness. Finally, we tackle the concept of remediation planning, which defines how performance issues are resolved and mitigated.

Learning objectives

Explain the main argument in favor of continual and consistent monitoring and oversight in cloud-based IT systems
Describe the three types of quantitative measurements that play a role in monitoring
Understand the mechanics of monitoring platforms that utilize agents to collect and report back information
Understand the mechanics of monitoring platforms that rely upon pre-existing sources of information such as service logs to analyze performance
Understand the metrics that are the most beneficial to monitoring
Learn how measurements are used to judge performance levels
Describe the justification for problem ticketing
Describe what KPIs are and how they differ from metrics
Discuss the concept of “everyday remediation”

Prerequisites

Understand what cloud computing is, including cloud service models, and common cloud providers
Recognize cloud service models such as IaaS, PaaS, and SaaS and differentiate between them
Be familiar with different organizational IT cost types like CapEx and OpEx
Understand how cloud resource provisioning works

Continue

Learning objectives

Prerequisites

Feedback