Instrumentation and Telemetry Guidance
Most applications will include diagnostics features that generate custom monitoring and debugging information, especially when an error occurs. This is referred to as instrumentation, and is usually implemented by adding event and error handling code to the application. The process of gathering remote information that is collected by instrumentation is usually referred to as telemetry.
Why Is Instrumentation and Telemetry Important?
While the logs of information generated by the built-in infrastructure and system diagnostics mechanism can provide useful information about operations and errors, most applications will include additional instrumentation that generates custom monitoring and debugging information. This instrumentation typically generates log entries in Windows Event Log, separate trace log files, custom log files, or entries in a data store such as a relational database. These logs provide the information required to monitor and debug the application.
However, in complex applications, and especially applications that must scale to extremely high capacity, the huge volume of data collected can overwhelm simple monitoring systems and techniques. For example, the amount of information generated by hundreds of web and worker roles, database shards, and additional services—much of which may be of relatively low statistical significance, uncorrelated, and delayed in delivery—can become almost impossible to handle in a meaningful way. Instead, you can implement a telemetry solution that collects and highlights operational events and reduces management costs, while at the same time giving useful insights into the application behavior in terms of meeting service level agreements (SLAs) and for guiding future decisions on resource planning.
This topic does not discuss the details of writing code to instrument applications; in most cases the principles and practices for detecting and handling system and application events in cloud-hosted applications, and for defining metrics and key performance indicators (KPIs), are the same as for any other application.
Instrumentation allows you to capture vital information about the operation of your application. This information will generally include:
- Details of operational events that occur as part of the normal operation of the application, together with useful information about each event. For example, in an ecommerce site it would be useful to record the order number and value of each order that is placed. These are typically informational events that are used to collect data about the way the application is used.
- Details of runtime events that occur, and useful information about each event such as the location or data store used and the response time for access to the data store. These are also informational events that can provide additional insight into the normal operation of the application. An event should not include any sensitive information such as credentials, or any other data that might enable an attacker who obtains the logs to compromise the system.
- Specific data about errors that occur at runtime, such as the customer ID and other values associated with an order update operation that failed. Typically these are warning or error events and will contain one or more system-generated error messages.
- Data from performance counters that measure specific values related to the operation of the application. These might be built-in system counters, such as those that measure processor load and network usage, or they might be custom performance counters that measure the number of orders placed or the average response time of a specific component.
A common feature of instrumentation is the ability to change the level of detail that is collected on demand, usually by editing the configuration of the application. Under normal conditions the data from some informational events may not be required, so it is not collected; this reduces the cost of storage and the number of transactions required to collect it. When there is an issue with the application, you update the application configuration so that the diagnostics and instrumentation systems collect informational event data, as well as error and warning messages, to assist in isolating and fixing faults. It may be necessary to run the application in this extended reporting mode for some time if the problem appears only intermittently.
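The following is a minimal sketch of this idea in Python, using the standard `logging` module rather than any Azure-specific mechanism. The logger name `orders`, the `apply_config` helper, and the `log_level` configuration key are all hypothetical names chosen for illustration.

```python
import logging

records = []  # captured (level, message) pairs, standing in for a real log destination


class ListHandler(logging.Handler):
    """Collects log records in memory so the effect of the level change is visible."""

    def emit(self, record):
        records.append((record.levelname, record.getMessage()))


logger = logging.getLogger("orders")  # hypothetical logger name
logger.propagate = False
logger.addHandler(ListHandler())
logger.setLevel(logging.WARNING)  # normal conditions: warnings and errors only


def apply_config(config):
    """Apply a (hypothetical) configuration mapping without restarting the process."""
    logger.setLevel(config.get("log_level", "WARNING"))


logger.info("order placed: %s", 1001)           # dropped: below the current level
logger.warning("slow response from data store")  # collected
apply_config({"log_level": "INFO"})              # switch to extended reporting mode
logger.info("order placed: %s", 1002)           # now collected
```

After the configuration change, informational events are recorded alongside warnings, and setting the level back to `WARNING` returns the application to its normal, cheaper mode.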
The fundamental approach to designing the instrumentation for any application can be defined by considering the requirements in terms of the basic stages for isolating and fixing errors:
- Detect performance issues and errors quickly. Performance counters and event handlers can indicate problems in specific areas of the application due to component and service failures, overloading, and other issues; sometimes before end users are affected. This requires a constant or scheduled mechanism that monitors key thresholds and triggers the appropriate alerts. Detailed information from the instrumentation allows you to drill down into the execution and trace faults.
- Classify the issue to understand its nature. In cloud-hosted applications that run in a shared multitenant environment such as Microsoft Azure, some issues such as connecting to a database may be transient and will resolve themselves. Other issues may be systemic, such as a coding error or an incorrect configuration setting, and will require intervention to resolve the problem.
- Recover from an incident and return the application to full operation. Use the information you collect, perhaps after turning on additional logging settings, to fix the problem and return the application to full service. This is especially important if it is a commercially valuable application for the organization, such as an ecommerce site, or has SLAs that you must meet to avoid financial penalties and customer dissatisfaction.
- Diagnose the root cause of the problem and prevent it from recurring. Carry out root cause analysis to determine the original cause and the underlying nature of the problem, and make changes to prevent recurrence where this is possible. The instrumentation data collected over time will help you identify recurring patterns and trends that led up to the incident, such as overloading of a specific component or invalid data being accepted by the application.
To be able to perform these steps you will need to collect information from all levels of the application and infrastructure. For example, you should consider collecting data about the infrastructure such as CPU load, I/O load, and memory usage; data about the application such as database response times, exceptions, and custom performance counters; and data about business activities and KPIs such as the number of each type of business transaction per hour and the response time of each service the application uses.
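As a rough illustration of the detection stage, the sketch below checks one sample of metrics from each level (infrastructure, application, and business) against a set of thresholds. The metric names, limits, and the `check_thresholds` function are assumptions made for this example; a real system would read the values from performance counters and route the alerts to an alerting channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name
    limit: float  # alert when the sampled value is above this limit


def check_thresholds(sample, thresholds):
    """Return one alert message for each sampled metric that exceeds its limit."""
    alerts = []
    for threshold in thresholds:
        value = sample.get(threshold.metric)
        if value is not None and value > threshold.limit:
            alerts.append(f"{threshold.metric}={value} exceeds {threshold.limit}")
    return alerts


thresholds = [
    Threshold("cpu_percent", 85.0),            # infrastructure level
    Threshold("db_response_ms", 500.0),        # application level
    Threshold("failed_orders_per_hour", 10.0)  # business level
]
sample = {"cpu_percent": 91.5, "db_response_ms": 120.0, "failed_orders_per_hour": 3}
alerts = check_thresholds(sample, thresholds)
```

Running a check like this on a schedule gives the constant or scheduled monitoring mechanism described above; only the CPU metric in this sample produces an alert.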
For a comprehensive guide to topics related to monitoring, instrumentation, and telemetry in Azure applications see Cloud Service Fundamentals on the TechNet Wiki. The topic Telemetry – Application Instrumentation provides information about designing and implementing instrumentation that will support telemetry.
You can use frameworks and third party products to help you implement the instrumentation in your applications. For example, Enterprise Library from the Microsoft patterns & practices team includes application blocks that can help you to simplify and standardize exception handling and logging in Azure applications. For more information see Enterprise Library 6 on MSDN.
Most logging mechanisms store log entries that contain a string value that is the description or message for the entry. With the advent of Event Tracing for Windows (ETW) it became possible to store a structured payload with the event entry. This payload is generated by the listener or sink that captures the event, and it can include typed information that makes it much easier for automated systems to discover useful information about the event. This approach to logging is often referred to as structured logging, typed logging, or semantic logging.
As an example, an event that indicates an order was placed can generate a log entry that contains the number of items as an Integer value, the total value as a decimal number, the customer identifier as a Long value, and the city for delivery as a String value. An order monitoring system can read the payload and easily extract the individual values. With traditional logging mechanisms the monitoring application would need to parse the message string to extract these values, increasing the chance that an error could occur if the message string was not formatted exactly as expected.
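The order example can be sketched in Python as follows. This is not ETW or the EventSource API; it only illustrates the principle of a typed payload, using JSON as an assumed storage format. The `OrderPlaced` type, its field names, and the `to_log_entry` function are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OrderPlaced:
    """Hypothetical event type with a typed payload."""
    item_count: int
    total_value: str    # decimal amount kept as a string to avoid float rounding
    customer_id: int
    delivery_city: str


def to_log_entry(event):
    """Serialize the event name and its typed payload as one structured log entry."""
    return json.dumps({"event": type(event).__name__, "payload": asdict(event)})


entry = to_log_entry(OrderPlaced(3, "59.97", 1234567890123, "Seattle"))

# A monitoring system reads individual values directly, with no message parsing.
payload = json.loads(entry)["payload"]
item_count = payload["item_count"]
city = payload["delivery_city"]
```

Because each value is stored under its own name and type, a change to the wording of a human-readable message cannot break the monitoring system that consumes the entry.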
ETW is a feature of all current versions of the Windows operating system, and can be leveraged in Azure applications when you collect Event Log data as part of your diagnostics configuration.
It is possible to create event entries for ETW by using the EventSource class in the .NET Framework directly, but it’s not a simple task. Instead, consider using a logging framework that provides a simple and consistent interface to minimize errors and simplify the code required in the application. Most logging frameworks can write event data to different types of logging destinations, such as disk files, as well as to Windows Event Log.
The Semantic Logging Application Block developed by the Microsoft patterns & practices team is an example of a framework that makes comprehensive logging easier. You create a custom event source by inheriting and extending the EventSource class in the System.Diagnostics.Tracing namespace. When you write events to the custom event source the Semantic Logging Application Block detects this and allows you to write the event to other logging destinations such as a disk file, database, email message, and more.
You can use the Semantic Logging Application Block in Azure applications that are written in .NET and run in Azure Web Sites, Cloud Services, and Virtual Machines. However, the choice of logging destination varies depending on the hosting method you choose. Consider writing to Azure storage or Azure SQL Database if you need to log events outside of the Windows Event Log.
For more information see the blog post Embracing Semantic Logging.
Telemetry, in its most basic form, is the process of gathering information generated by instrumentation and logging systems. Typically, it is performed using asynchronous mechanisms that support massive scaling and wide distribution of application services. In large and complex applications, information is usually captured in a data pipeline and stored in a form that makes it easier to analyze, and capable of presenting information at different levels of granularity. This information is used to discover trends, gain insights into usage and performance, and to detect and isolate faults.
Azure has no built-in system that directly provides a telemetry and reporting system of this type, but a combination of the features exposed by all of the Azure services allows you to create telemetry mechanisms that span the range from simple monitoring to comprehensive dashboards. The complexity of the telemetry mechanism you require usually depends on the size of the application. This is based on several factors such as the number of role or virtual machine instances, the number of ancillary services it uses, the distribution of the application across different datacenters, and other related factors.
A common approach is to collect all of the data from instrumentation and monitoring functions into a central repository such as a database located close to the application. This minimizes the write time, though it is still good practice to use asynchronous techniques based on queues and listeners to collect this information in a way that minimizes the impact on the application. Patterns such as Queue-based Load Leveling and Priority Queue are useful here.
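A minimal in-process sketch of the queue-and-listener approach is shown below, using `QueueHandler` and `QueueListener` from the Python standard library. The `RepositoryHandler` class is a hypothetical stand-in for code that writes to the central repository; in a distributed application the queue would more likely be a durable messaging service.

```python
import logging
import logging.handlers
import queue

collected = []  # stands in for the central repository


class RepositoryHandler(logging.Handler):
    """Hypothetical slow writer; runs on the listener thread, not the caller's."""

    def emit(self, record):
        collected.append(record.getMessage())


log_queue = queue.Queue()
listener = logging.handlers.QueueListener(log_queue, RepositoryHandler())
listener.start()

telemetry = logging.getLogger("telemetry")  # hypothetical logger name
telemetry.propagate = False
telemetry.setLevel(logging.INFO)
# The application only pays the cost of putting a record on the queue.
telemetry.addHandler(logging.handlers.QueueHandler(log_queue))

for order_id in (1, 2, 3):
    telemetry.info("order placed: %d", order_id)

listener.stop()  # processes everything still queued, then joins the worker thread
```

The application thread returns as soon as each record is enqueued, so a slow or briefly unavailable repository delays the listener rather than the request being served.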
The combination of all of the data in the data store can then be used to update live displays of activity and errors, generate reports and charts, and can be analyzed using database queries or even a big data solution such as HDInsight.
Considerations for Instrumentation and Telemetry
Consider the following points when designing an instrumentation and telemetry system:
- Identify the combination of information you need to collect from the built-in system monitoring features and instrumentation (such as logs and performance counters), and what additional instrumentation is required in order to comprehensively measure application performance, monitor availability, and isolate faults. There is no point in collecting information that you will never use. However, failing to collect something that might be useful, especially for debugging purposes, could make maintenance and troubleshooting more difficult. Also ensure that the logging configuration can be modified at runtime without requiring the application to be restarted; the Runtime Reconfiguration pattern is useful in this scenario.
- Use the telemetry data not only to monitor performance and to obtain early warning of problems, but also to isolate issues that arise, detect the nature of faults, perform root cause analysis, and support metering. Telemetry should be applied to both test and staged versions of the application during development to measure and validate performance, and to ensure that instrumentation and telemetry systems are operating correctly. Consider making data such as real-time, summary, and trend views available to development teams as well as administrators so that issues can be resolved more quickly, and the code can be improved where necessary.
- Consider implementing two (or more) separate channels for telemetry data, one of which is used for vital operational information such as failure of the application, services, or components. It is important that this channel receives a higher level of monitoring and alerting than channels that simply record day-to-day operational data. The Priority Queue pattern is useful in this scenario. Fine tune the alerting mechanism over time to ensure that false alarms and noise are kept to a minimum.
- Ensure you collect all information from the exceptions you handle, not just the current exception message. Many exceptions wrap inner exceptions, which may provide additional useful information.
- Log all calls to external services. Include information about the context, destination, method, timing information (such as latency), and the result (such as success or failure, and the number of retries). This information may also be useful if you need to support reports of SLA violations, either from users of your application or when challenging your hosting provider regarding failures of their services.
- Log details of transient faults and failovers in order to detect emerging or ongoing problems. For example, record the number of times that a retry action occurs, the state of a circuit breaker changes, or the application fails over to a different instance or configuration.
- Careful categorization of the data when it is written to the data store can simplify analysis and real-time monitoring, and can also assist in debugging and isolating faults. For example, it may be useful in the monitoring tools to be able to extract just data that arises from instrumentation of the application business functions, or from performance counters that measure certain infrastructure resources such as CPU and memory usage. Consider partitioning telemetry data by date, or even by hour, so that aggregators and database grooming tasks are not acting on tables that are actively being written to.
- The mechanisms for collecting and storing the data must themselves be scalable in order to match the number of items generated as the application and its services are scaled to an increasing number of instances. Ideally you should use a separate storage account for monitoring and logging data to minimize the impact of storage transactions for this data on the storage performance of the application itself, and to isolate the logging data from the application data for security purposes (for example, so that administrators and users of the monitoring system cannot access application data). Ensure the telemetry system itself is monitored so that a failure does not go undetected.
- If the application is located in different datacenters, you must decide whether to collect the data in each datacenter and combine the results in the monitoring system (such as an on-premises telemetry dashboard), or whether to centralize the data storage in one datacenter. Passing data between datacenters will incur additional costs, though this may be balanced by the savings in downloading only one dataset.
- Where possible, minimize the load on the application by using asynchronous code or queues to write events to the data store, and to move telemetry data between service instances. Avoid sending telemetry information through a logging channel in a chatty manner, which might overwhelm the diagnostics system. Alternatively, use separate channels for chunky (high-volume, high-latency, granular data) and chatty (low-volume, low-latency, high-value data) telemetry. One option for reducing the volume of telemetry data is to collect and store only data for events that are outside the normal operating limits.
- To prevent loss of data, include code to retry connections that may encounter transient errors. Design retry logic to be intelligent so that repeated failures are detected and the process abandoned after a preset number of attempts, and log the number of retries to help detect inherent or developing issues. Use variable retry intervals to minimize the chance that retry logic could overload a target system that is just recovering from a transient error when there are many queued retry attempts in the pipeline. See the Retry pattern for more information.
- You may need to implement a scheduler that collects some data items, such as performance counter values, at regular intervals if your hosting environment does not provide this feature (in Azure you can configure automatic collection in the diagnostics mechanism). Consider how often this data collection should occur, and the effect of the collection overhead on the performance of the application. Data such as performance counters, event logs, and trace events that is written to the Azure table service is written in a 60-second-wide temporal partition. Attempting to write too much data, such as an excessive number of point sources or with too narrow a collection interval, can overwhelm the table partition. Also ensure that error spikes do not trigger a high volume of insert attempts into table storage because this might trigger a throttling event.
- Consider how you will remove old or stale telemetry data that is no longer relevant. This may be a scheduled task, or initiated manually when versions change.
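The retry consideration in the list above can be sketched as follows. This is one possible approach, assuming exponential backoff with random jitter as the form of variable retry interval; the `TransientError` type, the `retry` function, and its parameters are hypothetical names for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation, retrying on TransientError with jittered exponential backoff.

    Returns (result, retries_used) so the caller can log the retry count.
    Re-raises the last error once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return operation(), attempt
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # abandon after the preset number of attempts
            # The upper bound doubles each time; jitter spreads out queued retries.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))


calls = {"count": 0}


def flaky_write():
    """Simulated telemetry write that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("store busy")
    return "written"


result, retries = retry(flaky_write, sleep=lambda seconds: None)
```

Returning the retry count alongside the result makes it straightforward to record how often retries occur, which helps reveal a developing problem in the target system.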
For more information see Cloud Service Fundamentals: Telemetry basics and troubleshooting on the Azure blog and Microsoft Azure: Telemetry Basics and Troubleshooting on the TechNet Wiki.
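As an illustration of the earlier point about collecting all information from handled exceptions, the sketch below walks the chain of wrapped exceptions in Python, where the `__cause__` and `__context__` attributes play the role of inner exceptions. The `exception_chain` and `update_order` functions are hypothetical.

```python
def exception_chain(exc):
    """Return 'Type: message' for exc and for every exception it wraps."""
    details = []
    seen = set()  # guards against a cyclic chain
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        details.append(f"{type(exc).__name__}: {exc}")
        exc = exc.__cause__ or exc.__context__
    return details


def update_order(order_id):
    """Simulated operation that wraps a low-level failure in a higher-level error."""
    try:
        raise TimeoutError("data store did not respond")
    except TimeoutError as inner:
        raise RuntimeError(f"order {order_id} update failed") from inner


try:
    update_order(42)
except RuntimeError as err:
    chain = exception_chain(err)
```

Logging the whole chain records both the business-level failure and the underlying timeout, whereas logging only the outer message would lose the likely root cause.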
Related Patterns and Guidance
The following patterns and guidance may also be relevant to your scenario when implementing instrumentation and telemetry for your applications:
- Health Endpoint Monitoring Pattern. It is typically necessary to supplement instrumentation and telemetry by monitoring applications and services to ensure that they are available, and are performing correctly. The Health Endpoint Monitoring pattern describes how to do this by submitting a request to a configurable set of endpoints and evaluating the results against a set of configurable rules.
- Service Metering Guidance. Instrumentation can be used to provide information for metering the use of applications and services. The Service Metering guidance explores how to meter the use of applications or services in order to plan future requirements; to gain an understanding of how they are used; or to bill users, organization departments, or customers.
- Queue-based Load Leveling Pattern. Telemetry systems should be designed in such a way that they exert minimum load on the monitored applications and services. Using queues to transmit telemetry data can help to achieve this. The Queue-based Load Leveling pattern explains how a queue can act as a buffer between a task and a service that it invokes to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service.
- Priority Queue Pattern. Telemetry systems often need to transmit data over more than one channel to ensure that important information is delivered quickly. The Priority Queue pattern shows how you can prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those of a lower priority.
- Retry Pattern. Telemetry systems must be resilient to transient failures and able to recover gracefully. The Retry pattern explains how to handle temporary failures when connecting to a service or network resource by transparently retrying the operation in the expectation that the failure is transient.
- Runtime Reconfiguration Pattern. Instrumentation is typically designed in such a way that the level of detail it generates can be adjusted at runtime to assist with debugging and root cause analysis. The Runtime Reconfiguration pattern explores how components of monitoring mechanisms can be reconfigured without requiring redeployment or restarting the application.
The article Cloud Service Fundamentals on the TechNet Wiki.
The article Telemetry – Application Instrumentation on the TechNet Wiki.
The Enterprise Library 6 information on MSDN.
The article Microsoft Azure: Telemetry Basics and Troubleshooting on the TechNet Wiki.
The article Event Tracing on MSDN.
The article Windows Event Log on MSDN.
The article Embracing Semantic Logging on Grigori Melnik's blog.