Monitor a distributed system by using Application Insights and OpenCensus

Azure Event Hubs
Azure Functions
Azure Service Bus
Azure Monitor

This article describes a distributed system that's created with Azure Functions, Azure Event Hubs, and Azure Service Bus. It provides details about how to monitor the end-to-end system by using OpenCensus for Python and Application Insights. This article also introduces distributed tracing and explains how it works by using Python code examples. The fictional company, Contoso, is used in the architecture to help describe the scenario.

Note

OpenCensus and OpenTelemetry are merging, but OpenCensus is still the recommended tool to monitor Azure Functions. OpenTelemetry for Azure is in preview and some features aren't available yet.

Architecture

Diagram that shows the implemented architecture divided into three steps: query, process, and upsert.

Workflow

  1. Query. A timer-triggered Azure function queries the Contoso internal API to get the latest sales data once a day. The function uses the Azure Event Hubs output binding to send the unstructured data as events.

  2. Process. Event Hubs triggers an Azure function that processes and formats the unstructured data to a pre-defined structure. The function publishes one message to Service Bus per asset that needs to be imported via the Service Bus output binding.

  3. Upsert. Service Bus triggers an Azure function that consumes messages from the queue and runs an upsert operation in the common company storage.

It's important to consider potential operation failures of this architecture. Some examples include:

  • The internal API is unavailable, which leads to an exception that's raised by the query data Azure function in step one of the architecture.
  • In step two, the process data Azure function encounters data that's outside of the conditions or parameters of the function.
  • In step three, the upsert data Azure function fails. After several retries, the messages from the Service Bus queue go in the dead-letter queue, which is a secondary queue that holds messages that can't be processed or delivered to a receiver after a predefined number of retries. Then the messages can follow an established automatic process, or they can be handled manually.
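
The retry-then-dead-letter behavior in the third failure example can be sketched in memory as follows. This is only an illustration, not the Service Bus SDK: the drain_queue function, the max_delivery_count parameter, and the upsert handler are hypothetical names chosen for this sketch.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain_queue(
    messages: List[str],
    handler: Callable[[str], None],
    max_delivery_count: int = 3,  # hypothetical retry limit for this sketch
) -> List[str]:
    """Deliver each message until it succeeds or runs out of attempts."""
    queue: Deque[Tuple[str, int]] = deque((m, 0) for m in messages)
    dead_letter: List[str] = []
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            attempts += 1
            if attempts >= max_delivery_count:
                # Retries are exhausted: park the message for later handling.
                dead_letter.append(message)
            else:
                # Put the message back so it's delivered again.
                queue.append((message, attempts))
    return dead_letter


attempt_log: List[str] = []


def upsert(message: str) -> None:
    """Hypothetical handler that always fails for one message."""
    attempt_log.append(message)
    if message == "asset-2":
        raise RuntimeError("storage rejected the record")


dead = drain_queue(["asset-1", "asset-2", "asset-3"], upsert)
print(dead)
```

In this sketch, the failing message is attempted as many times as the limit allows and then lands in the dead_letter list, where an automatic or manual process could pick it up.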

Components

  • Azure Functions is a serverless compute service that runs event-triggered code without requiring you to manage the underlying infrastructure.
  • Application Insights is a feature of Azure Monitor that monitors applications in development, test, and production. Application Insights analyzes how an application performs, and it reviews application run data to determine the cause of an incident.
  • Azure Table Storage is a service that stores nonrelational structured data (structured NoSQL data) in the cloud and provides a key/attribute store with a schemaless design.
  • Event Hubs is a scalable event ingestion service that can receive and process millions of events per second.
  • OpenCensus is a set of open-source libraries that you can use to collect distributed traces, metrics, and logging telemetry. This architecture uses the Python implementation of OpenCensus.
  • Service Bus is a fully managed message broker with message queues and publish-subscribe topics.

Scenario details

Distributed systems are made of loosely coupled components. It can be difficult to understand how the components communicate and to fully perceive the end-to-end journey of a user request. This architecture helps you see how components are connected.

Like many companies, Contoso needs to ingest on-premises or third-party data in the cloud while also collecting data about their sales by using services and in-house tools. In this architecture, a department at Contoso built an internal API that exposes the unstructured data, and they ingest the data into common storage. The common storage contains structured data from every department. The architecture shows how Contoso extracts, processes, and ingests that data in the cloud.

When you build a system, especially a distributed system, it's important to make it observable. An observable system:

  • Provides a holistic view of the health of the distributed application.
  • Measures the operational performance of the system.
  • Identifies and diagnoses failures so you can quickly resolve an issue.

Distributed tracing

In this architecture, the system is a chain of microservices. Each microservice can fail independently for various reasons. When that happens, it's important to understand what happened so you can troubleshoot. It’s helpful to isolate an end-to-end transaction and follow the journey through the app stack, which consists of services or microservices. This method is called distributed tracing.

The following sections describe how to set up distributed tracing in the architecture. Select the following Deploy to Azure button to deploy the infrastructure and the Azure function app.

Note

There isn't an internal API in the architecture, so a read of an Azure file replaces the call to an API.

Deploy to Azure

Traces and spans

A transaction is represented by a trace, which is a collection of spans. For example, when you select the purchase button to place an order on an e-commerce website, several subsequent operations take place. Some possible operations include:

  • A POST request submits to the API, which then redirects you to a “waiting page.”
  • Logs with contextual information are written.
  • An external call to web-based software to request a billing page.

Each of these operations can be part of a span. The trace is a complete description of what happens when you select the purchase button.

Similarly, in this architecture, when the query data Azure function triggers to start the daily ingestion of the sales data, a trace is created that contains multiple spans:

  • A span to confirm the trigger details.
  • A span to query the internal API.
  • A span to create and send an event to Event Hubs.

A span can have child spans. For example, the following image shows the query data Azure function as a trace:

An image that shows a complete trace composed of spans and their child spans.

  • The sendMessages span is split into two child spans: splitToMessages and writeToEventHubs. The sendMessages span requires those two suboperations to send messages.

  • All spans are children of a root span.

  • Spans give you an easy way to describe all parts involved in the query step of the query data Azure function. Each Azure function is a trace. So an end-to-end pass through Contoso’s ingestion system is the union of three traces, which are the three Azure functions. When you combine the three traces and their telemetry, you build the end-to-end journey and describe all parts of the architecture.
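
As a rough mental model, a trace can be pictured as a tree of named spans. The following sketch uses only the Python standard library and isn't the OpenCensus data model. The Span class and the queryData, confirmTrigger, and queryInternalApi names are hypothetical. The sendMessages, splitToMessages, and writeToEventHubs names come from the image description.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    """Toy span: a name plus any child spans."""
    name: str
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        """Create a child span and attach it to this span."""
        span = Span(name)
        self.children.append(span)
        return span

    def walk(self, depth: int = 0) -> Iterator[str]:
        """Yield the span names, indented by their depth in the tree."""
        yield "  " * depth + self.name
        for child_span in self.children:
            yield from child_span.walk(depth + 1)


root = Span("queryData")              # hypothetical root span name
root.child("confirmTrigger")          # hypothetical span name
root.child("queryInternalApi")        # hypothetical span name
send = root.child("sendMessages")
send.child("splitToMessages")
send.child("writeToEventHubs")

trace_lines = list(root.walk())
print("\n".join(trace_lines))
```

Printing the tree lists the root span first, with each child span indented under its parent.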

Tracers and the W3C trace context

A tracer is an object that holds contextual information. Ideally, that contextual information propagates as data transits through the Azure functions. To propagate the information, the OpenCensus extension uses the W3C trace context.

As its documentation states, the W3C trace context is a "specification that defines standard HTTP headers and a value format to propagate context information that enables distributed tracing scenarios."

A component of the system, such as a function, can create a tracer with the context of the previous component that's making the call by reading the trace parent. The format of a trace parent is:

Traceparent: [version]-[traceId]-[parentId]-[traceFlags]

For instance, if traceparent = 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-00

base16(version) = 00

base16(traceId) = 0af7651916cd43dd8448eb211c80319c

base16(parentId) = b7ad6b7169203331

base16(traceFlags) = 00

The trace ID and parent ID are the most important fields. The trace ID is a globally unique identifier of a trace. The parent ID is a globally unique identifier of a span. That span is part of the trace that the trace ID identifies.

For more information, see Traceparent header.

In the remaining sections of this article, it's assumed that the base16(version) and base16(traceFlags) are set to 00.
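
To make the format concrete, the following sketch splits a trace parent value into its four fields by using only the Python standard library. The parse_traceparent function and the TraceParent type are hypothetical helpers written for this illustration. They aren't part of OpenCensus or Azure Functions.

```python
import re
from typing import NamedTuple

# Field widths follow the layout shown earlier:
# 2, 32, 16, and 2 lowercase hexadecimal characters.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<trace_flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    trace_flags: str


def parse_traceparent(value: str) -> TraceParent:
    """Split a trace parent value into its fields, or raise ValueError."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        raise ValueError(f"Not a valid trace parent: {value!r}")
    parsed = TraceParent(**match.groupdict())
    # The W3C specification treats all-zero IDs as invalid.
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        raise ValueError(f"Trace ID and parent ID can't be all zeros: {value!r}")
    return parsed


example = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-00"
)
print(example.trace_id)
print(example.parent_id)
```

For the example value, the parsed fields match the base16 values listed earlier.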

Create a tracer with the OpenCensus extension

Use the OpenCensus extension that's specific to Azure Functions. Don't use the OpenCensus package that you might use in other cases (for example, Python web apps).

Azure Functions offers many input and output bindings, and each binding has a different way of embedding the trace parent. For this architecture, when events and messages are consumed, two Azure functions are triggered.

Before the two functions can trigger:

  1. The context (characterized by the identifier of the trace and the identifier of the current span) must be embedded in a trace parent in the W3C trace context format. This embedding is dependent on the nature of the output binding. For instance, the architecture uses Event Hubs as a messaging system. The trace parent is encoded into bytes and embedded in the sent event as the diagnostic ID property, which achieves the right trace context in the output binding.

    Two spans can be linked even if they're not parent and child. For distributed tracing, the current span points to the next one. Creating a link establishes this relationship.

    The Azure Functions Worker package manages the embedding and linking for you.

  2. An Azure function in the middle of the end-to-end flow extracts the contextual information from the passed-on trace parent. Use the OpenCensus extension for Azure Functions for this step. Instead of adding this process in the code of each Azure function, the OpenCensus extension implements a preinvocation hook on the function app level.

    The preinvocation hook:

    • Creates a span context object that holds the information of the previous span and triggers the Azure function. See a visual example of this step in the next section.
    • Creates a tracer that contains the span context and creates a new trace for the triggered Azure function.
    • Injects the tracer in the Azure function execution context.

    To ensure the traces appear in Application Insights, you must call the configure method to create and configure an Azure exporter, which exports telemetry.

    The extension is at the app level, so the steps in this section apply to all Azure functions in a function app.
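
The preinvocation hook pattern can be sketched with plain Python. This sketch isn't the code of the OpenCensus extension or the Azure Functions Python worker. The pre_invocation_hook, invoke, and process_data functions are hypothetical stand-ins, and the sketch assumes that the trace ID from the incoming trace parent carries over to the triggered function.

```python
import secrets
from types import SimpleNamespace
from typing import Any, Callable


def pre_invocation_hook(context: SimpleNamespace, traceparent: str) -> None:
    """Toy hook: build tracer state from the incoming trace parent."""
    _version, trace_id, parent_id, _flags = traceparent.split("-")
    context.tracer = SimpleNamespace(
        trace_id=trace_id,             # assumed to carry over from the caller
        parent_span_id=parent_id,      # the span that triggered this function
        span_id=secrets.token_hex(8),  # new ID for this function's root span
    )


def invoke(function: Callable[[SimpleNamespace], Any], traceparent: str) -> Any:
    """Hypothetical mini worker: run the hook, and then run the function."""
    context = SimpleNamespace()
    pre_invocation_hook(context, traceparent)
    return function(context)


def process_data(context: SimpleNamespace) -> str:
    """The function body only reads the tracer that the hook injected."""
    tracer = context.tracer
    return f"trace={tracer.trace_id} parent={tracer.parent_span_id}"


result = invoke(
    process_data,
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-00",
)
print(result)
```

The point of the design is that the function body contains no extraction logic. The hook runs once per invocation for every function that's registered in the same app.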

Understand and structure the code

In this architecture, the code in the Azure functions is structured with spans. In Python, create an OpenCensus span by using the with statement on the tracer that's injected in the Azure function execution context. The span context of the tracer provides the details of the current span and its parents. The following snippet shows the pattern:

    with context.tracer.span("nameSpan"):
        # DO SOMETHING WITHIN THAT SPAN

The following code shows details of the query data Azure function:

import datetime
import json
import logging

import azure.functions as func
from opencensus.extension.azure.functions import OpenCensusExtension
from opencensus.trace import config_integration

OpenCensusExtension.configure()
config_integration.trace_integrations(['requests'])
config_integration.trace_integrations(['logging'])

def main(timer: func.TimerRequest, outputEventHubMessage: func.Out[str], context: func.Context) -> None:

    utc_timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    if timer.past_due:
        logging.info('The timer is past due!')

    logging.info(f"Query Data Azure Function triggered. Current tracecontext is: {context.trace_context.Traceparent}")
    with context.tracer.span("queryExternalCatalog"):
        logging.info('querying the external catalog')
        content = {"key_content_1": "thisisavalue1"}
        content = json.dumps(content)

    with context.tracer.span("sendMessage"):
        logging.info('reading the external catalog')

        with context.tracer.span("splitToMessages"):
            # Placeholder: split the content into individual messages here.
            logging.info('splitting to messages')

        with context.tracer.span("setMessages"): 
            logging.info('sending messages')
            outputEventHubMessage.set(content)

    logging.info('Python timer trigger function ran at %s', utc_timestamp)

The main points in this code are:

  • An OpenCensusExtension.configure call. Perform this call in only one Azure function per function app. This action configures the Azure exporter to export Python telemetry, such as logs, metrics, and traces, to Application Insights.

  • The OpenCensus requests and logging integrations, which configure telemetry collection from the requests module (for HTTP calls) and the logging module.

  • There are five spans:

    • A root span that's part of the tracer that's injected in the context before the execution
    • queryExternalCatalog
    • sendMessage
    • splitToMessages (a child of sendMessage)
    • setMessages (a child of sendMessage)

Tracers and spans

The following diagram shows how every time a span is created, the span context of the tracer is updated.

An image that shows the code lines of the function.

In the previous diagram:

  1. An Azure function is triggered. A trace parent is injected in the tracer context object with a preinvocation hook, which is called by the Python worker before the function runs.
  2. An Azure function is run. The OpenCensusExtension.configure method is called, which initializes an Azure exporter and enables trace writing to Application Insights.

The following details explain the relationship between a tracer and a span in this architecture:

  • The tracer object of the Azure function context contains a span_context field that describes the root span.
  • Every time you create a span in code, it creates a new globally unique identifier and updates the span_context property in the tracer object of the execution context.
  • The span_context field contains the trace_id and id fields.
  • The trace_id never gets updated, but the id updates to the generated unique identifier.
  • In the previous diagram, the root span has two child spans: queryExternalApi and sendMessage.
    • The queryExternalApi span and sendMessage span have a new span ID that's different from the root_span_id.
    • The sendMessage span has two child spans: splitToMessages and setMessages. Their span IDs update in the span_context field of the tracer object of the context.
  • To capture the relationship between a child span and its parent, the spans_list field provides the lineage of spans in list form. In the splitToMessages span, the spans_list field contains sendMessage (the parent span) and splitToMessages (the current span). This parent/child relationship is how you create the chain of isolated operations within the execution of an Azure function.
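
The following toy tracer illustrates these points with only the Python standard library. It's a simplified stand-in, not the OpenCensus tracer. The ToyTracer and SpanContext classes are hypothetical, and the sketch assumes that the span ID reverts to the parent's ID when a span closes.

```python
import secrets
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class SpanContext:
    trace_id: str  # stays the same for the whole trace
    id: str        # ID of the span that's currently open


class ToyTracer:
    def __init__(self, trace_id: str, root_span_id: str) -> None:
        self.span_context = SpanContext(trace_id, root_span_id)
        self.spans_list: List[str] = []  # lineage of the open spans

    @contextmanager
    def span(self, name: str) -> Iterator[SpanContext]:
        parent_id = self.span_context.id
        # Opening a span generates a new ID and records the span's name.
        self.span_context.id = secrets.token_hex(8)
        self.spans_list.append(name)
        try:
            yield self.span_context
        finally:
            # Closing the span restores the parent as the current span.
            self.spans_list.pop()
            self.span_context.id = parent_id


tracer = ToyTracer(trace_id=secrets.token_hex(16), root_span_id=secrets.token_hex(8))
root_id = tracer.span_context.id

with tracer.span("sendMessage"):
    send_id = tracer.span_context.id
    with tracer.span("splitToMessages"):
        split_id = tracer.span_context.id
        lineage = list(tracer.spans_list)

print(lineage)
```

Inside the splitToMessages span, the lineage holds sendMessage followed by splitToMessages, and the trace_id value is the same as it was before any span opened.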

Chain the functions by using the context field

Now that the chain of operations is organized in one Azure function, you can chain it to the subsequent operations performed by the next Azure function.

A diagram that shows how the functions are chained.

In the previous diagram:

  • The setMessages span is the last span of the query data Azure function. The code within the span sends a message to Event Hubs and triggers the subsequent Azure function. The span_context field of the context tracer object contains the information related to this span. That information is tied to the query data Azure function’s context.
  • Azure Functions Worker adds a bytes-encoded Diagnostic-Id in the properties of the sent event and creates a link to the root span of the subsequent Azure function.
  • The preinvocation hook of the subsequent process data Azure function reads the Diagnostic-Id and sets the context. This step chains the Azure functions even though they run separately.

When the process data Azure function sends a message to the Service Bus queue, context is passed in the same way.
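
The handoff through the event properties can be sketched as a round trip. The Azure Functions Worker and the OpenCensus extension do this work for you, so the following code is only an illustration. The embed_trace_parent and extract_trace_parent helpers are hypothetical, and UTF-8 is an assumed encoding for the bytes value.

```python
import secrets
from typing import Dict, Tuple

DIAGNOSTIC_ID = "Diagnostic-Id"


def embed_trace_parent(
    properties: Dict[str, bytes], trace_id: str, span_id: str
) -> Dict[str, bytes]:
    """Sender side: add the trace parent to the event properties as bytes."""
    traceparent = f"00-{trace_id}-{span_id}-00"
    return {**properties, DIAGNOSTIC_ID: traceparent.encode("utf-8")}


def extract_trace_parent(properties: Dict[str, bytes]) -> Tuple[str, str]:
    """Receiver side: recover the trace ID and the parent span ID."""
    decoded = properties[DIAGNOSTIC_ID].decode("utf-8")
    _version, trace_id, parent_id, _flags = decoded.split("-")
    return trace_id, parent_id


# The query data function sends an event from within its setMessages span.
trace_id = secrets.token_hex(16)
set_messages_span_id = secrets.token_hex(8)
event_properties = embed_trace_parent({}, trace_id, set_messages_span_id)

# The process data function reads the same property before it runs.
received_trace_id, received_parent_id = extract_trace_parent(event_properties)
print(received_trace_id == trace_id, received_parent_id == set_messages_span_id)
```

The receiver ends up with the sender's trace ID and the ID of the span that sent the event, which is the information it needs to link its own root span to the previous function.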

When the monitoring configurations are in place, use the Application Insights features to query and visualize the end-to-end transactions.

Types of telemetry

There are several types of telemetry available in Application Insights. The code in this architecture generates the following telemetry:

  • Request telemetry emits when you make an HTTP call or trigger an Azure function. The entry to Contoso’s system has a timer trigger for the query data Azure function that emits request telemetry.
  • Dependency telemetry emits when you make a call to an Azure service or an external service. When the Azure function writes an event to Event Hubs, it emits dependency telemetry.
  • Trace telemetry emits from logs generated by the Azure Functions runtime and by the code in your Azure functions. The logging inside the Azure function emits trace telemetry.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Other contributors:

Next steps