The Dapr observability building block

Tip

This content is an excerpt from the eBook, Dapr for .NET Developers, available on .NET Docs or as a free downloadable PDF that can be read offline.


Modern distributed systems are complex. You start with small, loosely coupled, independently deployable services. These services cross process and server boundaries. They then consume different kinds of infrastructure backing services (databases, message brokers, key vaults). Finally, these disparate pieces compose together to form an application.

With so many separate, moving parts, how do you make sense of what is going on? Unfortunately, legacy monitoring approaches aren't enough. Instead, the system must be observable from end to end. Modern observability practices provide visibility and insight into the health of the application at all times. They enable you to infer the internal state of the system by observing its output. Observability isn't just mandatory for monitoring and troubleshooting distributed applications; it needs to be designed in from the start.

The system information used to gain observability is referred to as telemetry. It can be divided into four broad categories:

  1. Distributed tracing provides insights into the traffic between services involved in distributed business transactions.
  2. Metrics provide insights into the performance of a service and its resource consumption.
  3. Logging provides insights into how code is executing and whether errors have occurred.
  4. Health endpoints provide insight into the availability of a service.

The depth of telemetry is determined by the observability features of an application platform. Consider the Azure cloud. It provides a rich telemetry experience that includes all of the telemetry categories. With little configuration, Azure IaaS and PaaS services will propagate and publish telemetry to the Azure Monitor and Azure Application Insights services. Application Insights presents system logging, tracing, and problem areas with highly visual dashboards. It can even render a diagram showing the dependencies between services based on their communication.

However, what if an application can't use Azure PaaS and IaaS resources? Is it still possible to take advantage of the rich telemetry experience of Application Insights? The answer is yes. A non-Azure application can import libraries, add configuration, and instrument code to emit telemetry to Azure Application Insights. However, this approach tightly couples the application to Application Insights. Moving the app to a different monitoring platform could involve expensive refactoring. Wouldn't it be great to avoid tight coupling and consume observability outside of the code?

With Dapr, you can. Let's look at how Dapr can add observability to our distributed applications.

What it solves

The Dapr observability building block decouples observability from the application. It automatically captures traffic generated by Dapr sidecars and Dapr system services that make up the Dapr control plane. The block correlates traffic from a single operation that spans multiple services. It also exposes performance metrics, resource utilization, and the health of the system. Telemetry is published in open-standard formats enabling information to be fed into your monitoring back end of choice. There, the information can be visualized, queried, and analyzed.

As Dapr abstracts away the plumbing, the application is unaware of how observability is implemented. There's no need to reference libraries or implement custom instrumentation code. Dapr allows the developer to focus on building business logic instead of observability plumbing. Observability is configured at the Dapr system level and is consistent across services, even when they're created by different teams and built with different technology stacks.

How it works

Dapr's sidecar architecture enables built-in observability features. As services communicate, Dapr sidecars intercept the traffic and extract tracing, metrics, and logging information. Telemetry is published in an open standards format. By default, Dapr supports OpenTelemetry and Zipkin.

Dapr provides collectors that can publish telemetry to different back-end monitoring tools. These tools present Dapr telemetry for analysis and querying. Figure 10-1 shows the Dapr observability architecture:

Dapr observability architecture

Figure 10-1. Dapr observability architecture.

  1. Service A calls an operation on Service B. The call is routed from a Dapr sidecar for Service A to a sidecar for Service B.
  2. When Service B completes the operation, a response is sent back to Service A through the Dapr sidecars. They gather and publish all available telemetry for every request and response.
  3. The configured collector ingests the telemetry and sends it to the monitoring back end.

As a developer, keep in mind that adding observability is different from configuring other Dapr building blocks, like pub/sub or state management. Instead of referencing a building block, you add a collector and a monitoring back end. Figure 10-1 shows it's possible to configure multiple collectors that integrate with different monitoring back ends.

At the beginning of this chapter, four categories of telemetry were identified. The following sections provide detail on each category. They include instructions on how to configure collectors that integrate with popular monitoring back ends.

Distributed tracing

Distributed tracing provides insight into traffic that flows across services in a distributed application. The logs of exchanged request and response messages are a source of invaluable information for troubleshooting issues. The hard part is correlating messages that belong to the same business transaction.

Dapr uses the W3C Trace Context to correlate related messages. It injects the same context information into requests and responses that form a unique operation. Figure 10-2 shows how correlation works:

W3C Trace Context example

Note

The trace context is often referred to as a correlation token in microservice terminology.

Figure 10-2. W3C Trace Context example.

  1. Service A invokes an operation on Service B. As Service A starts the call, Dapr creates a unique trace context and injects it into the request.
  2. Service B receives the request and invokes an operation on Service C. Dapr detects that the incoming request contains a trace context and propagates it by injecting it into the outgoing request to Service C.
  3. Service C receives the request and handles it. Dapr detects that the incoming request contains a trace context and propagates it by injecting it into the outgoing response back to Service B.
  4. Service B receives the response and handles it. It then creates a new response and propagates the trace context by injecting it into the outgoing response back to Service A.
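
Under the covers, the context that Dapr propagates travels in a traceparent HTTP header, as defined by the W3C Trace Context specification. Here's what such a header looks like (the values are illustrative, taken from the specification's own example):

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

The four fields are the version, the trace ID shared by all spans in the trace, the ID of the parent span, and the trace flags (01 means the trace is sampled).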

A set of requests and responses that belong together is called a trace. Figure 10-3 shows a trace:

Traces and spans

Figure 10-3. Traces and spans.

In the figure, note how the trace represents a unique application transaction that takes place across many services. A trace is a collection of spans. Each span represents a single operation or unit of work done within the trace. Spans are the requests and responses that are sent between services that implement the unique transaction.

The next sections discuss how to inspect tracing telemetry by publishing it to a monitoring back end.

Use a Zipkin monitoring back end

Zipkin is an open-source distributed tracing system. It can ingest and visualize telemetry data. Dapr offers default support for Zipkin. The following example demonstrates how to configure Zipkin to visualize Dapr telemetry.

Enable and configure tracing

To start, tracing must be enabled for the Dapr runtime using a Dapr configuration file. Here's an example of a configuration file named dapr-config.yaml that enables tracing:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-config
  namespace: default
spec:
  tracing:
    samplingRate: "1"
    zipkin:
      endpointAddress: "http://zipkin.default.svc.cluster.local:9411/api/v2/spans"

The samplingRate attribute specifies the interval used for publishing traces. The value must be between 0 (tracing disabled) and 1 (every trace is published). With a value of 0.5, for example, every other trace is published, significantly reducing published traffic. The endpointAddress points to an endpoint on a Zipkin server running in a Kubernetes cluster. The default port for Zipkin is 9411. The configuration must be applied to the Kubernetes cluster using the Kubernetes CLI:

kubectl apply -f dapr-config.yaml
Install the Zipkin server

When installing Dapr in self-hosted mode, a Zipkin server is automatically installed and tracing is enabled in the default configuration file located in $HOME/.dapr/config.yaml or %USERPROFILE%\.dapr\config.yaml on Windows.
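
If you want a self-hosted service to use a specific configuration file instead of the default one, you can point the Dapr CLI at it when starting the service. Here's a sketch that reuses the dapr-config.yaml file shown earlier and the FineCollection service from the sample application (check dapr run --help for the flags supported by your CLI version):

dapr run --app-id finecollectionservice --config ./dapr-config.yaml dotnet run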

When installing Dapr on a Kubernetes cluster, Zipkin must be deployed manually. Use the following Kubernetes manifest file entitled zipkin.yaml to deploy a standard Zipkin server to a Kubernetes cluster:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: zipkin
  namespace: dapr-trafficcontrol
  labels:
    service: zipkin
spec:
  replicas: 1
  selector:
    matchLabels:
      service: zipkin
  template:
    metadata:
      labels:
        service: zipkin
    spec:
      containers:
        - name: zipkin
          image: openzipkin/zipkin-slim
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 9411
              protocol: TCP

---

kind: Service
apiVersion: v1
metadata:
  name: zipkin
  namespace: dapr-trafficcontrol
  labels:
    service: zipkin
spec:
  type: NodePort
  ports:
    - port: 9411
      targetPort: 9411
      nodePort: 32411
      protocol: TCP
      name: zipkin
  selector:
    service: zipkin

The deployment uses the standard openzipkin/zipkin-slim container image. The Zipkin service exposes the Zipkin web front end on port 32411, which you can use to view the telemetry. Use the Kubernetes CLI to apply the Zipkin manifest file to the Kubernetes cluster and deploy the Zipkin server:

kubectl apply -f zipkin.yaml
Configure the services to use the tracing configuration

Now everything is set up correctly to start publishing telemetry. Every Dapr sidecar that is deployed as part of the application must be instructed to emit telemetry when started. To do that, add a dapr.io/config annotation that references the dapr-config configuration to the deployment of each service. Here's an example of the Traffic Control FineCollection service's manifest file containing the annotation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: finecollectionservice
  namespace: dapr-trafficcontrol
  labels:
    app: finecollectionservice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: finecollectionservice
  template:
    metadata:
      labels:
        app: finecollectionservice
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "finecollectionservice"
        dapr.io/app-port: "6001"
        dapr.io/config: "dapr-config"
    spec:
      containers:
      - name: finecollectionservice
        image: dapr-trafficcontrol/finecollectionservice:1.0
        ports:
        - containerPort: 6001
Inspect the telemetry in Zipkin

Once the application is started, the Dapr sidecars will emit telemetry to the Zipkin server. To inspect this telemetry, point a web browser to http://localhost:32411. You'll see the Zipkin web front end:

Zipkin front end

Figure 10-4. Zipkin front end.
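
The URL above assumes the Kubernetes node ports are reachable from your machine, as with a local cluster. If they aren't, forwarding a local port to the Zipkin service is an alternative; the following sketch makes the front end available on http://localhost:9411 instead:

kubectl port-forward svc/zipkin 9411:9411 --namespace dapr-trafficcontrol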

On the Find a trace tab, you can query traces. Pressing the RUN QUERY button without specifying any restrictions will show all the ingested traces:

Zipkin traces overview

Figure 10-5. Zipkin traces overview.

Clicking the SHOW button next to a specific trace will show the details of that trace:

Zipkin trace details

Figure 10-6. Zipkin trace details.

Each item on the details page is a span that represents a request that is part of the selected trace.

Inspect the dependencies between services

Because Dapr sidecars handle traffic between services, Zipkin can use the trace information to determine the dependencies between the services. To see it in action, go to the Dependencies tab on the Zipkin web page and select the button with the magnifying glass. Zipkin will show an overview of the services and their dependencies:

Zipkin dependencies

Figure 10-7. Zipkin dependencies.

The animated dots on the lines between the services represent requests and move from source to destination. Red dots indicate a failed request.

Use a Jaeger or New Relic monitoring back end

Beyond Zipkin, other monitoring back-end software can also ingest telemetry in the Zipkin format. Jaeger is an open-source tracing system created by Uber Technologies. It's used to trace transactions between distributed services and troubleshoot complex microservices environments. New Relic is a full-stack observability platform. It links relevant data from a distributed application to provide a complete picture of your system. To try them out, specify an endpointAddress pointing to either a Jaeger or New Relic server in the Dapr configuration file. Here's an example of a configuration file that configures Dapr to send telemetry to a Jaeger server. The URL for Jaeger is identical to the URL for Zipkin. The only difference is the port on which the server runs:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-config
  namespace: default
spec:
  tracing:
    samplingRate: "1"
    zipkin:
      endpointAddress: "http://localhost:9415/api/v2/spans"
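
The chapter doesn't cover installing Jaeger itself. As an illustration, a local Jaeger instance that accepts Zipkin-format spans on port 9415 could be started with Docker. This is a sketch based on Jaeger's documented all-in-one image and its COLLECTOR_ZIPKIN_HOST_PORT setting; verify the options against the Jaeger documentation for your version:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9415 \
  -p 9415:9415 \
  -p 16686:16686 \
  jaegertracing/all-in-one

The Jaeger query front end is then available on http://localhost:16686.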

To try out New Relic, specify the endpoint of the New Relic API. Here's an example of a configuration file for New Relic:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-config
  namespace: default
spec:
  tracing:
    samplingRate: "1"
    zipkin:
      endpointAddress: "https://trace-api.newrelic.com/trace/v1?Api-Key=<NR-API-KEY>&Data-Format=zipkin&Data-Format-Version=2"

Check out the Jaeger and New Relic websites for more information on how to use them.

Metrics

Metrics provide insight into performance and resource consumption. Under the hood, Dapr emits a wide collection of system and runtime metrics. Dapr uses Prometheus as its metrics standard. Dapr sidecars and system services expose a metrics endpoint on port 9090. A Prometheus scraper calls this endpoint at a predefined interval to collect metrics. The scraper sends metric values to a monitoring back end. Figure 10-8 shows the scraping process:

Scraping Prometheus metrics

Figure 10-8. Scraping Prometheus metrics.

Each sidecar and system service exposes a metrics endpoint that listens on port 9090. The Prometheus metrics scraper captures the metrics from each endpoint and publishes the information to the monitoring back end.

Service discovery

You might wonder how the metrics scraper knows where to collect metrics. Prometheus can integrate with discovery mechanisms built into target deployment environments. For example, when running in Kubernetes, Prometheus can integrate with the Kubernetes API to find all available Kubernetes resources running in the environment.
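
As an illustration, the following Prometheus scrape configuration uses Kubernetes pod discovery to find pods that run a Dapr sidecar and scrape their metrics endpoints. It's a minimal sketch, assuming the default metrics port 9090 and the standard dapr.io/enabled annotation; the Prometheus setup described in the Dapr documentation configures this for you:

scrape_configs:
  - job_name: dapr-sidecars
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that run a Dapr sidecar (dapr.io/enabled: "true").
      - source_labels: [__meta_kubernetes_pod_annotation_dapr_io_enabled]
        action: keep
        regex: "true"
      # Point the scrape target at the sidecar's default metrics port.
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: "$1:9090"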

Metrics list

Dapr generates a large set of metrics for Dapr system services and its runtime. Some examples include:

| Metric | Source | Description |
| ------ | ------ | ----------- |
| dapr_operator_service_created_total | System | The total number of Dapr services created by the Dapr Operator service. |
| dapr_injector_sidecar_injection/requests_total | System | The total number of sidecar injection requests received by the Dapr Sidecar-Injector service. |
| dapr_placement_runtimes_total | System | The total number of hosts reported to the Dapr Placement service. |
| dapr_sentry_cert_sign_request_received_total | System | The number of certificate signing requests (CSRs) received by the Dapr Sentry service. |
| dapr_runtime_component_loaded | Runtime | The number of successfully loaded Dapr components. |
| dapr_grpc_io_server_completed_rpcs | Runtime | The count of gRPC calls by method and status. |
| dapr_http_server_request_count | Runtime | The number of HTTP requests started in an HTTP server. |
| dapr_http/client/sent_bytes | Runtime | The total bytes sent in request bodies (not including headers) by an HTTP client. |

For more information on available metrics, see the Dapr metrics documentation.

Configure Dapr metrics

At run time, you can disable the metrics collection endpoint by including the --enable-metrics=false argument in the Dapr command. You can also change the default port for the endpoint with the --metrics-port 9090 argument.
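
For example, when running a service in self-hosted mode, these arguments can be passed on the command line. Here's a sketch using the FineCollection service from the sample application; check dapr run --help to confirm how your CLI version exposes these flags:

dapr run --app-id finecollectionservice --enable-metrics=false dotnet run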

You can also use a Dapr configuration file to statically enable or disable runtime metrics collection:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-config
  namespace: dapr-trafficcontrol
spec:
  tracing:
    samplingRate: "1"
  metric:
    enabled: false

Visualize Dapr metrics

With the Prometheus scraper collecting and publishing metrics into the monitoring back end, how do you make sense of the raw data? A popular visualization tool for analyzing metrics is Grafana. With Grafana, you can create dashboards from the available metrics. Here's an example of a dashboard displaying Dapr system services metrics:

Grafana dashboard

Figure 10-9. Grafana dashboard.

The Dapr documentation includes a tutorial for installing Prometheus and Grafana.
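
You can also query the raw data directly with Prometheus's query language (PromQL), using the metric names from the list earlier in this chapter. As a small example, the following query charts the per-service rate of HTTP requests handled by the Dapr sidecars (it assumes the dapr_http_server_request_count metric and its app_id label):

sum by (app_id) (rate(dapr_http_server_request_count[5m]))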

Logging

Logging provides insight into what is happening with a service at run time. When running an application, Dapr automatically emits log entries from Dapr sidecars and Dapr system services. However, log entries instrumented in your application code aren't automatically included. To emit logging from application code, you can import a specific SDK such as the OpenTelemetry SDK for .NET. Logging from application code is covered later in this chapter in the section Use the Dapr .NET SDK.

Log entry structure

Dapr emits structured logging. Each log entry has the following format:

| Field | Description | Example |
| ----- | ----------- | ------- |
| time | ISO 8601 formatted timestamp | 2021-01-10T14:19:31.000Z |
| level | Level of the entry (debug, info, warn, or error) | info |
| type | Log type | log |
| msg | Log message | metrics server started on :62408/ |
| scope | Logging scope | dapr.runtime |
| instance | Hostname where Dapr runs | TSTSRV01 |
| app_id | Dapr app ID | finecollectionservice |
| ver | Dapr runtime version | 1.0 |

When searching through logging entries in a troubleshooting scenario, the time and level fields are especially helpful. The time field orders log entries so that you can pinpoint specific time periods. When troubleshooting, log entries at the debug level provide more information on the behavior of the code.

Plain text versus JSON format

By default, Dapr emits structured logging in plain-text format. Every log entry is formatted as a string containing key/value pairs. Here's an example of logging in plain text:

== DAPR == time="2021-01-12T16:11:39.4669323+01:00" level=info msg="starting Dapr Runtime -- version 1.0 -- commit 196483d" app_id=finecollectionservice instance=TSTSRV03 scope=dapr.runtime type=log ver=1.0
== DAPR == time="2021-01-12T16:11:39.467933+01:00" level=info msg="log level set to: info" app_id=finecollectionservice instance=TSTSRV03 scope=dapr.runtime type=log ver=1.0
== DAPR == time="2021-01-12T16:11:39.467933+01:00" level=info msg="metrics server started on :62408/" app_id=finecollectionservice instance=TSTSRV03 scope=dapr.metrics type=log ver=1.0

While simple, this format is difficult to parse. If you're viewing log entries with a monitoring tool, you'll want to enable JSON-formatted logging. With JSON entries, a monitoring tool can index and query individual fields. Here are the same log entries in JSON format:

{"app_id": "finecollectionservice", "instance": "TSTSRV03", "level": "info", "msg": "starting Dapr Runtime -- version 1.0 -- commit 196483d", "scope": "dapr.runtime", "time": "2021-01-12T16:11:39.4669323+01:00", "type": "log", "ver": "1.0"}
{"app_id": "finecollectionservice", "instance": "TSTSRV03", "level": "info", "msg": "log level set to: info", "scope": "dapr.runtime", "type": "log", "time": "2021-01-12T16:11:39.467933+01:00", "ver": "1.0"}
{"app_id": "finecollectionservice", "instance": "TSTSRV03", "level": "info", "msg": "metrics server started on :62408/", "scope": "dapr.metrics", "type": "log", "time": "2021-01-12T16:11:39.467933+01:00", "ver": "1.0"}

To enable JSON formatting, you need to configure each Dapr sidecar. In self-hosted mode, you can specify the flag --log-as-json on the command line:

dapr run --app-id finecollectionservice --log-level info --log-as-json dotnet run

In Kubernetes, you can add a dapr.io/log-as-json annotation to each deployment for the application:

annotations:
   dapr.io/enabled: "true"
   dapr.io/app-id: "finecollectionservice"
   dapr.io/app-port: "80"
   dapr.io/config: "dapr-config"
   dapr.io/log-as-json: "true"

When you install Dapr in a Kubernetes cluster using Helm, you can enable JSON formatted logging for all the Dapr system services:

helm repo add dapr https://dapr.github.io/helm-charts/
helm repo update
kubectl create namespace dapr-system
helm install dapr dapr/dapr --namespace dapr-system --set global.logAsJson=true

Collect logs

The logs emitted by Dapr can be fed into a monitoring back end for analysis. A log collector is a component that collects logs from a system and sends them to a monitoring back end. A popular log collector is Fluentd. Check out How-To: Set up Fluentd, Elastic search and Kibana in Kubernetes in the Dapr documentation. This article contains instructions for setting up Fluentd as a log collector and the ELK stack (Elasticsearch and Kibana) as a monitoring back end.

Health status

The health status of a service provides insight into its availability. Each Dapr sidecar exposes a health API that can be used by the hosting environment to determine the health of the sidecar. The API has one operation:

GET http://localhost:3500/v1.0/healthz

The operation returns one of two HTTP status codes:

  • 204: When the sidecar is healthy
  • 500: When the sidecar isn't healthy

When running in self-hosted mode, the health API isn't automatically invoked. You can, however, invoke the API from application code or a health monitoring tool.
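
For example, a .NET application could probe the sidecar's health endpoint itself. The following is a minimal sketch, assuming the sidecar listens on the default HTTP port 3500:

using System;
using System.Net;
using System.Net.Http;

using var client = new HttpClient();

// The sidecar returns 204 (No Content) when healthy and 500 when it isn't.
HttpResponseMessage response = await client.GetAsync("http://localhost:3500/v1.0/healthz");

Console.WriteLine(response.StatusCode == HttpStatusCode.NoContent
    ? "Dapr sidecar is healthy"
    : $"Dapr sidecar reported {(int)response.StatusCode}");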

When running in Kubernetes, the Dapr sidecar-injector automatically configures Kubernetes to use the health API for executing liveness probes and readiness probes.

Kubernetes uses liveness probes to determine whether a container is up and running. If a liveness probe returns a failure code, Kubernetes will assume the container is dead and automatically restart it. This feature increases the overall availability of your application.

Kubernetes uses readiness probes to determine whether a container is ready to start accepting traffic. A pod is considered ready when all of its containers are ready. Readiness determines whether a Kubernetes service can direct traffic to a pod in a load-balancing scenario. Pods that aren't ready are automatically removed from the load-balancer.

Liveness and readiness probes have several configurable parameters. Both are configured in the container spec section of a pod's manifest file. By default, Dapr uses the following configuration for each sidecar container:

livenessProbe:
      httpGet:
        path: v1.0/healthz
        port: 3500
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds : 5
      failureThreshold : 3
readinessProbe:
      httpGet:
        path: v1.0/healthz
        port: 3500
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds : 5
      failureThreshold: 3

The following parameters are available for the probes:

  • The path specifies the Dapr health API endpoint.
  • The port specifies the Dapr health API port.
  • The initialDelaySeconds specifies the number of seconds Kubernetes will wait before it starts probing a container for the first time.
  • The periodSeconds specifies the number of seconds Kubernetes will wait between each probe.
  • The timeoutSeconds specifies the number of seconds Kubernetes will wait on a response from the API before timing out. A timeout is interpreted as a failure.
  • The failureThreshold specifies the number of failed status codes Kubernetes will accept before considering the container not alive or not ready.

Dapr dashboard

Dapr offers a dashboard that presents status information on Dapr applications, components, and configurations. Use the Dapr CLI to start the dashboard as a web application on the local machine on port 8080:

dapr dashboard

For Dapr applications running in Kubernetes, use the following command:

dapr dashboard -k

The dashboard opens with an overview of all services in your application that have a Dapr sidecar. The following screenshot shows the Dapr dashboard for the Traffic Control sample application running in Kubernetes:

Dapr dashboard overview

Figure 10-10. Dapr dashboard overview.

The Dapr dashboard is invaluable when troubleshooting a Dapr application. It provides information about Dapr sidecars and system services. You can drill down into the configuration of each service, including the logging entries.

The dashboard also shows the configured components (and their configuration) for an application:

Dapr dashboard components

Figure 10-11. Dapr dashboard components.

There's a large amount of information available through the dashboard. You can discover it by running a Dapr application and browsing the dashboard.

Check out the Dapr dashboard CLI command reference in the Dapr docs for more information on the Dapr dashboard commands.

Use the Dapr .NET SDK

The Dapr .NET SDK doesn't contain any specific observability features. All observability features are offered at the Dapr level.

If you want to emit telemetry from your .NET application code, you should consider the OpenTelemetry SDK for .NET. The OpenTelemetry project is cross-platform, open source, and vendor agnostic. It provides an end-to-end implementation to generate, emit, collect, process, and export telemetry data. There's a single instrumentation library per language that supports automatic and manual instrumentation. Telemetry is published using the OpenTelemetry standard. The project has broad industry support and adoption from cloud providers, vendors, and end users.
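
As an illustration, here's a minimal sketch of wiring up the OpenTelemetry SDK in an ASP.NET Core service so that it exports traces in Zipkin format, alongside the telemetry that Dapr already emits. It assumes the OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore, and OpenTelemetry.Exporter.Zipkin NuGet packages; method and package names may differ between SDK versions:

using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Emit traces for incoming ASP.NET Core requests and export them in Zipkin
// format, so they land next to the traces the Dapr sidecars already publish.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("finecollectionservice"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddZipkinExporter(options =>
            options.Endpoint = new Uri("http://localhost:9411/api/v2/spans")));

var app = builder.Build();
app.Run();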

Sample application: Dapr Traffic Control

Because the Traffic Control sample application runs with Dapr, all the telemetry described in this chapter is available. If you run the application and open the Zipkin web front end, you'll see end-to-end tracing. Figure 10-12 shows an example:

Zipkin end-to-end tracing example

Figure 10-12. Zipkin end-to-end tracing example.

This trace shows the communication that occurs when a speeding violation has been detected:

  1. An exiting vehicle triggers the MQTT input binding that sends a message containing the vehicle license number, lane, and timestamp.
  2. The MQTT input binding invokes the TrafficControl service with the message.
  3. The TrafficControl service retrieves the state for the vehicle, appends the entry, and saves the updated vehicle state back to the state store.
  4. The TrafficControl service publishes the speeding violation using pub/sub to the speedingviolations topic.
  5. The FineCollection service receives the speeding violation using a pub/sub subscription on the speedingviolations topic.
  6. The FineCollection service invokes the vehicleinfo endpoint of the VehicleRegistration service using service invocation.
  7. The FineCollection service invokes an output binding for sending the email.

Click any trace line (span) to see more details. If you click on the last line, you'll see the sendmail binding component invoked to send the driver a violation notice.

Output binding trace details

Figure 10-13. Output binding trace details.

Summary

Detailed observability is critical to running a distributed system in production.

Dapr provides different types of telemetry, including distributed tracing, logging, metrics, and health status.

Dapr only produces telemetry for the Dapr system services and sidecars. Telemetry from your application code isn't automatically included. You can, however, use a specific SDK such as the OpenTelemetry SDK for .NET to emit telemetry from your application code.

Dapr telemetry is produced in an open-standards based format so that it can be ingested by a large set of available monitoring tools. Examples include Zipkin, Azure Application Insights, the ELK Stack, New Relic, and Grafana. See Monitor your application with Dapr in the Dapr documentation for tutorials on how to monitor your Dapr applications with specific monitoring back ends.

You'll need a telemetry scraper that ingests telemetry and publishes it to the monitoring back end.

Dapr can be configured to emit structured logging. Structured logging is favored as it can be indexed by back-end monitoring tools. Indexed logging enables users to execute rich queries when searching through the logging.

Dapr offers a dashboard that presents information about the Dapr services and configuration.
