How to measure network latency in AKS or App Service?

Tanul 1,251 Reputation points
2021-03-13T10:13:15.05+00:00

Team,

I trust you are all doing well.

For hosting REST APIs we can use Azure Kubernetes Service or App Service, but how do we measure network latency, especially after the APIs have been running in production for months?

At times we host single-file binaries built with .NET Core, Go, or Rust on these platforms, but after a few months we notice latency in them.

Can anyone suggest software, or a design pattern in the programming language, that can be used to measure network latency in the API, especially when it is hosted in Azure Kubernetes Service?

Please help, because network latency is becoming a major concern, especially with public clouds.

Thank you

Azure Kubernetes Service (AKS)
Azure App Service

Accepted answer
  1. KarishmaTiwari-MSFT 18,367 Reputation points Microsoft Employee
    2021-03-18T23:11:09.427+00:00

    Hi @Tanul

    - There is an ongoing issue with Q&A where the activity for some accounts does not show up. The dev team is investigating and working on a fix.

    - Regarding your question:
    AKS Engineering has identified an issue that leads customers to report service, workload and networking instability when running under load or with large numbers of ephemeral, periodic events (jobs). These failures are the result of disk IO saturation and throttling at the file operation (IOPS) level.

    Worker node VMs running customer workloads are regularly throttled or saturated on disk IO on their operating system disks because of the underlying quota of the storage device, which can lead to cluster and workload failure.

    This issue should be investigated (as documented in the link below) if you are seeing worker node/workload or API server unavailability. This issue can lead to NodeNotReady and loss of cluster availability in extreme cases.

    Issue identification using the Prometheus operator (recommended)
    The Prometheus operator project provides a best-practice set of monitoring and metrics for Kubernetes that covers all of the metrics above and more.

    We recommend the operator because it provides a simple Helm-based installation as well as all of the Prometheus monitoring, Grafana charts, configuration and default metrics that are critical to understanding performance, latency and stability issues such as this one.

    Additionally, the Prometheus operator deployment is specifically designed to be highly available. This helps significantly in availability scenarios where metrics could otherwise be lost due to container or cluster outages.

    Customers who run their own metrics/monitoring pipeline are encouraged to examine and replicate the USE (Utilization and Saturation) metrics/dashboard, as well as the pod-level, namespace-level and node-level utilization reports from the operator. The node reports clearly display OS disk saturation, which leads to high levels of system latency and degraded application/cluster performance.

    Please find a very detailed description of the issue as well as recommendations here: https://github.com/Azure/AKS/issues/1373

