Hi @Tanul
There has been an ongoing issue with Q&A where the activity for some accounts is not showing up. The dev team is investigating and working on a fix.
Regarding your question:
AKS Engineering has identified an issue that causes service, workload, and networking instability for customers whose clusters run under load or handle large numbers of ephemeral, periodic events (jobs). These failures are the result of disk IO saturation and throttling at the file operation (IOPS) level.
Worker node VMs running customer workloads are regularly throttled or saturated on disk IO on their operating system disks because of the underlying quota of the storage device. This can lead to cluster and workload failure.
This issue should be investigated (as documented in the link below) if you are seeing worker node/workload or API server unavailability. This issue can lead to NodeNotReady and loss of cluster availability in extreme cases.
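To make the throttling described above concrete, here is a minimal sketch of the arithmetic involved: comparing the IOPS a node is observed to perform against the IOPS quota of its OS disk. The function names, the 80% warning level, and the numbers in the usage note are assumptions for illustration only; the actual quota depends on the disk type and size you provisioned.

```python
def iops_headroom(observed_iops: float, provisioned_iops: float) -> float:
    """Return the unused fraction of the disk's IOPS quota (0.0 when at or over quota)."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be positive")
    used_fraction = observed_iops / provisioned_iops
    return max(0.0, 1.0 - used_fraction)


def is_at_risk(observed_iops: float, provisioned_iops: float, warn_at: float = 0.8) -> bool:
    """True when observed IOPS reach the (assumed) warning fraction of the quota."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be positive")
    return observed_iops / provisioned_iops >= warn_at
```

For example, with a hypothetical quota of 500 IOPS, `is_at_risk(450, 500)` flags the node, while `is_at_risk(100, 500)` does not. Once observed IOPS reach the quota, further requests queue up, which is where the latency and instability come from.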
Issue identification using the Prometheus operator (recommended)
The Prometheus operator project provides a best-practice set of monitoring and metrics for Kubernetes that covers all of the metrics above and more.
We recommend the operator because it provides a simple Helm-based installation along with the Prometheus monitoring, Grafana charts, configuration, and default metrics that are critical to understanding performance, latency, and stability issues such as this one.
Additionally, the Prometheus operator deployment is specifically designed to be highly available. This helps significantly in scenarios where metrics could otherwise be lost due to container or cluster outages.
Customers who run their own metrics/monitoring pipeline are encouraged to examine and replicate the USE (Utilization, Saturation, Errors) metrics/dashboard, as well as the pod-level, namespace-level, and node-level utilization reports from the operator. The node reports in particular clearly display OS disk saturation, which leads to high system latency and degraded application/cluster performance.
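If you want to check for disk saturation outside the dashboards, something like the following could work. It is only a sketch, assuming a Prometheus server that scrapes node_exporter and is reachable over HTTP; the base URL in the usage note, the 0.9 threshold, and the helper names are hypothetical. It uses the standard Prometheus instant-query endpoint (`/api/v1/query`) and the node_exporter metric `node_disk_io_time_seconds_total`, whose per-second rate approximates the fraction of time a device was busy.

```python
import json
import urllib.parse
import urllib.request

# PromQL: fraction of each second the device spent doing IO, averaged over 5 minutes.
DISK_BUSY_QUERY = "rate(node_disk_io_time_seconds_total[5m])"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def busy_devices(response: dict, threshold: float = 0.9) -> list:
    """Return (instance, device, busy_fraction) tuples at or above threshold, busiest first."""
    if response.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % response.get("error"))
    hits = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        busy = float(series["value"][1])  # instant vector value is [timestamp, "value"]
        if busy >= threshold:
            hits.append((labels.get("instance", "?"), labels.get("device", "?"), busy))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


def fetch(base_url: str, promql: str = DISK_BUSY_QUERY, timeout: float = 10.0) -> dict:
    """Run the query against a live Prometheus server and return the decoded JSON."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)
```

With Prometheus port-forwarded locally (an assumption about your setup), `busy_devices(fetch("http://localhost:9090"))` would list the node disks that were busy at least 90% of the time over the last five minutes. Sustained values close to 1.0 on the OS disk are the saturation pattern described above.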
Please find a very detailed description of the issue as well as recommendations here: https://github.com/Azure/AKS/issues/1373