Hello @Sergey Stoma, there is an internal incident for this (not shareable publicly).
@Asjana, Yusef, for the recent issue that started in the last week of April, the Product Group team shared the following RCA:
Incident Summary
Starting April 24th at 8:50 UTC, several customers reported a dip in their backend health status. The problem continued until May 10th while we investigated the root cause.
Our investigation found that health probes from AFD to backends were experiencing intermittent connection failures, which showed up as a dip in the backend health metric. Despite the health probe failures, there was no impact on regular customer traffic.
Root Cause
Starting in the last week of April, a scheduled OS upgrade was being rolled out to several AFD environments. Due to a bug in the newer OS version, a system configuration parameter was not set to an appropriately high value.
This parameter is usually set to an appropriately high value in anticipation of the higher load that the health probe service (on the machines that remain available) faces when a subset of AFD machines is temporarily taken offline during OS upgrades.
Because the bug set a new, lower default value, the health probe service started hitting the limit when machines were taken offline for the upgrade. This caused intermittent connection failures in the health probes for a subset of customer origins.
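To make the mechanism concrete, here is a rough sketch of the arithmetic. The RCA does not name the parameter or give any figures, so every number and name below is a made-up assumption used purely for illustration: the same probe load spread over fewer machines can cross a lowered limit while staying well under the intended one.

```python
import math


def connections_per_machine(total_connections: int, machines_online: int) -> int:
    """Probe connections each online machine carries, assuming an even spread."""
    if machines_online <= 0:
        raise ValueError("at least one machine must be online")
    return math.ceil(total_connections / machines_online)


def exceeds_limit(total_connections: int, machines_online: int, limit: int) -> bool:
    """True if the per-machine load is above the configured limit."""
    return connections_per_machine(total_connections, machines_online) > limit


# All values are hypothetical; none of them come from the RCA.
TOTAL_PROBE_CONNECTIONS = 90_000
FLEET_SIZE = 10
OFFLINE_FOR_UPGRADE = 3
INTENDED_LIMIT = 20_000   # the "appropriately high" value
BUGGY_DEFAULT = 10_000    # the lower default introduced by the bug

online = FLEET_SIZE - OFFLINE_FOR_UPGRADE

# Full fleet: 9,000 per machine, under both limits.
print(exceeds_limit(TOTAL_PROBE_CONNECTIONS, FLEET_SIZE, BUGGY_DEFAULT))
# During the upgrade: 12,858 per machine, over the buggy default only.
print(exceeds_limit(TOTAL_PROBE_CONNECTIONS, online, BUGGY_DEFAULT))
print(exceeds_limit(TOTAL_PROBE_CONNECTIONS, online, INTENDED_LIMIT))
```

In this toy setup the lowered limit only matters while machines are offline, which would be consistent with the failures being intermittent and tied to the upgrade rollout.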
Mitigation
To mitigate the problem, we overrode the lower default value introduced by the bug and set the appropriate high limit needed for the health probe service. We will roll out a permanent fix for the bug by the end of June.
Next steps and repair items
We deeply apologize for this incident and for any inconvenience it has caused. In our continuous efforts to improve platform reliability, we will be performing the following repairs:
- Identify system-level settings that have lower default values in the new OS version and configure more appropriate values for them.
- Add metrics to track connection usage in the health probe service, and add monitoring based on them.
- Roll out the fix for the bug in the latest OS version. [rollout completion end of June]
- Add monitoring to detect changes in configured limits in subsequent OS updates. [end of June]
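For anyone who wants a similar safeguard on their own machines, the last repair item could look roughly like the sketch below. This is not the Product Group's implementation; the setting names and values are hypothetical, and the check simply compares the limits seen after an update against a known-good baseline.

```python
from typing import Dict, Optional, Tuple


def find_lowered_limits(
    baseline: Dict[str, int], current: Dict[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return settings whose current value is missing or below the baseline.

    Each entry maps the setting name to (expected, actual).
    """
    regressions: Dict[str, Tuple[int, Optional[int]]] = {}
    for name, expected in baseline.items():
        actual = current.get(name)
        if actual is None or actual < expected:
            regressions[name] = (expected, actual)
    return regressions


# Hypothetical setting names and values, for illustration only.
baseline_limits = {"max_probe_connections": 20_000, "max_open_files": 65_536}
after_os_update = {"max_probe_connections": 10_000, "max_open_files": 65_536}

for name, (expected, actual) in find_lowered_limits(
    baseline_limits, after_os_update
).items():
    print(f"ALERT: {name} is {actual}, expected at least {expected}")
```

Running a check like this right after each OS update would flag a silently lowered default before it shows up as probe failures.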
The permanent fix for the OS bug was scheduled to be rolled out by the end of June, but I will check with the Product Group team for any changes in the ETA.
Regards,
Gita