@Charles Thomas @TimboBaggins Apologies for the late reply. We appreciate your patience on this.
I have got an update from the Product Owners on this. Here is the root cause analysis and mitigation taken to fix this issue.
Cause:
The retry logic had a bug that, sometimes when the training job fails, the retry counter is reset so the job will be infinitely retried.
Mitigation:
- We have fixed the retry counter to catch that kind of failure and stop retrying.
- A hotfix has been deployed.
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
**
Please do not forget to "Accept the answer” and “up-vote” wherever the information provided helps you, this can be beneficial to other community members.