Hello Rahul,
Thank you for your patience while we investigated this issue.
Based on the backend investigation, the node entered a stuck state during the reboot and recovery process. During automated recovery, Azure Batch attempted to perform pool deployment cleanup and recovery operations; however, these operations could not be completed because networking resources associated with the BYOvNet pool were still reported as being in use. As a result, node recovery could not complete successfully, and subsequent pool operations remained in a stuck state.
Additionally, the resize failure was attributed to the Batch account temporarily reaching a quota/resource allocation limit for compute resources in the region. When this limit is reached, Azure Batch may be unable to allocate or recover additional nodes successfully, which can cause operations such as reboot, resize, or delete to remain in a pending state until backend reconciliation is completed.
To help reduce the likelihood of similar issues in the future, we recommend:
• Periodically refreshing long-running pools.
• Monitoring node and pool health on a regular basis.
• Reviewing and validating networking resource configurations associated with BYOvNet pools.
• Monitoring Batch account quotas and resource utilization regularly.
• Configuring alerts for nodes that remain in rebooting, starting, or unusable states for extended periods.
• Considering a quota increase if additional compute capacity may be required in the future.
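As a rough illustration of the alerting recommendation above, here is a minimal Python sketch that flags nodes which have stayed in a rebooting, starting, or unusable state for too long. The `NodeStatus` type, the `find_stuck_nodes` helper, and the 30-minute threshold are assumptions made for this example and are not part of any Azure SDK; in practice you would fill in the node ID, state, and state transition time from whatever node listing your tooling already retrieves for the pool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# States worth alerting on when they persist (taken from the recommendation above).
WATCHED_STATES = {"rebooting", "starting", "unusable"}


@dataclass
class NodeStatus:
    """Hypothetical record holding the fields needed for the check."""
    node_id: str
    state: str
    state_transition_time: datetime  # when the node entered its current state (UTC)


def find_stuck_nodes(nodes, max_age=timedelta(minutes=30), now=None):
    """Return (node_id, state, age) for nodes in a watched state longer than max_age.

    The 30-minute default is an assumed threshold; tune it to your workload.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for node in nodes:
        age = now - node.state_transition_time
        if node.state.lower() in WATCHED_STATES and age > max_age:
            stuck.append((node.node_id, node.state, age))
    return stuck


if __name__ == "__main__":
    # Sample data only; real values would come from your pool's node listing.
    now = datetime.now(timezone.utc)
    sample = [
        NodeStatus("node-1", "rebooting", now - timedelta(hours=2)),
        NodeStatus("node-2", "idle", now - timedelta(hours=5)),
        NodeStatus("node-3", "starting", now - timedelta(minutes=5)),
    ]
    for node_id, state, age in find_stuck_nodes(sample, now=now):
        print(f"ALERT: {node_id} has been '{state}' for {age}")
```

A check like this could run on a schedule and feed whichever alerting channel you already use. Healthy states such as idle are ignored regardless of how long the node has been in them.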
For additional reference, please review the documentation below:
Hope this helps! Please let me know if you have any queries.