A Microsoft tool for creating, managing, operating, and optimizing high-performance computing (HPC) and big compute clusters in Azure.
Thank you for your patience. As discussed offline, we reviewed this behavior internally.
Cloud-init marks provisioning as failed only when it exits with a non-zero status. However, package installation errors do not always cause cloud-init to terminate in a failure state. In certain scenarios, cloud-init can log module-level errors (including package installation issues) but still complete overall execution successfully.
Since Azure CycleCloud/Jetpack determines node readiness based on the completion status of cloud-init, a node may be marked healthy if cloud-init exits successfully—even if a specific dependency later fails at runtime (for example, a Python module import).
To address this, you added an explicit validation step in the runcmd section to verify that the tqdm module can be imported. If the import fails, the provisioning workflow now exits with a non-zero status and logs an error through Jetpack, preventing the node from entering service in an inconsistent state. After implementing this change, the issue has not reoccurred.
At this time, the originally affected node is no longer available, and since the issue has not reproduced, we are unable to review the cloud-init status artifacts and logs from the failing node to determine the exact underlying root cause.
If the issue occurs again, collecting the cloud-init status output (such as the JSON summary) and relevant cloud-init logs from the affected node will allow for deeper analysis and confirmation of the specific failure condition.