cyclecloud + cloud-init failures

Anonymous
2026-03-03T15:49:04.97+00:00

Hello Team,

We are experiencing intermittent issues with compute nodes where some required Python libraries fail to install correctly.

When this issue occurs, the tqdm Python library fails to load. The node appears healthy and starts processing jobs, but those jobs fail almost immediately because the tqdm module cannot be imported. This is impacting our users: we cannot promptly detect that a node is in a bad state, so jobs continue to fail as they are scheduled on the affected node.

We use cloud-init to install all required system packages and Python libraries during node provisioning.

During our investigation, we noticed in the logs that two packages (python3-pylint-django and pylint-doc) were not available in the repositories. We removed these packages from our cloud-init configuration; however, the issue with tqdm still persists in some cases (though not consistently across all nodes).

Attached are the captured logs from one of the nodes exhibiting this behavior. Please note that the logs also contain the earlier errors related to python3-pylint-django and pylint-doc.

Azure CycleCloud

A Microsoft tool for creating, managing, operating, and optimizing high-performance computing (HPC) and big compute clusters in Azure.

Answer accepted by question author

Jilakara Hemalatha 13,505 Reputation points Microsoft External Staff Moderator
2026-03-03T17:11:18.8566667+00:00

Hello

Thank you for your patience. As discussed offline, we reviewed this behavior internally.

Cloud-init marks provisioning as failed only when it exits with a non-zero status. However, package installation errors do not always cause cloud-init to terminate in a failure state. In certain scenarios, cloud-init can log module-level errors (including package installation issues) but still complete overall execution successfully.

Since Azure CycleCloud/Jetpack determines node readiness based on the completion status of cloud-init, a node may be marked healthy if cloud-init exits successfully, even if a specific dependency later fails at runtime (for example, a Python module import).

To address this, you added an explicit validation step in the runcmd section to verify that the tqdm module can be imported. If the import fails, the provisioning workflow now exits with a non-zero status and logs an error through Jetpack, preventing the node from entering service in an inconsistent state. After implementing this change, the issue has not reoccurred.
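For readers who want a similar check, the following is a minimal sketch of an import validation, not the exact step used in this case. The function names `find_missing` and `check_imports` are hypothetical, and the list of modules to verify is an assumption; only tqdm is named in this thread.

```python
import importlib
import sys


def find_missing(modules):
    """Return (module name, error text) pairs for modules that fail to import."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except Exception as exc:
            # A partially installed package can raise more than ImportError,
            # so any exception during import is treated as a failure.
            missing.append((name, repr(exc)))
    return missing


def check_imports(modules):
    """Report failed imports on stderr and return a process exit code.

    Returns 0 when every module imports and 1 otherwise, so the caller can
    pass the value straight to sys.exit().
    """
    missing = find_missing(modules)
    for name, err in missing:
        print(f"import check failed: {name}: {err}", file=sys.stderr)
    return 1 if missing else 0
```

A script invoked from runcmd could end with `sys.exit(check_imports(["tqdm"]))`, so that a failed import produces a non-zero exit status. How that status is reported through Jetpack depends on the cluster configuration and is not shown here.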

The originally affected node is no longer available, and the issue has not reproduced, so we are unable to review the cloud-init status artifacts and logs from the failing node to determine the exact root cause.

If the issue occurs again, collecting the cloud-init status output (such as the JSON summary) and relevant cloud-init logs from the affected node will allow for deeper analysis and confirmation of the specific failure condition.
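As a rough illustration of what to look for in that JSON summary, the sketch below extracts the overall status and any recorded errors from a saved status document. The key names (`status`, `errors`, and an optional `v1` wrapper) are assumptions about the layout, which varies between cloud-init versions, and `summarize_cloud_init_status` is a hypothetical helper name.

```python
import json


def summarize_cloud_init_status(raw):
    """Summarize a cloud-init status JSON document passed in as a string.

    Assumed layout: "status" and "errors" keys either at the top level or
    nested under a "v1" key. Missing keys fall back to neutral defaults.
    """
    doc = json.loads(raw)
    # Use the nested "v1" section if present, otherwise the top-level object.
    section = doc.get("v1", doc)
    errors = list(section.get("errors", []))
    return {
        "status": section.get("status", "unknown"),
        "errors": errors,
        "has_errors": bool(errors),
    }
```

A document that reports a completed run but still lists entries under its errors key would match the behavior described above, where module-level errors are logged while overall execution succeeds.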
