How to update scheduler and execute nodes with latest OS updates?

Gary Mansell 20 Reputation points
2024-07-30T08:11:52.1766667+00:00

Hi,

We need to ensure that both our Scheduler and Execute nodes are fully patched and up to date, including kernel packages and others that require a reboot after installation.

We are running the Microsoft HPC Ubuntu 22.04 image, and I know it is possible to update packages on every node startup using a YAML Cloud-Init script. However, I suspect that the reboot required after a kernel update might prevent this from working with CycleCloud.

So, is there a way to do this with CycleCloud and Cloud-Init, or is it best just to take a copy of the Microsoft HPC Ubuntu 22.04 VM image and then customise it and apply any updates every month?

Azure CycleCloud
A Microsoft tool for creating, managing, operating, and optimizing high-performance computing (HPC) and big compute clusters in Azure.

Accepted answer
  1. Prrudram-MSFT 25,471 Reputation points
    2024-08-07T17:13:28.81+00:00

    Hello @Gary Mansell

    Updating packages on every node startup using a YAML Cloud-Init script in CycleCloud can indeed be tricky, especially when kernel updates require a reboot. Here are some suggestions:

    Using Cloud-Init with CycleCloud:-

    Cloud-Init Script: You can use a Cloud-Init script to update packages on startup. However, for kernel updates that require a reboot, you might need to handle the reboot process within the script. This can be done by scheduling a reboot after the updates are applied and ensuring the script runs again after the reboot.

    AT Command: Another approach is to use the at command within the Cloud-Init script to schedule tasks that need to run after the initial Cloud-Init process completes. This can help manage reboots and subsequent updates.
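
    The update-then-reboot flow described above can be sketched roughly as follows. This is a minimal illustration, not a tested CycleCloud recipe: it assumes the script runs as root on Ubuntu, relies on the `/var/run/reboot-required` flag file that Ubuntu's package tooling writes when an update needs a reboot, and uses hypothetical function names (`reboot_required`, `update_commands`, `apply_updates`). Whether CycleCloud node provisioning tolerates the delayed reboot is not something this sketch verifies.

```python
import os
import subprocess

# Ubuntu writes this flag file when an installed update needs a reboot.
REBOOT_FLAG = "/var/run/reboot-required"


def reboot_required(flag_path: str = REBOOT_FLAG) -> bool:
    """Return True if the reboot-required flag file exists."""
    return os.path.exists(flag_path)


def update_commands() -> list[list[str]]:
    """Commands for a non-interactive package refresh and upgrade."""
    return [
        ["apt-get", "update"],
        ["apt-get", "-y", "upgrade"],
    ]


def apply_updates() -> bool:
    """Apply updates; schedule a reboot if one is needed.

    Returns True when a reboot was scheduled. Must run as root.
    """
    env = dict(os.environ, DEBIAN_FRONTEND="noninteractive")
    for cmd in update_commands():
        subprocess.run(cmd, check=True, env=env)
    if reboot_required():
        # Delay by one minute so the calling boot stage can finish first.
        subprocess.run(["shutdown", "-r", "+1"], check=True)
        return True
    return False
```

    Calling `apply_updates()` from a startup script would perform the upgrade and schedule the reboot only when the flag file is present. Note that on the Microsoft HPC images a plain upgrade may not touch the kernel at all if kernel packages are excluded from updates.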

    Customizing the VM Image:-

    If managing reboots and updates through Cloud-Init scripts seems too complex, you might find it easier to:

    Create a Custom VM Image: Take a copy of the Microsoft HPC Ubuntu 22.04 VM image, apply all necessary updates, and then use this custom image for your nodes. This way, you can ensure that all nodes start with the latest updates and only need to apply incremental updates during their lifecycle.

    Monthly Updates: Regularly update your custom image (e.g., monthly) to include the latest patches and kernel updates. This reduces the need for extensive updates on node startup and minimizes downtime due to reboots.

    https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-cyclecloud-cloud-init-with-linux-at-command/ba-p/4023980


1 additional answer

  1. Gary Mansell 20 Reputation points
    2024-08-20T10:32:27.5033333+00:00

    Following up on this after speaking with one of the developers of the HPC images: they recommend using the Microsoft-supplied images, which are updated reasonably regularly, and they explained that kernel updates are explicitly disabled because updating the kernel would break the other HPC software in the images. Their reply:

    Thanks for raising this issue. Generally, OS kernel updates break compatibility of HPC components, e.g., Lustre. In our HPC images, the kernel is excluded from updates for this reason.

    Ubuntu 22.04: https://github.com/Azure/azhpc-images/blob/0b14bf6158ee5aaecf73f1e78ed9f0988bb722ed/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc/install_prerequisites.sh#L5

    AlmaLinux 8.7: https://github.com/Azure/azhpc-images/blob/master/alma/common/install_utils.sh#L66

    We implement this way since lots of kernel dependencies are installed which are highly coupled to a specific kernel. Thus kernel updates are not encouraged in our HPC images.

    Our HPC image releasing cadence is quarterly. In the meantime, if we get flagged for security issues, we quickly apply the patch and release a hotfix in an adhoc fashion which can be done within a week or two.

    What you need to do is just to use our latest HPC images. Or you may report security bugs (and patches, if any) to us. We will apply the fix and release the patched images.
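
    To see whether kernel packages are excluded from updates on a given node, one option is to list the packages that apt has on hold. The sketch below assumes the exclusion is done with apt holds (the linked script may use a different mechanism) and that kernel packages have names starting with `linux-`. The function names are hypothetical; `apt-mark showhold` is the standard apt command that prints one held package per line.

```python
import subprocess


def held_kernel_packages(showhold_output: str) -> list[str]:
    """Filter `apt-mark showhold` output down to kernel-related packages.

    Assumes kernel packages are named with a "linux-" prefix.
    """
    names = [line.strip() for line in showhold_output.splitlines() if line.strip()]
    return sorted(name for name in names if name.startswith("linux-"))


def query_held_kernel_packages() -> list[str]:
    """Run `apt-mark showhold` on this machine and return held kernel packages."""
    result = subprocess.run(
        ["apt-mark", "showhold"],
        check=True,
        capture_output=True,
        text=True,
    )
    return held_kernel_packages(result.stdout)
```

    An empty list from `query_held_kernel_packages()` would suggest either that no kernel packages are held or that the image excludes them by other means.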

