Programmatically creating, attaching, and detaching managed disks from VMs

jigold 1 Reputation point
2022-02-02T17:29:31.597+00:00

I'm trying to programmatically create, attach, and detach managed disks on Linux compute VMs in large numbers in quick succession using the REST API. For example, attaching and detaching 32 disks repeatedly within 10 seconds. I am running into issues where disks that should be detached from a specific LUN are not actually detached, despite using locks in my application to serialize each attach and detach operation and checking the disk's state with a GET request to the disks REST API to confirm it is either "Attached" or "Unattached" before reusing that LUN. For example, I get this error message: "Cannot attach data disk 'disk-name' to VM 'vm-name' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again. Instructions can be found at https://aka.ms/AzureDiskDetached".

I believe the correct way to create and attach a new disk is to make a PATCH request to the VM's create-or-update endpoint with only the "dataDisks" field, appending the new disk configuration to the existing dataDisks array. I believe I can also detach a disk by omitting its object from the "dataDisks" array in the PATCH request. Are there any additional requirements regarding what is specified in the "dataDisks" array? For example, judging from the UI change history, it seems Azure is updating fields in the "dataDisks" schema behind the scenes. Do I need to refresh my view of the "dataDisks" schema from the InstanceView of the VM before making a new request? Does the order of the objects in the dataDisks array matter at all?
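A minimal sketch of that attach/detach pattern, assuming the standard Microsoft.Compute storageProfile schema for the Virtual Machines - Update (PATCH) call; the helper names below are hypothetical, not an official API:

```python
# Sketch (not verified against a live subscription): helpers that build the
# body for a PATCH to the Virtual Machines - Update endpoint. The field names
# (lun, createOption, managedDisk.id) follow the Microsoft.Compute
# storageProfile schema; the helper names are my own.

def attach_body(data_disks, lun, disk_id):
    """Return a PATCH body that re-sends the existing disks and appends one
    more, attaching an already-created managed disk at the given LUN."""
    disks = list(data_disks) + [{
        "lun": lun,
        "createOption": "Attach",          # attach an existing managed disk
        "managedDisk": {"id": disk_id},
    }]
    return {"properties": {"storageProfile": {"dataDisks": disks}}}

def detach_body(data_disks, lun):
    """Return a PATCH body with the disk at the given LUN omitted, which
    detaches it; the rest of the array must be re-sent unchanged."""
    disks = [d for d in data_disks if d["lun"] != lun]
    return {"properties": {"storageProfile": {"dataDisks": disks}}}
```

The body would then be sent as `PATCH .../Microsoft.Compute/virtualMachines/{vm}?api-version=...` under `https://management.azure.com` with a bearer token.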

Also, it would be nice if there was an explicit API for attaching and detaching disks rather than having to rely on knowing the overall VM configuration for all disks.


2 answers

  1. TravisCragg-MSFT 5,681 Reputation points Microsoft Employee
    2022-02-09T21:44:12.633+00:00

    @jigold First, with some odds and ends -

    When you update an existing VM with a new disk configuration, Azure will attempt to bring the VM to that configuration. While this is happening the VM will be in the 'Updating' state, and further update operations will not complete until it finishes. I am not sure a global lock is needed; a per-VM lock should work as well.

    The approach you are taking looks good and should work, but keep in mind that the LUN on your VM object might not line up with the device ordering your OS reports. For troubleshooting failures, look at the failed operation and compare the VM's configuration before and after the operation with the configuration you passed in; you should be able to find an inconsistency between the two.
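    One way to avoid guessing at /dev/sdX ordering on the guest is to map the LUN from the VM model through the stable symlinks that Azure Linux images create under /dev/disk/azure/scsi1/. A sketch, assuming the image ships the standard Azure udev rules (the helper name is mine):

```python
# Sketch: resolve a data-disk LUN to its block device via the udev-created
# symlinks on Azure Linux images. The /dev/disk/azure/scsi1/lun{n} layout is
# an assumption about the image's udev rules.
import os

def device_for_lun(lun, base="/dev/disk/azure/scsi1"):
    """Resolve the block device behind a data-disk LUN, or return None if the
    symlink has not appeared yet (udev can lag behind the attach)."""
    link = os.path.join(base, f"lun{lun}")
    return os.path.realpath(link) if os.path.exists(link) else None
```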

    If your VMs are not properly updating to the desired state and are not returning a proper error, you can open a support case to find out what is happening.


  2. jigold 1 Reputation point
    2022-02-07T18:42:35.003+00:00

    We have built a system for processing batch jobs on top of the Azure ecosystem. Depending on the storage requirements of a job, we need to mount extra storage capacity for it. Since jobs can take as little as 0.5 seconds each and we allow 0.25-core jobs, we need to support a high turnover of disks, and the capacity needed is not known ahead of time. We also need to garbage-collect the disks properly so we do not leak excess disks or have LUN slots on the machine become unavailable because a disk did not detach cleanly.

    We have backoff and retry logic in our code. The approach is as follows:

    1. Acquire the global update lock.
    2. Make a PATCH request to add a disk, appending a new spec to a dataDisks array we maintain in memory.
    3. Once the PATCH request returns OK, release the lock to another thread/coroutine, then poll the disk state with GET requests until it is "Attached".
    4. Format the disk.
    5. Execute the job.
    6. Unmount the disk.
    7. Acquire the global update lock.
    8. Make another PATCH request that removes that disk's spec from the in-memory dataDisks array.
    9. Once the PATCH request returns OK, release the lock to another thread/coroutine, then poll the disk state with GET requests until it is "Unattached".
    10. Make a DELETE request to delete the disk.
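    Steps 3 and 9 above can be sketched as a backoff poll that runs outside the update lock; `get_state` here stands in for a GET on the disks REST API, and the helper name and timings are assumptions:

```python
# Sketch of the polling step: call get_state() (a stand-in for GET on the
# disks REST API, reading diskState) until the desired state is reached or a
# timeout elapses, with capped exponential backoff between calls.
import time

def wait_for_state(get_state, desired, timeout=120.0, base_delay=0.5,
                   sleep=time.sleep):
    """Return True once get_state() == desired, False if the timeout elapses
    first. `sleep` is injectable so tests need not actually wait."""
    deadline = time.monotonic() + timeout
    delay = base_delay
    while time.monotonic() < deadline:
        if get_state() == desired:
            return True
        sleep(delay)
        delay = min(delay * 2, 10.0)   # capped exponential backoff
    return get_state() == desired      # one last check at the deadline
```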

    If there is an error, we always attempt to detach the disk AND delete the disk.

    Is this the general approach I should take? Should I be force-detaching the disks instead of simply omitting their specs from the dataDisks array? I'm concerned that I need to hold the global update lock until the disk reaches the desired "Attached" or "Unattached" state; if so, our job throughput is quite limited, because we can't attach and detach disks in parallel.
