How to disable Azure ML snapshots (for compute clusters)

Question

I am using the Azure ML Python SDK to spawn jobs on a compute cluster using a prebuilt Docker image. The Docker image has all of the dependencies installed, as well as the source code I am running.

According to the logs, Azure ML's "snapshot" operations (which I assume upload and then download the source directory to Azure ML jobs) increase the startup time, i.e., the time it takes for an Azure ML run to begin running my code, from 2 minutes to 8-20 minutes. By comparison, when I run the exact same code on a compute instance instead of a compute cluster, startup time is 60-80 seconds. Notably, the startup logs for the compute instance make no mention of "snapshots".

I already disabled saving snapshots for historical record, but that made little difference and the startup logs (for a cluster) still show operations for snapshots. I also significantly expanded our .amlignore file, which reduced the startup time by 10+ minutes, but 5-10 minutes are still spent on the "fetching snapshots" step (which we do not even use).

Questions:

What is this "fetching snapshot" operation if it is not the saving of snapshots for historical record (which I already disabled and confirmed)?
Why does this operation only occur for compute clusters but not compute instances?
Why is this operation so slow? I.e., 5-10 minutes to "fetch" less than 10 MB of Python files.
Can I disable everything having to do with snapshots altogether? I assume this is possible because no snapshot step occurs when using compute instances in Azure ML.

Thank you very much. Please let me know if there is anything more I can provide to help debug.

Answer

@Danny Thanks for the question. To keep unnecessary files out of the snapshot, create an ignore file (.gitignore or .amlignore) in the source directory. The .amlignore file uses the same syntax as .gitignore, and any file it matches is excluded from the snapshot uploaded with your runs. I can see you have already done this.

Please follow the doc for the same.
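
If it helps to decide what else to add to .amlignore, here is a minimal sketch (not part of the Azure ML SDK) that lists the largest files under a source directory. The function name `snapshot_size_report` is hypothetical, and the script does not apply .amlignore or .gitignore rules itself: it reports every file on disk, so files you already ignore will still show up.

```python
import os


def snapshot_size_report(source_dir, top_n=10):
    """Return (total_bytes, largest) for all files under source_dir.

    largest is a list of (size_bytes, relative_path), biggest first.
    Ignore files are NOT applied; this is only a rough size check.
    """
    sizes = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # skip symlinks so broken links do not raise
            rel = os.path.relpath(path, source_dir)
            sizes.append((os.path.getsize(path), rel))
    sizes.sort(reverse=True)
    total = sum(size for size, _ in sizes)
    return total, sizes[:top_n]


if __name__ == "__main__":
    total, largest = snapshot_size_report(".")
    print(f"total: {total / 1e6:.1f} MB")
    for size, rel in largest:
        print(f"{size / 1e6:8.2f} MB  {rel}")
```

Running it from the directory you pass as the source directory should show whether large data files, checkpoints, or caches are still sitting next to the code.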

In general, a compute target can take a long time to start because the Docker images for compute targets are loaded from Azure Container Registry (ACR). By default, Azure Machine Learning creates an ACR that uses the basic service tier. Changing the ACR for your workspace to the standard or premium tier may reduce the time it takes to build and load images. For more information, see Azure Container Registry service tiers.
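
As a rough sketch of the tier change, the snippet below builds the Azure CLI command `az acr update --name <registry> --sku <tier>` from Python. It assumes the Azure CLI is installed and logged in; the helper name `change_acr_sku` and the registry name `myworkspaceacr` are hypothetical placeholders. By default it only returns the command without running it.

```python
import subprocess

ALLOWED_SKUS = ("Basic", "Standard", "Premium")


def change_acr_sku(registry_name, sku="Standard", dry_run=True):
    """Build (and optionally run) the Azure CLI call that changes an ACR tier.

    With dry_run=True the command is only returned, nothing is executed.
    """
    if sku not in ALLOWED_SKUS:
        raise ValueError(f"sku must be one of {ALLOWED_SKUS}, got {sku!r}")
    cmd = ["az", "acr", "update", "--name", registry_name, "--sku", sku]
    if not dry_run:
        # Requires the Azure CLI on PATH and an authenticated session.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "myworkspaceacr" is a placeholder; use the registry attached to your workspace.
    print(" ".join(change_acr_sku("myworkspaceacr", "Standard")))
```

Note that higher tiers are billed differently, so it is worth checking the pricing before passing `dry_run=False`.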

A compute instance is a managed cloud-based workstation for data scientists that makes it easy to get started with ML development on Azure ML while providing management and enterprise readiness capabilities. Compute instances support the full lifecycle of inner-loop ML development on Azure ML. A compute instance is fully customizable by data scientists and is tightly integrated with the Azure ML workspace.
Compute instances can be used for running notebooks through first-class web experiences for popular tools such as JupyterLab, Jupyter, and integrated notebooks, and for running R scripts. A compute instance can also be used as a training compute target.
