Azure Machine Learning serverless job disk size

Vlada Shram 40 Reputation points
2025-09-29T16:22:46.2166667+00:00

Hi,

According to the Azure Machine Learning compute instance documentation:

  • The /tmp directory can be used for temporary data, but the OS disk is limited to 120GB.
  • Temporary training data can be stored on the temporary disk mounted at /mnt. The temporary disk size is based on VM size and can handle larger datasets with higher-size VMs.
  • Installed software packages go on the OS disk.
  • OS disk is encrypted with Microsoft-managed keys (CMK not supported).

When I run an Azure ML job (serverless), it fails while trying to download ~73GB of data into /mnt/my-folder.

My compute target is: Standard_NC48ads_A100_v4

Questions:

  1. Shouldn’t the /mnt temporary disk on this VM size be large enough to store 73GB?
  2. Is there a limitation specific to serverless jobs that prevents using /mnt the same way as on compute instances?
  3. Are there recommended workarounds for handling larger input datasets in serverless jobs (e.g., mounting Blob/ADLS storage instead of downloading to /mnt)?

Thanks in advance for any guidance!

Azure Machine Learning

Answer accepted by question author

Marcin Policht 92,630 Reputation points MVP Volunteer Moderator
2025-09-29T17:49:58.9+00:00
  1. Does the /mnt "temporary disk" size support 73 GB for that VM?

On a compute instance or compute cluster, the OS disk and the temporary (ephemeral) disk are provisioned with sizes that depend on the VM SKU. A Standard_NC48ads_A100_v4 should come with a substantial temporary disk, so 73 GB is unlikely to exceed what the VM can hold when you use it directly as a compute target. In your case, though, there are a few caveats:

  • The "temporary disk" is ephemeral and may have reserved space (some fraction is reserved for system or swap).
  • Other processes, caches, or layers (e.g. FUSE, caching, overlay) might reduce effective usable space.
  • If your job also writes intermediates, logs, or other artifacts under /mnt, those consume additional space.

So the failure suggests that something else is constraining storage, beyond just the raw VM allocation.
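
One way to test this is to log, from inside the job itself, how much space the filesystem behind the target path actually reports before the download starts. The following is a minimal sketch using only the Python standard library. The function name and the 5 GiB headroom default are illustrative assumptions, not Azure ML settings.

```python
import os
import shutil

GIB = 1024 ** 3


def free_space_report(path: str, required_bytes: int, headroom_bytes: int = 5 * GIB) -> dict:
    """Compare the space a download needs with what the filesystem at `path` reports.

    `headroom_bytes` is an illustrative safety margin for logs and intermediates.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    needed = required_bytes + headroom_bytes
    return {
        "path": path,
        "total_gib": round(usage.total / GIB, 1),
        "free_gib": round(usage.free / GIB, 1),
        "needed_gib": round(needed / GIB, 1),
        "fits": usage.free >= needed,
    }


if __name__ == "__main__":
    # /mnt is the location from the question; fall back to the current
    # directory so the script also runs on machines without /mnt.
    target = "/mnt" if os.path.isdir("/mnt") else "."
    print(free_space_report(target, required_bytes=73 * GIB))
```

If the reported total is far below what the VM size advertises, that points to a limit imposed by the job environment rather than by the dataset itself.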

  2. Are there serverless-specific limitations that change how /mnt works or how it is exposed?

Yes. The serverless compute abstraction adds constraints and behaves differently from a VM that you manage directly. Some important points:

  1. When you run a job with compute = serverless (or omit compute), Azure ML handles instantiating a VM behind the scenes per job, rather than you directly provisioning a VM. The serverless subsystem can impose constraints (e.g. disk quotas, caching, snapshot overlays) that you don't see under a full VM.
  2. The Azure ML data runtime controls how input dataset URIs are accessed in a job (download, mount/stream, or hybrid). With mode = download, the runtime copies the entire dataset to the compute's local storage upfront, so the local disk must be large enough to hold all of it. With mode = mount, the runtime exposes the data through a FUSE-based mount and caches only the files you actually access (with a configurable cache size), so a full 73 GB local allocation can be avoided. Environment variables control the mount cache size (DATASET_MOUNT_CACHE_SIZE), caching behavior, and reserved free space.

If your job is forcing download mode into /mnt/my-folder, then you're effectively asking the serverless instance to commit that much space locally, which may exceed what the runtime allows or what has been allocated to that serverless execution container.

  3. Under serverless, jobs are often run in containerized or ephemeral execution environments. The abstraction may impose per-job disk limits, container overlay limits, or quotas, which may not map perfectly to the full VM's /mnt capability. Even if a VM could support 500 GB, the serverless layer might only allocate 100 GB for a job's local scratch.
  4. The data runtime won't use the last RESERVED_FREE_DISK_SPACE bytes of disk (default ~150 MB) to keep the compute healthy. Also, mount caching may prune as necessary if the cache grows too large. It's possible the serverless job environment has additional reserved buffers or safety caps to avoid "disk full" errors.
  5. The serverless environment may use snapshot layers or ephemeral overlay file systems that penalize or limit large writes. The abstraction might not map /mnt 1:1 to a raw ephemeral disk as in a full VM.

These constraints mean that the serverless execution environment may not truly give you the full raw /mnt allocation you would expect from a fully provisioned VM. The failure suggests that your job's local storage request exceeded what the serverless environment allowed or that the data runtime chose a mode that compels local copies beyond allowed limits.

  3. Recommended workarounds / best practices for large input datasets in serverless jobs

Given these constraints, here are best practices and workaround strategies to handle large datasets (like ~73 GB) in serverless Azure ML jobs:

| Strategy | How to apply | Pros / tradeoffs |
| --- | --- | --- |
| Use mount (streaming) mode instead of download | In your job spec or SDK, request the input dataset in mount (FUSE) mode (read-only or read-write as needed). Only accessed files are fetched and cached, so the full dataset need not be held locally. | Reduces upfront local storage needs and can avoid "disk full" errors if you only access part of the data. Adds I/O latency and FUSE overhead, and heavily random access can degrade performance. |
| Partition / chunk the dataset before bringing it in | Break your ~73 GB into smaller shards, then stage and process them in segments (e.g. per day, per batch). Your job can loop over partitions, or you can trigger multiple jobs. | Easier to fit within disk limits and parallelizable. Increases orchestration overhead and may require recombining results. |
| Stage data on remote storage and stream per record | If your training code allows streaming (e.g. reading from a blob/ADLS URI directly per data point), avoid downloading the entire dataset upfront. | Minimal local storage and scales well. Random access and seeks become expensive, and you lose some of the performance of local SSD access. |
| Mount blob/ADLS via blobfuse, NFS, or similar in your script | Instead of using the ML data runtime download, mount remote storage inside the job container (if permitted) and read from it. | Bypasses local disk limits. Mounting from user scripts may be restricted by network, container sandbox, or permissions, and can hit I/O and latency bottlenecks. |
| Use pipeline stages / streaming transforms (e.g. incremental ingestion) | If your data is sensor or streaming style, or partitioned by time, process it incrementally rather than loading it all at once. | More scalable and memory efficient. May require redesigning your pipeline. |
| Switch to a provisioned compute cluster / VM (rather than serverless) | If your job consistently needs a large local disk, use a managed compute cluster or VM compute target (e.g. Standard_NC48ads_A100_v4) directly. You then have more control over disk sizing and access. | More flexibility and capacity, with less risk of hitting opaque quotas. You lose the simplicity and autoscaling benefits of serverless. |
| Tune mount caching parameters | Increase DATASET_MOUNT_CACHE_SIZE, adjust cache pruning thresholds, and tune the related environment variables to control how the mount cache is used. | Helps maximize the usable local cache for the mount strategy. Still bounded by runtime limits. |
| Profile and purge intermediate files | Make sure your job is not inadvertently writing large intermediates or temp files into /mnt or the working directory. Clear logs, artifacts, and temporary outputs proactively. | Conservative housekeeping can keep you under the limit. This is hygiene rather than a full solution. |
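
For the partitioning strategy in the table above, the grouping step can be sketched as follows. This is a simple greedy planner under stated assumptions: you already have a list of (file name, size in bytes) pairs from a storage listing, and `budget_bytes` is whatever local scratch you have confirmed is available. Both names are hypothetical and nothing here calls Azure.

```python
from typing import Iterable, List, Tuple


def plan_shards(files: Iterable[Tuple[str, int]], budget_bytes: int) -> List[List[str]]:
    """Group files, in listing order, into shards whose total size stays within budget_bytes."""
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if size > budget_bytes:
            # A single file larger than the budget can never be staged locally.
            raise ValueError(f"{name} ({size} bytes) exceeds the shard budget")
        if current and current_size + size > budget_bytes:
            # Adding this file would overflow the shard: close it and start a new one.
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


if __name__ == "__main__":
    # Toy sizes in arbitrary units; real listings would use bytes.
    listing = [("a.parquet", 40), ("b.parquet", 30), ("c.parquet", 50), ("d.parquet", 10)]
    for i, shard in enumerate(plan_shards(listing, budget_bytes=70)):
        print(i, shard)
```

Each shard can then be downloaded, processed, and deleted before the next one is fetched, or handed to a separate job. The planner keeps files in listing order and does not try to pack shards optimally.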

hth

Marcin

1 additional answer

  1. Vlada Shram 40 Reputation points
    2025-10-02T19:59:54.9866667+00:00

    Thank you! I used a datastore to access the blob data in the storage account, and it worked.