- Does the /mnt "temporary disk" size support 73 GB for that VM?
For a "full VM", compute instance, or compute cluster, the OS disk and the "temporary disk" (ephemeral) volume are provisioned with sizes that depend on the VM SKU. In principle, a Standard_NC48ads_A100_v4 should have substantial ephemeral storage, so 73 GB is unlikely to exceed what the raw VM can host when it is fully provisioned in a "direct VM compute" scenario. In your case, though, there are a few caveats:
- The "temporary disk" is ephemeral and may have reserved space (some fraction is reserved for system or swap).
- Other processes, caches, or layers (e.g. FUSE, caching, overlay) might reduce effective usable space.
- If your job also writes intermediates, logs, or other artifacts under `/mnt`, those consume additional space.
So the failure suggests that something else is constraining storage, beyond just the raw VM allocation.
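To see which of these factors applies, it can help to log the free space the job actually sees before any large write. The following is a minimal sketch using only the Python standard library; `fits_on_disk` and `DEFAULT_RESERVE_BYTES` are hypothetical names, and the 150 MB default simply mirrors the reserve the data runtime is described as keeping free.

```python
import shutil

# Assumption: mirror the ~150 MB the data runtime is described as keeping free.
DEFAULT_RESERVE_BYTES = 150 * 1024**2


def fits_on_disk(path, required_bytes, reserve_bytes=DEFAULT_RESERVE_BYTES):
    """Return (fits, free_bytes) for a planned write of required_bytes under path.

    free_bytes is what the filesystem holding `path` reports as available to
    the current user, which inside a job may be less than the VM SKU suggests.
    """
    free_bytes = shutil.disk_usage(path).free
    fits = free_bytes - reserve_bytes >= required_bytes
    return fits, free_bytes
```

Calling `fits_on_disk("/mnt", 73 * 1024**3)` at the start of the job script and printing the result would show whether the serverless environment exposes enough space for a full download.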
- Are there serverless-job–specific limitations that alter how /mnt works (or is exposed)?
Yes. The "serverless compute" abstraction brings additional constraints and behavior differences compared with managing the VM directly. Some important points:
- When you run a job with `compute = serverless` (or omit compute), Azure ML handles instantiating a VM behind the scenes per job, rather than you directly provisioning a VM. The serverless subsystem can impose constraints (e.g. disk quotas, caching, snapshot overlays) that you don't see under a full VM.
- The Azure ML Data runtime handles how input dataset URIs are accessed in a job (download, mount/stream, or hybrid). If your input is specified with `mode = download`, the runtime will download the entire dataset locally (i.e. into the local compute's storage). That relies on having sufficient local disk. If `mode = mount`, the runtime uses a FUSE-like overlay filesystem (emulated mount) and caches accessed files locally (with configurable cache size) rather than bringing the entire dataset upfront. That mode can avoid a full 73 GB local allocation. There are environment-variable knobs controlling mount cache size (`DATASET_MOUNT_CACHE_SIZE`), caching behavior, and reserved free space.
  If your job is forcing download mode into /mnt/my-folder, then you're effectively asking the serverless instance to commit that much space locally, which may exceed what the runtime allows or what has been allocated to that serverless execution container.
- Under serverless, jobs are often run in containerized or ephemeral execution environments. The abstraction may impose per-job disk limits, container overlay limits, or quotas, which may not map perfectly to the full VM's `/mnt` capability. Even if a VM could support 500 GB, the serverless layer might only allocate 100 GB for a job's local scratch.
- The data runtime won't use the last `RESERVED_FREE_DISK_SPACE` bytes of disk (default ~150 MB) to keep the compute healthy. Also, mount caching may prune as necessary if the cache grows too large. It's possible the serverless job environment has additional reserved buffers or safety caps to avoid "disk full" errors.
- The serverless environment may use snapshot layers or ephemeral overlay file systems that penalize or limit large writes. The abstraction might not map `/mnt` 1:1 to a raw ephemeral disk as in a full VM.
These constraints mean that the serverless execution environment may not truly give you the full raw /mnt allocation you would expect from a fully provisioned VM. The failure suggests that your job's local storage request exceeded what the serverless environment allowed or that the data runtime chose a mode that compels local copies beyond allowed limits.
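As an illustration of the mount-versus-download point, here is a hedged sketch of a job definition with the Azure ML Python SDK v2 (`azure-ai-ml`) that requests the input in read-only mount mode and caps the mount cache. The `./src` folder, `train.py`, the input name `training_data`, and the helper names are hypothetical; passing the cache size as a byte count in a string is an assumption about the accepted value format, so check it against the data runtime documentation before relying on it.

```python
def mount_cache_env(cache_size_bytes):
    """Environment variables that cap the mount cache.

    Assumption: the value is given in bytes, passed as a string.
    """
    return {"DATASET_MOUNT_CACHE_SIZE": str(cache_size_bytes)}


def build_mounted_job(data_uri, environment, cache_size_bytes):
    """Build (but do not submit) a command job that mounts its input."""
    # Imported lazily so this module loads even where azure-ai-ml is absent.
    from azure.ai.ml import Input, command

    return command(
        code="./src",  # hypothetical folder containing train.py
        command="python train.py --data ${{inputs.training_data}}",
        inputs={
            # ro_mount fetches files on access instead of copying everything.
            "training_data": Input(type="uri_folder", path=data_uri, mode="ro_mount")
        },
        environment=environment,
        environment_variables=mount_cache_env(cache_size_bytes),
        # No compute target is set; per the answer, that selects serverless.
    )
```

The returned job would then be submitted through an `MLClient` in the usual way. Compared with download mode, the local disk only has to hold the cache, not the full ~73 GB.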
- Recommended workarounds / best practices for large input datasets in serverless jobs
Given these constraints, here are best practices and workaround strategies to handle large datasets (like ~73 GB) in serverless Azure ML jobs:
| Strategy | Description / How to apply | Pros / Tradeoffs |
|---|---|---|
| Use mount (streaming) mode instead of download | In your job spec or SDK, request the input dataset in mount (FUSE) mode (read-only or read-write as needed). That way, only accessed files are fetched and cached; the full dataset need not be held locally. | Reduces upfront local storage needs; can avoid "disk full" if your access patterns are partial. But adds I/O latency and FUSE overhead, and heavily random access can degrade performance. |
| Partition / chunk the dataset before bringing it in | Break your ~73 GB into smaller shards, stage and process them in segments (e.g. per day, per batch). Your job can loop over partitions, or you can trigger multiple jobs. | Easier to absorb within disk limits; parallelizable. But increases orchestration overhead and may require recombining results. |
| Stage data on remote storage and stream per record | If your training code allows streaming (e.g. reading from a blob/ADLS URI directly per data point), avoid downloading the entire dataset upfront. | Minimal local storage; scales well. But random access and seeks become expensive, and you lose some of the performance gains of local SSD access. |
| Mount blob/ADLS via blobfuse, NFS, or similar in your script | Instead of using ML's data runtime download, you can mount remote storage inside the job container (if permitted) and read from it. | Bypasses local disk limits. However, mounting from user scripts may be restricted by network, container sandbox, or permissions. Also can hit I/O and latency bottlenecks. |
| Use pipeline stages / streaming transforms (e.g. incremental ingestion) | If your data is sensor / streaming style or partitioned by time, process incrementally rather than load all at once. | More scalable and memory efficient. May require redesign of your pipeline. |
| Switch to a provisioned compute cluster / VM (rather than serverless) | If your job consistently needs large local disk, consider using a managed compute cluster or VM compute target (e.g. Standard_NC48ads_A100_v4) directly. Then you have more control over disk sizing and access. | More flexibility and capacity; less risk of hitting opaque quotas. But you lose the simplicity and autoscaling benefits of serverless. |
| Tune mount caching parameters | Increase `DATASET_MOUNT_CACHE_SIZE`, adjust caching prune thresholds, and tune environment vars to optimize how the mount cache is utilized. | Helps maximize usable local cache for the mount strategy. But still bounded by runtime limits. |
| Profile and purge intermediate files | Ensure your job is not inadvertently writing large intermediates or temp files into /mnt or the working directory beyond what's necessary. Clear logs, artifacts, and temporary outputs proactively. | Conservative housekeeping can keep you under the limit. But it's more a matter of hygiene than a full solution. |
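
For the partitioning strategy in the table, one simple way to plan shards is to pack files greedily under a per-job byte budget. This is only a sketch in plain Python: `pack_into_shards` is a hypothetical helper, and the file names and sizes would come from listing your own datastore.

```python
def pack_into_shards(files, max_shard_bytes):
    """Group (name, size_bytes) pairs, in order, into shards under a budget.

    Each shard's total size is at most max_shard_bytes, so one shard can be
    staged on local disk at a time. A single file larger than the budget
    cannot be placed and raises ValueError.
    """
    shards, current, current_bytes = [], [], 0
    for name, size in files:
        if size > max_shard_bytes:
            raise ValueError(f"{name} ({size} bytes) exceeds the shard budget")
        # Start a new shard when adding this file would overflow the budget.
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards
```

Each resulting shard can then be handled by one loop iteration or one job, with the local copy deleted before the next shard is fetched.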
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin