How to split a larger-than-disk file using Azure ML?

Question

Using Azure ML components and pipelines: How to split a larger-than-disk (PGN) file into shards and save the output files to a designated uri_folder on a blob storage? Feel free to provide any best-practices to achieve the goal.

I set up a component and a pipeline with the following yml configuration files:

Component

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: split_file_to_shards
display_name: Split file to shards
version: 0.0.9
type: command

inputs:
  input_data_file:
    type: uri_file
    mode: ro_mount

outputs:
  output_data_dir:
    type: uri_folder
    mode: rw_mount

environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest

code: ./
command: >-
  split -u -n r/100 --verbose ${{inputs.input_data_file}} ${{outputs.output_data_dir}}

Pipeline

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
experiment_name: sample-experiment

compute: azureml:vm-cluster-cpu

inputs:
  input_data_file:
    type: uri_file
    path: azureml:larger-than-disk-file@latest

outputs:
  output_data_dir:
    type: uri_folder
    path: azureml://datastores//paths//

jobs:
  split_pgn_to_shards:
    type: command
    component: azureml:split_file_to_shards@latest
    inputs:
      input_data_file: ${{parent.inputs.input_data_file}}
    outputs:
      output_data_dir: ${{parent.outputs.output_data_dir}}

Run commands

> az ml component create -f component.yml
> az ml job create -f pipeline.yml

I expect Azure ML to mount the input file read-only (ro_mount) and to write the processed files through the rw_mount. As I understand it, the remaining modes, download and upload, copy the file to the VM's local disk before processing and upload the results to the storage after processing, respectively, which is not what I want.

The command argument -u in split is used for unbuffered write to output.

In the Network I/O monitoring, I unexpectedly see the file being downloaded to disk. In addition, I get the following error from the component:

Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU.
Total space: 6958 MB, available space: 1243 MB (under AZ_BATCH_NODE_ROOT_DIR).

Answer

When dealing with files larger than the available disk space in Azure ML, mounting the data with ro_mount or rw_mount is usually not recommended: data that passes through a mount can still be cached on the node's local disk, which can exhaust the available space. That is most likely what you are seeing here, since the download shown in your network monitoring and the "Disk full while running job" error both point to the local disk filling up while split reads the input and writes the shards.

Here's a suggested approach to handle large file splitting within Azure ML using the Python SDK, which you would run on a Data Science Virtual Machine (DSVM) or on a compute instance:

Use Azure Blob Storage SDK: Read the large file directly from Azure Blob Storage using the azure.storage.blob Python SDK. This allows you to read the data as a stream and doesn't require downloading the entire file.

Split Data in Chunks: Process the data in chunks, writing each chunk out to a new file as it is processed. This way, you're never holding more than one chunk of data in memory at any time.

Write Each Chunk to Azure Blob Storage: Use the azure.storage.blob SDK to write each chunk to a new blob in Azure Blob Storage.
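
Before touching a storage account, the three steps above can be rehearsed locally. The following is a minimal sketch, not part of any Azure SDK: `split_stream` is a hypothetical helper, an in-memory `io.BytesIO` stands in for the input blob, and a plain dict stands in for the output container.

```python
import io

def split_stream(stream, chunk_size, sink, base_name):
    """Copy stream into sink as numbered shards of at most chunk_size bytes."""
    index = 0
    while True:
        data = stream.read(chunk_size)  # only one chunk is held in memory at a time
        if not data:
            break
        sink[f"{base_name}_{index}"] = data  # stand-in for uploading one blob
        index += 1
    return index

# Hypothetical stand-ins: BytesIO plays the input blob, a dict plays the container.
source = io.BytesIO(b"abcdefghij")
container = {}
shard_count = split_stream(source, 4, container, "shard")
print(shard_count, container)
```

With a 10-byte input and a chunk size of 4, this produces three shards, and concatenating them in order gives back the original bytes.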

An example Python script:

from azure.storage.blob import BlobServiceClient

# Create BlobServiceClient object (fill in your connection string)
service = BlobServiceClient.from_connection_string("")

# Create a blob client for the input file (container name, blob name)
blob_client = service.get_blob_client("", "")

# Create a blob client for the output file; its name is the base name of the shards
output_client = service.get_blob_client("", "")
output_base_name = output_client.blob_name

# Define chunk size (example: 1GB chunks)
CHUNK_SIZE = 1024 * 1024 * 1024

def chunk_file(file, chunk_size):
    """Read a file-like object in chunks of at most chunk_size bytes."""
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

# download_blob() returns a streaming downloader. Pass it to chunk_file directly:
# calling readall() on it would load the whole blob into memory as bytes, which
# have no read() method. read(size) needs a recent azure-storage-blob version.
downloader = blob_client.download_blob()

# Read the large file in chunks and write each chunk to a new blob
for i, chunk in enumerate(chunk_file(downloader, CHUNK_SIZE)):
    output_blob_name = f"{output_base_name}_{i}"
    print(f"Writing chunk {i} to {output_blob_name}")
    shard_client = service.get_blob_client("", output_blob_name)
    shard_client.upload_blob(chunk)
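
Note that the question's `split -n r/100` distributes whole lines, whereas fixed-size byte chunks can cut a line in half. The sketch below shows one possible way to keep lines intact by carrying the unfinished tail of each read over to the next chunk. `split_on_lines` is a hypothetical helper, and the sample data is made up for illustration. A PGN game spans several lines, so keeping whole games together would need a game-level delimiter instead of a newline.

```python
import io

def split_on_lines(stream, chunk_size):
    """Yield chunks of roughly chunk_size bytes that always end at a newline."""
    carry = b""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        buffer = carry + data
        cut = buffer.rfind(b"\n") + 1  # index just past the last newline, 0 if none
        if cut == 0:
            carry = buffer  # no complete line yet, keep reading
            continue
        yield buffer[:cut]
        carry = buffer[cut:]  # unfinished line moves to the next chunk
    if carry:
        yield carry  # final line without a trailing newline

# Made-up sample data; BytesIO stands in for the blob download stream.
sample = io.BytesIO(b"1. e4 e5\n2. Nf3 Nc6\n3. Bb5 a6\n")
shards = list(split_on_lines(sample, 12))
print(shards)
```

Each yielded chunk could then be uploaded as its own blob in the same way as the fixed-size chunks above. A chunk can grow beyond `chunk_size` when a single line is longer than one read.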
