How to split a larger-than-disk file using Azure ML?

Felix Sebastian Herzog 0 Reputation points
2023-05-18T13:46:14.8033333+00:00

Using Azure ML components and pipelines, how can I split a larger-than-disk (PGN) file into shards and save the output files to a designated uri_folder on Blob Storage? Any best practices for achieving this are welcome.

I set up a component and a pipeline with the following yml configuration files:

Component

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: split_file_to_shards
display_name: Split file to shards
version: 0.0.9
type: command

inputs:
  input_data_file:
    type: uri_file
    mode: ro_mount

outputs:
  output_data_dir:
    type: uri_folder
    mode: rw_mount

environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest

code: ./
command: >-
  split -u -n r/100 --verbose ${{inputs.input_data_file}} ${{outputs.output_data_dir}}

Pipeline

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
experiment_name: sample-experiment

compute: azureml:vm-cluster-cpu

inputs:
  input_data_file:
    type: uri_file
    path: azureml:larger-than-disk-file@latest

outputs:
  output_data_dir:
    type: uri_folder
    path: azureml://datastores/<blob_storage_name>/paths/<path_to_folder>/

jobs:
  split_pgn_to_shards:
    type: command
    component: azureml:split_file_to_shards@latest
    inputs:
      input_data_file: ${{parent.inputs.input_data_file}}
    outputs:
      output_data_dir: ${{parent.outputs.output_data_dir}}

Run commands

> az ml component create -f component.yml
> az ml job create -f pipeline.yml

I expect Azure ML to mount the input file via ro_mount and to write the processed files via rw_mount. As I understand it, the remaining modes, download and upload, copy the file to the VM's local disk before processing and upload the results to the storage after processing, respectively, which is not what I want.

The command argument -u in split is used for unbuffered write to output.
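
For reference, here is a rough Python sketch of what `split -n r/N` does: line i goes to shard i mod N, so the input only has to be read once as a stream. The function name `round_robin_split` and the two-letter suffix scheme are illustrative assumptions that mimic the default output naming of `split`, not part of the pipeline above.

```python
import os
from typing import Iterable, List


def round_robin_split(lines: Iterable[bytes], n: int, prefix: str) -> List[str]:
    """Write line i to shard i % n, mimicking `split -n r/N` with two-letter suffixes."""
    if not 1 <= n <= 26 * 26:
        raise ValueError("two-letter suffixes support 1..676 shards")
    # Shard names are the prefix plus aa, ab, ..., like the default names of split
    names = [prefix + chr(97 + i // 26) + chr(97 + i % 26) for i in range(n)]
    files = [open(name, "wb") for name in names]
    try:
        for i, line in enumerate(lines):
            files[i % n].write(line)
    finally:
        for f in files:
            f.close()
    return names
```

Note that `split` treats its last argument as a filename prefix, not as a directory. Whether the shards end up inside the output folder therefore depends on whether the resolved output path ends with a path separator, which is worth checking in the job logs.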

In the Network I/O monitoring, I unexpectedly see the file being downloaded to disk. In addition, I get the following error from the component:

Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU.
Total space: 6958 MB, available space: 1243 MB (under AZ_BATCH_NODE_ROOT_DIR).

1 answer

  1. Sedat SALMAN 13,270 Reputation points
    2023-05-19T06:51:59.5566667+00:00

    When dealing with files larger than the available disk space in Azure ML, mounting with ro_mount or rw_mount may not help as much as expected. As far as I know, the mount layer caches the data it reads on the node's local disk, and output written through rw_mount can also be staged locally before it is uploaded. Reading the whole file in one pass can therefore still fill the disk, which matches the "Disk full while running job" error you've seen. The split command itself streams its input, so the disk usage most likely comes from this caching rather than from split.

    Here's a suggested approach to handle large file splitting within Azure ML using the Python SDK, which you would run on a Data Science Virtual Machine (DSVM) or on a compute instance:

    1. Use Azure Blob Storage SDK: Read the large file directly from Azure Blob Storage using the azure.storage.blob Python SDK. This allows you to read the data as a stream and doesn't require downloading the entire file.

    2. Split Data in Chunks: Process the data in chunks, writing each chunk out to a new file as it is processed. This way, you're never holding more than one chunk of data in memory at any time.

    3. Write Each Chunk to Azure Blob Storage: Use the azure.storage.blob SDK to write each chunk to a new blob in Azure Blob Storage.

    An example Python script:

    from azure.storage.blob import BlobServiceClient
    
    # Create BlobServiceClient object
    service = BlobServiceClient.from_connection_string("<your-connection-string>")
    
    # Create a blob client for the input file
    blob_client = service.get_blob_client("<your-container>", "<your-large-file>")
    
    # Base name for the output shards; the shard index is appended below
    SHARD_PREFIX = "<your-shard-file>"
    
    # Define chunk size (example: 1GB chunks); one chunk is held in memory at a time
    CHUNK_SIZE = 1024 * 1024 * 1024
    
    def chunk_blob(client, chunk_size):
        """Yield the blob's content in byte ranges of at most chunk_size bytes."""
        total_size = client.get_blob_properties().size
        for offset in range(0, total_size, chunk_size):
            length = min(chunk_size, total_size - offset)
            # Download only this byte range instead of the whole blob
            yield client.download_blob(offset=offset, length=length).readall()
    
    # Read the large file in chunks and write each chunk to a new blob
    for i, chunk in enumerate(chunk_blob(blob_client, CHUNK_SIZE)):
        output_blob_name = f"{SHARD_PREFIX}_{i}"
        print(f"Writing chunk {i} to {output_blob_name}")
        output_client = service.get_blob_client("<your-container>", output_blob_name)
        output_client.upload_blob(chunk)
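
One caveat with splitting purely by byte count is that a shard boundary can fall in the middle of a PGN game. A minimal sketch of a game-aware variant follows. It assumes every game starts with an `[Event "..."]` tag line (the first tag of the PGN Seven Tag Roster), and the helper names `iter_lines` and `iter_pgn_shards` are hypothetical, not part of any SDK.

```python
from typing import Iterable, Iterator, List


def iter_lines(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-assemble complete lines from arbitrarily sized byte chunks."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; the rest waits for more data
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line + b"\n"
    if pending:
        yield pending


def iter_pgn_shards(chunks: Iterable[bytes], max_bytes: int) -> Iterator[bytes]:
    """Yield shards of at least max_bytes (except the last), cut only at game starts."""
    buffer: List[bytes] = []
    size = 0
    for line in iter_lines(chunks):
        # Assumption: a line starting with [Event marks the beginning of a new game
        if line.startswith(b"[Event ") and size >= max_bytes:
            yield b"".join(buffer)
            buffer, size = [], 0
        buffer.append(line)
        size += len(line)
    if buffer:
        yield b"".join(buffer)
```

To use it with Blob Storage, the chunk iterator returned by `blob_client.download_blob().chunks()` could be passed as `chunks`, and each yielded shard uploaded with `upload_blob` as in the script above. Each shard is buffered in memory before it is yielded, so `max_bytes` should be chosen with the VM's RAM in mind.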
    
    
    
    