When a file is larger than the available disk space on your Azure ML compute, splitting it through a direct ro_mount or rw_mount is usually not workable: the mount caches blocks on local disk, and the split command also writes all of its output parts to local disk, so the job exhausts local storage and fails with the "Disk full while running job" error you're seeing.
Here's a suggested approach that performs the split entirely against Azure Blob Storage using the Python SDK, which you can run on a compute instance or a Data Science Virtual Machine (DSVM):
- Use the Azure Blob Storage SDK: read the large file directly from Azure Blob Storage with the azure.storage.blob Python SDK. This lets you read the data as a stream, without downloading the entire file to local disk.
- Split the data into chunks: process the data one chunk at a time, writing each chunk out as soon as it is read. This way you never hold more than one chunk in memory.
- Write each chunk to Azure Blob Storage: use the same azure.storage.blob SDK to upload each chunk as a new blob.
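The chunking pattern in the second step can be sketched independently of Azure with a small generator; chunk_stream is a hypothetical helper name here, and the same pattern works on any file-like object that exposes read():

```python
import io

def chunk_stream(stream, chunk_size):
    """Yield successive fixed-size chunks read from a file-like object."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield data

# In-memory stream standing in for a blob download: 10 bytes, 4-byte chunks
source = io.BytesIO(b"abcdefghij")
chunks = list(chunk_stream(source, 4))
print(chunks)  # [b'abcd', b'efgh', b'ij']
```

Because it is a generator, only one chunk exists in memory at a time, which is what keeps the real job within its memory budget.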
Here is an example Python script:
from azure.storage.blob import BlobServiceClient

# Create a BlobServiceClient from the storage account connection string
service = BlobServiceClient.from_connection_string("<your-connection-string>")

# Blob client for the large input file
blob_client = service.get_blob_client("<your-container>", "<your-large-file>")

# Define chunk size (example: 1 GiB chunks)
CHUNK_SIZE = 1024 * 1024 * 1024

# Get the total blob size from its properties (no download required)
blob_size = blob_client.get_blob_properties().size

# Download one range of the blob at a time and upload it as a new blob,
# so at most one chunk is ever held in memory
offset = 0
index = 0
while offset < blob_size:
    length = min(CHUNK_SIZE, blob_size - offset)
    chunk = blob_client.download_blob(offset=offset, length=length).readall()
    output_blob_name = f"<your-shard-file>_{index}"
    print(f"Writing chunk {index} to {output_blob_name}")
    output_client = service.get_blob_client("<your-container>", output_blob_name)
    output_client.upload_blob(chunk, overwrite=True)
    offset += length
    index += 1
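As a rough sanity check after the job finishes, the number of shard blobs should equal the ceiling of the blob size divided by the chunk size; the 5 GiB figure below is just a hypothetical example:

```python
CHUNK_SIZE = 1024 * 1024 * 1024  # 1 GiB, as in the script above
blob_size = 5 * CHUNK_SIZE + 123  # hypothetical source blob slightly over 5 GiB

# Integer ceiling division avoids float rounding on very large sizes
num_shards = (blob_size + CHUNK_SIZE - 1) // CHUNK_SIZE
print(num_shards)  # 6
```

If the count of blobs with the shard prefix doesn't match, a chunk upload likely failed and the job should be rerun.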