Best compute cluster for training on large image datasets!

Okwen, Roland T 26 Reputation points

Good morning,
I have a dataset that consists of 99,000 images (256 x 256 pixels each). I am trying to use this dataset to train a generative adversarial network (GAN) for at least 1,000 epochs.
Currently, I am using a Standard_NC24r GPU cluster (24 cores, 224 GB RAM, 1440 GB disk, 4 x NVIDIA Tesla K80), but training is slow. One epoch takes about 3,000 seconds, so 1,000 epochs would take about 3,000,000 seconds, or roughly 35 days. That is more than a month to complete training.
Is there a cluster that I can use to speed up training?
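
For reference, the time estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted in this question; the function names are illustrative, and the speedup factor is a hypothetical input, not a measured value for any particular cluster.

```python
# Back-of-the-envelope estimate using the figures quoted in the question.
NUM_IMAGES = 99_000        # images in the dataset
SECONDS_PER_EPOCH = 3_000  # measured time for one epoch on the current cluster
EPOCHS = 1_000             # minimum number of epochs planned
SECONDS_PER_DAY = 86_400


def total_training_days(seconds_per_epoch: float, epochs: int) -> float:
    """Total wall-clock training time in days."""
    return seconds_per_epoch * epochs / SECONDS_PER_DAY


def days_with_speedup(seconds_per_epoch: float, epochs: int, speedup: float) -> float:
    """Training time in days if each epoch ran `speedup` times faster.

    `speedup` is a hypothetical factor, not a benchmark result.
    """
    return total_training_days(seconds_per_epoch, epochs) / speedup


days = total_training_days(SECONDS_PER_EPOCH, EPOCHS)
images_per_second = NUM_IMAGES / SECONDS_PER_EPOCH

print(f"Current throughput: {images_per_second:.0f} images/s")
print(f"Estimated training time: {days:.1f} days")
```

With these numbers the current throughput works out to 33 images per second, so any alternative cluster would need to beat that rate noticeably to bring the total below a month.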

Thanks for your help in advance

Many thanks


Azure Machine Learning
An Azure machine learning service for building and deploying models.

Accepted answer
  1. Ramr-msft 17,731 Reputation points

    @Okwen, Roland T Thanks. Rather than moving to bigger machines with more memory, you can use techniques available with AML Compute for larger datasets. ParallelRunStep (PRS) is an Azure ML pipeline step that enables parallel processing of data partitions across multiple workers on multiple nodes. PRS is designed for embarrassingly parallel workloads, e.g. training many models or batch inference.

    Also look into using some of the curated environment images provided for compute clusters.
    Specifically, look into the Dask image.

    Curated environments - Azure Machine Learning | Microsoft Learn
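
To illustrate the partition-and-process pattern that the answer describes, here is a minimal sketch using only the Python standard library. It is not the ParallelRunStep API: `make_mini_batches`, `process_batch`, and `run_parallel` are hypothetical names, threads on one machine stand in for workers on multiple nodes, and the per-batch work is a placeholder. It only shows the shape of an embarrassingly parallel workload, where each partition is processed independently and the results are gathered afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def make_mini_batches(items, batch_size):
    """Split the input into independent partitions (mini-batches)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Placeholder for per-batch work, e.g. scoring a group of images.

    Here it just returns the length of each file name.
    """
    return [len(name) for name in batch]


def run_parallel(items, batch_size, workers):
    """Process each mini-batch on its own worker and gather the results in order."""
    batches = make_mini_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the input batches
        per_batch_results = list(pool.map(process_batch, batches))
    # Flatten the list of per-batch results into one list
    return [value for result in per_batch_results for value in result]


# Hypothetical file names standing in for a partitioned image dataset
files = [f"img_{i:05d}.png" for i in range(10)]
results = run_parallel(files, batch_size=4, workers=3)
print(results)
```

The pattern works because no batch depends on the result of another, which is the property the answer refers to as embarrassingly parallel.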
