Introduction


Efficient training workloads are crucial to getting the best performance from artificial intelligence (AI) models. Whether you’re a seasoned high-performance computing (HPC) enthusiast or a curious learner, this module gives you practical guidance for improving your AI training workflows.

Scenario

A manufacturing company that specializes in automotive parts wants to improve the AI models that predict machine failure and maintenance needs. The company has vast amounts of data collected from sensors on its machinery, but it struggles to process that data and train AI models efficiently.

Management wants a setup that lets the company run multiple training jobs in parallel, significantly reducing the time required to train and refine its predictive models.

What will we be doing?

In this module, you examine the intersection of cloud computing, HPC, and AI. Then, by using Azure CycleCloud and Slurm, you explore how to deploy and manage clusters, ensure node health, and fine-tune performance specifically for AI training tasks.

Learning objectives

By the end of this module, you'll be able to:

  • Deploy a Slurm cluster using Azure CycleCloud.
  • Ensure your cluster is healthy through Node Health Checks.
  • Optimize your cluster for AI training workloads.

Prerequisites

  • Basic knowledge of HPC.
  • Understanding of Azure virtual machines, subscriptions, resource groups, and virtual networks.
  • Ability to deploy VMs and mount shared file systems.

Note: To complete the optional exercises, you need to use your own subscription, which might incur charges. You can follow along with either a trial subscription or a subscription that you already have access to.