Understanding your GPU Performance on Azure with GPU Monitor

I get lots of questions from academics.

Many are now about the performance and optimisation of cloud services, or simply about understanding what students are doing with the resources.

Many are specifically about measuring and managing the Azure GPUs used in the teaching of DNN, ML and AI: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu

The most common is 'What's the best practice for monitoring GPU cores/RAM usage on N-series DSVM(s)?'

There are solutions such as logging into each VM and running "watch nvidia-smi", but this simply does not scale and is complex to manage across an estate of machines or clusters.
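To make the per-machine approach concrete, here is a minimal Python sketch of what a single sample looks like when scripted rather than watched by eye. It assumes nvidia-smi is on the PATH of the VM; the function names and the choice of fields are mine, purely for illustration.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi (run `nvidia-smi --help-query-gpu` for the full list).
FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def _to_int(value):
    """Convert a CSV cell to int, or None for cells such as "[N/A]"."""
    try:
        return int(value.strip())
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (noheader, nounits) into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if rec:  # skip blank lines
            rows.append({name: _to_int(cell) for name, cell in zip(FIELDS, rec)})
    return rows


def sample_gpus():
    """Run nvidia-smi once on the local machine and parse its output."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling sample_gpus() on a GPU VM returns utilisation as a percentage and memory in MiB for each GPU. It only ever sees the machine it runs on, which is exactly the limitation described above.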

So the request is: how can I do this simply and get a nice visual of usage across my class or cohort?

Wouldn't it be great to have a single view of utilisation in some form of dashboard?

Well, you now can, thanks to some Microsoft colleagues, Mathew Salvaris and Miguel Fierro, who have created an app for monitoring GPUs on a single machine and across a cluster.

You can use it to record various GPU measurements during a specific period using the context-based loggers, or continuously using the gpumon CLI command. The context logger can record either to a file, which can be read back into a dataframe, or to an InfluxDB database.
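If the idea of a context-based logger is new to you, the sketch below shows the general pattern in plain Python. To be clear, this is not gpumon's API: record_to_file, read_log and the sampler argument are hypothetical names I made up to illustrate recording measurements to a file only while a block of code runs.

```python
import json
import threading
import time
from contextlib import contextmanager


@contextmanager
def record_to_file(path, sampler, interval=1.0):
    """Hypothetical helper: while the with-block runs, call sampler() every
    `interval` seconds on a background thread and append JSON lines to `path`."""
    stop = threading.Event()

    def worker():
        with open(path, "a") as fh:
            while True:
                record = {"time": time.time(), "gpus": sampler()}
                fh.write(json.dumps(record) + "\n")
                fh.flush()
                # wait() returns True once stop is set, which ends the loop
                if stop.wait(interval):
                    break

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()


def read_log(path):
    """Read the JSON-lines log back as a list of records."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

You would wrap a training run in `with record_to_file("gpu.log", some_sampler):`, where some_sampler is any function returning the current GPU readings. The list returned by read_log can be passed to pandas.DataFrame if you want a dataframe.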

Data from the InfluxDB database can then be accessed using the Python InfluxDB client, or viewed in real time using dashboards such as Grafana.
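As a rough sketch of the Python client route, the snippet below pulls the last few minutes of data with the influxdb package (the InfluxDB 1.x client, installed with pip install influxdb). The host, port, database name and measurement name are placeholders I have assumed; use whatever your logger was configured to write to.

```python
def build_query(measurement, minutes=15):
    """Build an InfluxQL query for the most recent `minutes` of a measurement."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    # InfluxQL quotes identifiers with double quotes; escape any embedded ones.
    name = measurement.replace('"', '\\"')
    return 'SELECT * FROM "{}" WHERE time > now() - {}m'.format(name, int(minutes))


def fetch_recent(measurement, host="localhost", port=8086,
                 database="gpudata", minutes=15):
    """Return recent points as a list of dicts.

    The defaults here are assumed placeholders, not values taken from gpumon.
    """
    # Imported here so the query builder can be used without the client installed.
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host=host, port=port, database=database)
    result = client.query(build_query(measurement, minutes))
    return list(result.get_points())
```

Each point comes back as a dict keyed by field and tag name, so the list drops straight into a dataframe for plotting or per-student summaries.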

They have a great example, available as a Jupyter notebook, which can be found here

Below is an example dashboard built using the InfluxDB log context and Grafana.

You can download the installation and source from https://github.com/msalvaris/gpu_monitor