Running your first benchmark using STREAM

STREAM measures sustainable memory bandwidth, which is critical for memory-bound workloads like computational fluid dynamics (CFD), finite element analysis, and data analytics. STREAM is a simple, synthetic benchmark that measures memory bandwidth for four vector operations:

Operation   Description                        Formula
---------   --------------------------------   ----------------------
Copy        Measures transfer rates            a(i) = b(i)
Scale       Adds simple arithmetic             a(i) = q × b(i)
Add         Multiple load/store operations     a(i) = b(i) + c(i)
Triad       Most representative of real code   a(i) = b(i) + q × c(i)

The Triad result is the standard metric for comparing memory bandwidth across systems.

Time to complete: 15-20 minutes

Prerequisites

  • An Azure HPC VM (HBv3, HBv4, HBv5, or HX-series recommended)
  • SSH access to the VM
  • Root or sudo privileges

Tip

For best results, use Azure HPC marketplace images (AlmaLinux-HPC or Ubuntu-HPC) which include optimized compilers and libraries.
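
If you still need to create a VM, the Azure CLI sketch below shows one way to do it. The resource group, VM name, and image URN are illustrative assumptions; confirm the URN and your regional quota before running:

# Sketch: create an HBv4 VM from the Ubuntu-HPC marketplace image.
# The image URN here is an assumption -- list current URNs with:
#   az vm image list --publisher microsoft-dsvm --offer ubuntu-hpc --all --output table
az vm create \
    --resource-group my-hpc-rg \
    --name my-hbv4-vm \
    --size Standard_HB176rs_v4 \
    --image microsoft-dsvm:ubuntu-hpc:2204:latest \
    --admin-username azureuser \
    --generate-ssh-keys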

Expected results by VM family

Use these values to validate your results:

VM Series         STREAM Triad (GB/s)   Notes
---------------   -------------------   -----------
HBv5 (with HBM)   ~7,000                HBM memory
HBv4              ~650-780              DDR5 memory
HBv3              ~330-350              DDR4 memory
HBv2              ~260                  DDR4 memory

If your result is more than about 10% below the expected value, check your configuration (see Troubleshooting).

Step 1: Connect to your VM

Connect via SSH to your HPC VM:

ssh azureuser@<vm-public-ip>

Or connect through your Slurm login node if using a cluster.
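
To confirm you landed on the intended size, query the Azure Instance Metadata Service from inside the VM:

# Query the Azure Instance Metadata Service for the VM size
curl -s -H Metadata:true \
    "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text"
# Example output: Standard_HB176rs_v4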

Step 2: Install dependencies

Option A: Azure HPC images

Azure HPC images include the necessary compilers. Verify GCC is available:

gcc --version

Option B: Manual installation

If using a standard image, install build tools:

# AlmaLinux/RHEL
sudo dnf groupinstall "Development Tools" -y

# Ubuntu
sudo apt update && sudo apt install build-essential -y
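
Step 7 and the troubleshooting section use numactl, which standard images may not include; install it the same way:

# AlmaLinux/RHEL
sudo dnf install numactl -y

# Ubuntu
sudo apt install numactl -y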

Step 3: Download and compile STREAM

Clone the Azure benchmarking repository which includes optimized STREAM configurations:

# Create working directory
mkdir -p ~/benchmarks && cd ~/benchmarks

# Clone Azure benchmarking repository
git clone https://github.com/Azure/woc-benchmarking.git
cd woc-benchmarking/apps/hpc/stream

Alternatively, download STREAM directly:

mkdir -p ~/benchmarks/stream && cd ~/benchmarks/stream
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c

Compile with optimizations for AMD EPYC processors (used in HB-series):

gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
    -DNTIMES=20 stream.c -o stream

Compiler flags explained:

Flag                            Purpose
-----------------------------   --------------------------------------------------
-O3                             Maximum optimization level
-march=znver3                   Optimize for AMD Zen 3 (HBv3); use -march=znver4
                                for the Zen 4 CPUs in HBv4 if your GCC supports it
-fopenmp                        Enable OpenMP multi-threading
-DSTREAM_ARRAY_SIZE=800000000   Array size (~6 GiB per array, ~18 GiB total)
-DNTIMES=20                     Number of iterations

Important

The array size must be large enough that data doesn't fit in cache. For HBv4/HBv5 with 1.5 GB L3 cache, use at least 800M elements.
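
As a quick arithmetic check, the sketch below derives a minimum element count from a given L3 size, using the 4× rule from the troubleshooting section and 24 bytes per element (three 8-byte arrays):

# Minimum STREAM_ARRAY_SIZE so that total memory (3 arrays x 8 bytes x N)
# is at least 4x the L3 cache; 1.5 GB L3 is the figure quoted above
L3_BYTES=$((1500 * 1024 * 1024))
MIN_ELEMENTS=$((4 * L3_BYTES / 24))
echo "Minimum STREAM_ARRAY_SIZE: $MIN_ELEMENTS"   # 262144000 here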

Step 4: Configure thread affinity

Proper thread pinning is critical for accurate results. Set OpenMP environment variables:

# Get number of physical cores
NCORES=$(lscpu | grep "^Core(s) per socket:" | awk '{print $4}')
NSOCKETS=$(lscpu | grep "^Socket(s):" | awk '{print $2}')
TOTAL_CORES=$((NCORES * NSOCKETS))

echo "Total physical cores: $TOTAL_CORES"

# Set OpenMP configuration
export OMP_NUM_THREADS=$TOTAL_CORES
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

For example, on HBv4 (176 physical cores):

export OMP_NUM_THREADS=176
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

For other VM sizes, set OMP_NUM_THREADS to the physical core count computed above.
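
Azure HB-series VMs have SMT disabled, so each vCPU is a physical core; a quick sanity check that the thread count matches:

# With SMT off, "CPU(s)" should equal the physical core count used above
lscpu | grep -E "^CPU\(s\):|^Thread\(s\) per core:"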

Step 5: Run the benchmark

Execute STREAM:

./stream

Sample output (HBv4):

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 800000000 (elements), Offset = 0 (elements)
Memory per array = 6103.5 MiB (= 5.96 GiB).
Total memory required = 18310.5 MiB (= 17.88 GiB).
Each kernel will be executed 20 times.
-------------------------------------------------------------
Number of Threads requested = 176
Number of Threads counted = 176
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          753284.2     0.017157     0.016966     0.018884
Scale:         707935.3     0.018260     0.018045     0.019629
Add:           756972.9     0.025508     0.025318     0.027311
Triad:         757820.9     0.025464     0.025290     0.027212
-------------------------------------------------------------

The Triad Best Rate (757,820.9 MB/s ≈ 758 GB/s) is the key result.
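
For scripted runs you can pull the Triad rate straight from the output, and the number can be cross-checked by hand: Triad moves 24 bytes per element (two 8-byte loads plus one store), so rate ≈ 24 × N / min time:

# Extract the Triad best rate (MB/s) from a run
TRIAD=$(./stream | awk '/^Triad:/ {print $2}')
echo "Triad: $TRIAD MB/s"

# Cross-check against the sample above: 24 B/element x 800e6 elements / 0.025290 s
echo "scale=0; 24 * 800000000 / 0.025290 / 10^6" | bc   # ~759193 MB/s, near the reported rate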

Step 6: Validate results

Compare your Triad result against expected values:

# Quick validation script
TRIAD_RESULT=757820  # Replace with your result in MB/s
VM_TYPE="HBv4"       # HBv2, HBv3, HBv4, or HBv5

case $VM_TYPE in
    "HBv5") EXPECTED=7000000 ;;
    "HBv4") EXPECTED=700000 ;;
    "HBv3") EXPECTED=330000 ;;
    "HBv2") EXPECTED=260000 ;;
esac

PERCENT=$(echo "scale=1; $TRIAD_RESULT * 100 / $EXPECTED" | bc)
echo "Achieved $PERCENT% of expected bandwidth"

Results interpretation:

Achievement   Interpretation
-----------   -----------------------------------------
95-105%       Excellent - VM performing as expected
85-95%        Good - Minor optimization possible
70-85%        Investigate - Check thread affinity, NUMA
<70%          Problem - Check configuration

Step 7: Run on multiple NUMA domains (advanced)

For detailed NUMA analysis, run STREAM per NUMA domain:

# Check NUMA topology
numactl --hardware

# Run on NUMA node 0 only (set the thread count to the cores per node
# shown by numactl --hardware)
OMP_NUM_THREADS=22 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --cpunodebind=0 --membind=0 ./stream

# Run across all NUMA domains with memory interleaved between nodes
OMP_NUM_THREADS=176 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --interleave=all ./stream
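
To sweep every node, a small loop sketch (assumes the TOTAL_CORES variable from Step 4 and an even core split across nodes; verify both against numactl --hardware):

# Run STREAM bound to each NUMA node in turn and report the Triad rate
NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
CORES_PER_NODE=$((TOTAL_CORES / NODES))
for node in $(seq 0 $((NODES - 1))); do
    rate=$(OMP_NUM_THREADS=$CORES_PER_NODE OMP_PROC_BIND=spread OMP_PLACES=cores \
        numactl --cpunodebind=$node --membind=$node ./stream | awk '/^Triad:/ {print $2}')
    echo "NUMA node $node: Triad $rate MB/s"
done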

Troubleshooting

Low bandwidth results

Symptom: Results significantly below expected values

Solutions:

  1. Check thread count:

    echo $OMP_NUM_THREADS
    # Should match physical core count
    
  2. Verify thread binding:

    export OMP_DISPLAY_ENV=TRUE
    ./stream 2>&1 | head -20
    
  3. Check for CPU frequency scaling (a snippet to set the governor follows this list):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    # Should be "performance" for benchmarking
    
  4. Verify NUMA memory policy:

    numactl --show
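
If the governor is not set to performance, the sketch below switches it on every core (requires the cpufreq sysfs interface, which some VM sizes do not expose):

# Set the performance governor on every core where the cpufreq interface exists
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -f "$gov" ] && echo performance | sudo tee "$gov" > /dev/null
done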
    

Array size too small

Symptom: Results higher than expected (measuring cache, not memory)

Solution: Increase STREAM_ARRAY_SIZE at compile time. Total memory used should be at least 4× the L3 cache size.

# Recompile with larger array
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=1000000000 \
    -DNTIMES=20 stream.c -o stream

Inconsistent results

Symptom: Large variation between runs

Solutions:

  1. Ensure no other processes are running:

    top -b -n 1 | head -20
    
  2. Run more iterations:

    # Recompile with more iterations
    gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
        -DNTIMES=50 stream.c -o stream
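
To quantify the variation, a short loop prints each run's Triad rate:

# Run STREAM five times and print each Triad rate to gauge run-to-run spread
for i in $(seq 1 5); do
    rate=$(./stream | awk '/^Triad:/ {print $2}')
    echo "Run $i: Triad $rate MB/s"
done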
    

Running STREAM in a Slurm job

If using a Slurm cluster, create a job script:

cat << 'EOF' > stream-job.sh
#!/bin/bash
#SBATCH --job-name=stream
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=176
#SBATCH --time=00:10:00
#SBATCH --partition=hpc
#SBATCH --exclusive

# Set thread configuration
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

# Run STREAM
cd ~/benchmarks/woc-benchmarking/apps/hpc/stream
./stream
EOF

sbatch stream-job.sh
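
Monitor the job and, once it completes, pull the Triad rate from the output file (slurm-<jobid>.out by default):

# Watch the queue
squeue -u $USER

# Extract the result after the job finishes
grep "^Triad:" slurm-*.out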

Automating with the Azure benchmarking scripts

The Azure woc-benchmarking repository includes automation scripts:

cd ~/benchmarks/woc-benchmarking/apps/hpc/stream

# View available scripts
ls -la

# Run automated benchmark (if available)
./run_stream.sh