STREAM measures sustainable memory bandwidth, which is critical for memory-bound workloads like computational fluid dynamics (CFD), finite element analysis, and data analytics. STREAM is a simple, synthetic benchmark that measures memory bandwidth for four vector operations:
| Operation | Description | Formula |
|---|---|---|
| Copy | Pure data movement, no arithmetic | a(i) = b(i) |
| Scale | Adds a simple multiply | a(i) = q × b(i) |
| Add | Two loads and one store per element | a(i) = b(i) + c(i) |
| Triad | Combines all three; most representative of real workloads | a(i) = b(i) + q × c(i) |
The Triad result is the standard metric for comparing memory bandwidth across systems.
Time to complete: 15-20 minutes
Prerequisites
- An Azure HPC VM (HBv3, HBv4, HBv5, or HX-series recommended)
- SSH access to the VM
- Root or sudo privileges
Tip
For best results, use Azure HPC marketplace images (AlmaLinux-HPC or Ubuntu-HPC) which include optimized compilers and libraries.
Expected results by VM family
Use these values to validate your results:
| VM Series | STREAM Triad (GB/s) | Notes |
|---|---|---|
| HBv5 | ~7,000 | HBM (high-bandwidth memory) |
| HBv4 | ~650-780 | DDR5 memory |
| HBv3 | ~330-350 | DDR4 memory |
| HBv2 | ~260 | DDR4 memory |
If your results are more than 10% below these values, check your configuration (see the Troubleshooting section).
Step 1: Connect to your VM
Connect via SSH to your HPC VM:
ssh azureuser@<vm-public-ip>
Or connect through your Slurm login node if using a cluster.
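If you do go through a login node, OpenSSH's ProxyJump option (-J) reaches the compute node in one command; the hostnames below are placeholders:
ssh -J azureuser@<login-node-ip> azureuser@<compute-node>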
Step 2: Install dependencies
Option A: Using Azure HPC images (recommended)
Azure HPC images include the necessary compilers. Verify GCC is available:
gcc --version
Option B: Manual installation
If using a standard image, install build tools:
# AlmaLinux/RHEL
sudo dnf groupinstall "Development Tools" -y
# Ubuntu
sudo apt update && sudo apt install build-essential -y
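Whichever option you used, a quick smoke test (a minimal sketch) confirms the toolchain can compile OpenMP code with the Zen target flag before you attempt the full build:
# Compile and run a trivial OpenMP program; failure indicates a missing or outdated toolchain
cat > /tmp/omp_check.c << 'EOF'
#include <omp.h>
int main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }
EOF
gcc -O3 -march=znver3 -fopenmp /tmp/omp_check.c -o /tmp/omp_check && /tmp/omp_check && echo "Compiler OK"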
Step 3: Download and compile STREAM
Clone the Azure benchmarking repository, which includes optimized STREAM configurations:
# Create working directory
mkdir -p ~/benchmarks && cd ~/benchmarks
# Clone Azure benchmarking repository
git clone https://github.com/Azure/woc-benchmarking.git
cd woc-benchmarking/apps/hpc/stream
Alternatively, download STREAM directly:
mkdir -p ~/benchmarks/stream && cd ~/benchmarks/stream
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
Compile with optimizations for AMD EPYC processors (used in HB-series):
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
-DNTIMES=20 stream.c -o stream
Compiler flags explained:
| Flag | Purpose |
|---|---|
| -O3 | Maximum optimization level |
| -march=znver3 | Optimize for AMD Zen 3 (HBv3); on Zen 4-based HBv4, use -march=znver4 if your GCC supports it |
| -fopenmp | Enable OpenMP for multi-threading |
| -DSTREAM_ARRAY_SIZE=800000000 | Array size (~6 GiB per array, ~18 GiB total) |
| -DNTIMES=20 | Number of iterations per kernel |
Important
The array size must be large enough that the data can't fit in cache. The L3 cache is 1.5 GB on HBv3 and 2.3 GB on HBv4, so use at least 800M elements.
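As a sanity check on that sizing rule, the arithmetic can be scripted; the numbers below assume HBv4's 2.3 GB aggregate L3, so substitute your VM's cache size:
# Rule of thumb from this guide: total STREAM memory (3 arrays of 8-byte doubles) >= 4x L3
L3_BYTES=$((2304 * 1024 * 1024))       # 2.3 GB aggregate L3 (HBv4; adjust for your VM)
MIN_ELEMS=$((4 * L3_BYTES / (3 * 8)))  # minimum STREAM_ARRAY_SIZE
echo "Minimum elements: $MIN_ELEMS"    # ~403M here, so 800M is comfortably above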
Step 4: Configure thread affinity
Proper thread pinning is critical for accurate results. Set OpenMP environment variables:
# Get number of physical cores
NCORES=$(lscpu | grep "^Core(s) per socket:" | awk '{print $4}')
NSOCKETS=$(lscpu | grep "^Socket(s):" | awk '{print $2}')
TOTAL_CORES=$((NCORES * NSOCKETS))
echo "Total physical cores: $TOTAL_CORES"
# Set OpenMP configuration
export OMP_NUM_THREADS=$TOTAL_CORES
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
For example, on HBv4 (176 physical cores):
export OMP_NUM_THREADS=176
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
For other sizes, including HBv5, set OMP_NUM_THREADS to the physical core count that lscpu reports.
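To confirm the pinning takes effect, OpenMP's affinity display (OMP_DISPLAY_AFFINITY, supported by recent GCC releases) prints each thread's binding at startup:
# Show effective OpenMP settings and per-thread bindings; discard the rest of the run
OMP_DISPLAY_ENV=TRUE OMP_DISPLAY_AFFINITY=TRUE ./stream 2>&1 | head -30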
Step 5: Run the benchmark
Execute STREAM:
./stream
Sample output (HBv4):
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 800000000 (elements), Offset = 0 (elements)
Memory per array = 6103.5 MiB (= 5.96 GiB).
Total memory required = 18310.5 MiB (= 17.88 GiB).
Each kernel will be executed 20 times.
-------------------------------------------------------------
Number of Threads requested = 176
Number of Threads counted = 176
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 753284.2 0.017157 0.016966 0.018884
Scale: 707935.3 0.018260 0.018045 0.019629
Add: 756972.9 0.025508 0.025318 0.027311
Triad: 757820.9 0.025464 0.025290 0.027212
-------------------------------------------------------------
The Triad Best Rate (757,820.9 MB/s ≈ 758 GB/s) is the key result, within the expected range for HBv4.
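For scripting, you can save a run and pull out the Triad rate (output format as in the STREAM 5.10 sample above):
# Save the run, then extract the Triad Best Rate in MB/s
./stream | tee stream.out
grep '^Triad' stream.out | awk '{print $2}'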
Step 6: Validate results
Compare your Triad result against expected values:
# Quick validation script
TRIAD_RESULT=757820 # Replace with your result in MB/s
VM_TYPE="HBv4" # HBv2, HBv3, HBv4, or HBv5
case $VM_TYPE in
"HBv5") EXPECTED=7000000 ;;
"HBv4") EXPECTED=700000 ;;
"HBv3") EXPECTED=330000 ;;
"HBv2") EXPECTED=260000 ;;
esac
PERCENT=$(echo "scale=1; $TRIAD_RESULT * 100 / $EXPECTED" | bc)
echo "Achieved $PERCENT% of expected bandwidth"
Results interpretation:
| Achievement | Interpretation |
|---|---|
| 95-105% | Excellent - VM performing as expected |
| 85-95% | Good - Minor optimization possible |
| 70-85% | Investigate - Check thread affinity, NUMA |
| <70% | Problem - Check configuration |
Step 7: Run on multiple NUMA domains (advanced)
For detailed NUMA analysis, run STREAM per NUMA domain:
# Check NUMA topology
numactl --hardware
# Run on NUMA node 0 only (22 threads is illustrative; use the cores-per-node count from numactl --hardware)
OMP_NUM_THREADS=22 OMP_PROC_BIND=spread OMP_PLACES=cores \
numactl --cpunodebind=0 --membind=0 ./stream
# Run across all NUMA domains with memory interleaved between them
OMP_NUM_THREADS=176 OMP_PROC_BIND=spread OMP_PLACES=cores \
numactl --interleave=all ./stream
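To survey every domain in one pass, a small loop (a sketch; set the per-node thread count from your numactl --hardware output) can run the per-node measurement and report each Triad rate:
# Run STREAM bound to each NUMA node in turn and print its Triad rate
NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
CORES_PER_NODE=22   # illustrative; set to the cores per NUMA node on your VM
for n in $(seq 0 $((NODES - 1))); do
  rate=$(OMP_NUM_THREADS=$CORES_PER_NODE OMP_PROC_BIND=spread OMP_PLACES=cores \
         numactl --cpunodebind=$n --membind=$n ./stream | grep '^Triad' | awk '{print $2}')
  echo "node $n: $rate MB/s"
done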
Troubleshooting
Low bandwidth results
Symptom: Results significantly below expected values
Solutions:
1. Check thread count:
echo $OMP_NUM_THREADS   # Should match the physical core count
2. Verify thread binding:
OMP_DISPLAY_ENV=TRUE ./stream 2>&1 | head -20
3. Check for CPU frequency scaling:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # Should be "performance" for benchmarking
4. Verify NUMA memory policy:
numactl --show
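If the governor is not "performance" and your image exposes cpufreq (many Azure VM sizes do not), you can switch it; this is a sketch and requires root:
# Set the performance governor on every CPU that exposes cpufreq
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance | sudo tee "$g" > /dev/null
done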
Array size too small
Symptom: Results higher than expected (measuring cache, not memory)
Solution: Increase STREAM_ARRAY_SIZE at compile time. Total memory used should be at least 4× the L3 cache size.
# Recompile with larger array
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=1000000000 \
-DNTIMES=20 stream.c -o stream
Inconsistent results
Symptom: Large variation between runs
Solutions:
1. Ensure no other processes are running:
top -b -n 1 | head -20
2. Run more iterations:
# Recompile with more iterations
gcc -O3 -march=znver3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 \
    -DNTIMES=50 stream.c -o stream
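You can also quantify the run-to-run spread directly; this loop (a sketch) repeats the benchmark five times and prints each Triad rate:
# Five back-to-back runs; eyeball the spread in the Triad column
for i in 1 2 3 4 5; do
  ./stream | grep '^Triad' | awk -v run="$i" '{print "run " run ": " $2 " MB/s"}'
done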
Running STREAM in a Slurm job
If using a Slurm cluster, create a job script:
cat << 'EOF' > stream-job.sh
#!/bin/bash
#SBATCH --job-name=stream
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=176
#SBATCH --time=00:10:00
#SBATCH --partition=hpc
#SBATCH --exclusive
# Set thread configuration
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
# Run STREAM
cd ~/benchmarks/woc-benchmarking/apps/hpc/stream
./stream
EOF
sbatch stream-job.sh
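After submitting, monitor the job and read the result from the Slurm output file (the default slurm-<jobid>.out name assumes you didn't set --output):
# Watch the queue, then inspect the output once the job completes
squeue -u $USER
grep '^Triad' slurm-*.out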
Automating with the Azure benchmarking scripts
The Azure woc-benchmarking repository includes automation scripts:
cd ~/benchmarks/woc-benchmarking/apps/hpc/stream
# View available scripts
ls -la
# Run automated benchmark (if available)
./run_stream.sh