
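Run your first benchmark with STREAM

STREAM measures sustainable memory bandwidth, which is critical for memory-bound workloads such as computational fluid dynamics (CFD), finite element analysis, and data analytics. STREAM is a simple synthetic benchmark that measures memory bandwidth for four vector operations.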

Operation	Description	Formula
Copy	Measures transfer rate	a(i) = b(i)
Scale	Adds simple arithmetic	a(i) = q × b(i)
Add	Multiple load/store operations	a(i) = b(i) + c(i)
Triad	Most representative	a(i) = b(i) + q × c(i)

The Triad result is the standard metric for comparing memory bandwidth across systems.
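
As a rough illustration of how these rates are derived: STREAM counts 8 bytes per double-precision element, so Copy and Scale move 16 bytes per loop iteration (one read, one write), while Add and Triad move 24 bytes (two reads, one write). The sketch below applies that accounting. The helper name `stream_rate_mbs` and the sample numbers are hypothetical and not part of STREAM.

```bash
# Sketch: reproduce STREAM's bandwidth arithmetic for one kernel.
# stream_rate_mbs is a hypothetical helper, not part of STREAM itself.
#   $1 = number of array elements
#   $2 = arrays touched by the kernel (2 for Copy/Scale, 3 for Add/Triad)
#   $3 = elapsed time for one pass, in seconds
stream_rate_mbs() {
    awk -v n="$1" -v arrays="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", (n * 8 * arrays) / t / 1e6 }'
}

# Example: 1,000,000 elements, Triad (3 arrays), 0.024 s per pass
stream_rate_mbs 1000000 3 0.024
```

With these illustrative inputs the function prints 1000.0, that is, 24 MB moved in 0.024 s.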

Time to complete: 15-20 minutes

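Prerequisites

Azure HPC VM (HBv3, HBv4, HBv5, or HX-series recommended)
SSH access to the VM
Root or sudo privileges

Tip

For best results, use an Azure HPC marketplace image (AlmaLinux-HPC or Ubuntu-HPC), which includes optimized compilers and libraries.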

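Expected results by VM series

Use these values to validate your results:

VM series	STREAM Triad (GB/s)	Notes
HBv5 (with HBM)	~7,000	HBM memory
HBv4	~650-780	DDR5 memory
HBv3	~330-350	DDR4 memory
HBv2	~260	DDR4 memory

If your result is significantly lower (more than 10% below the values above), check your configuration.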

Step 1: Connect to the VM

Connect to the HPC VM over SSH:

ssh azureuser@<vm-public-ip>

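Alternatively, if you're using a cluster, connect through the Slurm login node.

Step 2: Install dependencies

Option A: Use an Azure HPC image (recommended)

Azure HPC images include the necessary compilers. Verify that GCC is available: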

gcc --version

Option B: Manual installation

If you're using a standard image, install the build tools:

# AlmaLinux/RHEL
sudo dnf groupinstall "Development Tools" -y

# Ubuntu
sudo apt update && sudo apt install build-essential -y
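
If the same setup script has to work on both distributions, the choice between the two commands can be automated. The sketch below reads the `ID` field of an os-release file and prints the matching command without running it. The helper name `pick_install_cmd` is hypothetical, and only the two distribution families named in this guide are handled.

```bash
# Sketch: pick the build-tools install command from an os-release file.
# pick_install_cmd is a hypothetical helper; it only prints the command.
#   $1 = path to an os-release file (defaults to /etc/os-release)
pick_install_cmd() {
    os_release="${1:-/etc/os-release}"
    # Read the ID field and strip optional quotes (e.g. ID="almalinux").
    id=$(sed -n 's/^ID=//p' "$os_release" | tr -d '"')
    case "$id" in
        almalinux|rhel)
            echo 'sudo dnf groupinstall "Development Tools" -y' ;;
        ubuntu)
            echo 'sudo apt update && sudo apt install build-essential -y' ;;
        *)
            echo "unsupported distribution: $id" >&2
            return 1 ;;
    esac
}

# Print the command for this machine, or a fallback hint.
pick_install_cmd || echo "Install GCC and make manually"
```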

Step 3: Download and compile STREAM

Clone the Azure benchmarking repository, which contains optimized STREAM configurations:

# Create working directory
mkdir -p ~/benchmarks && cd ~/benchmarks

# Clone Azure benchmarking repository
git clone https://github.com/Azure/woc-benchmarking.git
cd woc-benchmarking/apps/hpc/stream

Alternatively, download STREAM directly:

mkdir -p ~/benchmarks/stream && cd ~/benchmarks/stream
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c

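Compile with optimizations for the AMD EPYC processors used in the HB-series:

gcc -O3 -march=znver3 -mcmodel=medium -fopenmp \
    -DSTREAM_ARRAY_SIZE=800000000 -DNTIMES=20 stream.c -o stream

Compiler flags explained:

Flag	Purpose
`-O3`	Maximum optimization level
`-march=znver3`	Optimize for the AMD Zen 3/4 architecture
`-mcmodel=medium`	Allow static arrays larger than 2 GB, which this array size requires on x86-64
`-fopenmp`	Enable OpenMP for multithreading
`-DSTREAM_ARRAY_SIZE=800000000`	Array size (about 6 GiB per array, about 18 GiB total)
`-DNTIMES=20`	Number of iterations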

Important

The array size must be large enough that the data can't fit entirely in cache. For HBv4/HBv5 with 1.5 GB of L3 cache, use at least 800M elements.
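
The sizing arithmetic can be sketched as follows, assuming 8-byte elements and the three arrays STREAM allocates. The second helper applies the rule from the troubleshooting section of this guide (total memory of at least 4× the L3 cache size). Both helper names are hypothetical, and reading 1.5 GB as 1536 MiB is an assumption.

```bash
# Sketch: estimate STREAM memory use and a minimum array size.
# Both helper names are hypothetical; 8 bytes per element and 3 arrays
# match what STREAM reports for double-precision data.

# Print MiB per array and total MiB for a given STREAM_ARRAY_SIZE.
stream_memory_mib() {
    awk -v n="$1" 'BEGIN {
        per_array = n * 8 / 1048576
        printf "per-array=%.1f MiB total=%.1f MiB\n", per_array, 3 * per_array
    }'
}

# Print the smallest element count whose three arrays together occupy
# at least 4x the given L3 cache size (in MiB).
min_stream_elements() {
    awk -v l3="$1" 'BEGIN { printf "%d\n", (4 * l3 * 1048576 + 23) / 24 }'
}

stream_memory_mib 800000000
min_stream_elements 1536   # 1.5 GB of L3, read as 1536 MiB (an assumption)
```

For 800,000,000 elements the first call reports 6103.5 MiB per array and 18310.5 MiB in total, so the recommended size sits well above the minimum the second call computes.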

Step 4: Configure thread affinity

Correct thread pinning is critical for accurate results. Set the OpenMP environment variables:

# Get number of physical cores
NCORES=$(lscpu | grep "^Core(s) per socket:" | awk '{print $4}')
NSOCKETS=$(lscpu | grep "^Socket(s):" | awk '{print $2}')
TOTAL_CORES=$((NCORES * NSOCKETS))

echo "Total physical cores: $TOTAL_CORES"

# Set OpenMP configuration
export OMP_NUM_THREADS=$TOTAL_CORES
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

For HBv4 (176 cores):

export OMP_NUM_THREADS=176
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

For HBv5 (standard configuration):

export OMP_NUM_THREADS=176
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

Step 5: Run the benchmark

Run STREAM:

./stream

Sample output (HBv4):

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 800000000 (elements), Offset = 0 (elements)
Memory per array = 6103.5 MiB (= 5.96 GiB).
Total memory required = 18310.5 MiB (= 17.88 GiB).
Each kernel will be executed 20 times.
-------------------------------------------------------------
Number of Threads requested = 176
Number of Threads counted = 176
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          753284.2     0.017157     0.016966     0.018884
Scale:         707935.3     0.018260     0.018045     0.019629
Add:           756972.9     0.025508     0.025318     0.027311
Triad:         757820.9     0.025464     0.025290     0.027212
-------------------------------------------------------------

The key result is the Triad best rate: 757,820.9 MB/s, or about 758 GB/s (STREAM reports decimal megabytes, 1 MB = 10^6 bytes).
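
To avoid copying the number by hand, the Triad rate can be pulled out of the output with a short filter. This is a minimal sketch that assumes the output layout shown in the sample above. The helper name `triad_gbs` is hypothetical.

```bash
# Sketch: extract the Triad "Best Rate" from STREAM output and print it in GB/s.
# triad_gbs is a hypothetical helper; it reads STREAM output on stdin.
# STREAM reports MB/s with 1 MB = 10^6 bytes, so dividing by 1000 gives GB/s.
triad_gbs() {
    awk '/^Triad:/ { printf "%.1f\n", $2 / 1000 }'
}

# Example with the Triad line from the sample output
echo "Triad:         757820.9     0.025464     0.025290     0.027212" | triad_gbs
```

The example prints 757.8. On a live system the same filter can be used as `./stream | triad_gbs`.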

Step 6: Validate results

Compare your Triad result with the expected value:

# Quick validation script
TRIAD_RESULT=757820  # Replace with your result in MB/s
VM_TYPE="HBv4"       # HBv2, HBv3, HBv4, or HBv5

case $VM_TYPE in
    "HBv5") EXPECTED=7000000 ;;
    "HBv4") EXPECTED=700000 ;;
    "HBv3") EXPECTED=330000 ;;
    "HBv2") EXPECTED=260000 ;;
esac

PERCENT=$(echo "scale=1; $TRIAD_RESULT * 100 / $EXPECTED" | bc)
echo "Achieved $PERCENT% of expected bandwidth"

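Interpreting results:

Achieved	Interpretation
95-105%	Excellent - VM is performing as expected
85-95%	Good - minor optimization possible
70-85%	Investigate - check thread affinity and NUMA
<70%	Problem - check configuration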
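
The bands in the table can be applied programmatically, for example to the percentage printed by the validation script. The sketch below is one possible mapping. The helper name `classify_result` is hypothetical, boundary values are assigned to the higher band, and values above 105% are reported separately because the table does not cover them.

```bash
# Sketch: map a percentage of expected bandwidth to the bands in the table.
# classify_result is a hypothetical helper. Boundary values (95, 85, 70)
# go to the higher band; anything over 105 is outside the table.
classify_result() {
    awk -v p="$1" 'BEGIN {
        if      (p > 105) print "Above range"
        else if (p >= 95) print "Excellent"
        else if (p >= 85) print "Good"
        else if (p >= 70) print "Investigate"
        else              print "Problem"
    }'
}

# Example with an illustrative value
classify_result 92.3
```

The example prints Good. A result well above the expected range may indicate that the arrays are too small and the benchmark is measuring cache.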

Step 7: Run across multiple NUMA domains (advanced)

For detailed NUMA analysis, run STREAM on each NUMA domain:

# Check NUMA topology
numactl --hardware

# Run on NUMA node 0 only
# (environment variables must precede numactl; otherwise numactl tries to execute them)
OMP_NUM_THREADS=22 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --cpunodebind=0 --membind=0 ./stream

# Run on all NUMA domains (default full-node run)
OMP_NUM_THREADS=176 OMP_PROC_BIND=spread OMP_PLACES=cores \
    numactl --interleave=all ./stream
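
To cover every NUMA domain rather than node 0 alone, the per-node command can be generated in a loop. The sketch below is a dry run that only prints the commands. The helper name `numa_stream_cmds` is hypothetical, and the node and core counts in the example are illustrative; take the real values from `numactl --hardware`.

```bash
# Sketch: print one STREAM command per NUMA node (dry run, nothing is executed).
# numa_stream_cmds is a hypothetical helper.
#   $1 = number of NUMA nodes (see "available: N nodes" in numactl --hardware)
#   $2 = physical cores per NUMA node
numa_stream_cmds() {
    node=0
    while [ "$node" -lt "$1" ]; do
        echo "OMP_NUM_THREADS=$2 OMP_PROC_BIND=spread OMP_PLACES=cores numactl --cpunodebind=$node --membind=$node ./stream"
        node=$((node + 1))
    done
}

# Example: 2 NUMA nodes with 22 cores each (illustrative values only)
numa_stream_cmds 2 22
```

Piping the output to a shell (`numa_stream_cmds 2 22 | sh`) would run the nodes one after another, which makes a single slow domain easy to spot.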

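Troubleshooting

Low bandwidth results

Symptom: Results are significantly lower than the expected values.

Solutions:

Check the thread count: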

echo $OMP_NUM_THREADS
# Should match physical core count

Verify thread binding:

export OMP_DISPLAY_ENV=TRUE
./stream 2>&1 | head -20

Check CPU frequency scaling:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be "performance" for benchmarking

Verify the NUMA memory policy:
```
numactl --show
```
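
The first check in this list, thread count against physical core count, can be scripted so it runs before every benchmark. This is a minimal sketch. The helper name `check_thread_count` is hypothetical, and the example values are illustrative.

```bash
# Sketch: compare the OpenMP thread count with the physical core count.
# check_thread_count is a hypothetical helper.
#   $1 = thread count (for example "$OMP_NUM_THREADS")
#   $2 = physical core count (for example "$TOTAL_CORES" from step 4)
check_thread_count() {
    if [ -z "$1" ]; then
        echo "WARN: OMP_NUM_THREADS is not set"
        return 1
    elif [ "$1" -ne "$2" ]; then
        echo "WARN: $1 threads requested, $2 physical cores available"
        return 1
    fi
    echo "OK: $1 threads match $2 physical cores"
}

# Example with illustrative values
check_thread_count 176 176
```

In practice the call would be `check_thread_count "$OMP_NUM_THREADS" "$TOTAL_CORES" && ./stream`, so the benchmark only starts when the two counts agree.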

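Array size too small

Symptom: Results are higher than expected (the benchmark is measuring cache, not memory).

Solution: Increase STREAM_ARRAY_SIZE at compile time. Total memory used should be at least 4× the L3 cache size.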

# Recompile with a larger array (-mcmodel=medium permits static arrays over 2 GB)
gcc -O3 -march=znver3 -mcmodel=medium -fopenmp \
    -DSTREAM_ARRAY_SIZE=1000000000 -DNTIMES=20 stream.c -o stream

Inconsistent results

Symptom: Large variance between runs.