Hey guys,
Do we have any tool like Intel PCM to measure memory bandwidth? I’m digging into offloading large language model (LLM) state to host memory.
Thank you.
There are open-source bandwidth tests, like STREAM, but they won’t show bandwidth in such a detailed manner.
I use STREAM for aggregate bandwidth measurements…
wget -r --no-parent Index of /stream/FTP/Code
I use the arm-cmn watchpoint PMU events to monitor the memory bandwidth of each memory channel. This requires a kernel with the arm-cmn driver installed, and the command is quite different for different CPUs.
Can you share a little bit more, like which command you used? I’d really appreciate it.
What’s your platform? Altra or AmpereOne?
This is the command I used for Ampere Altra:
perf stat -C 0 --event arm_cmn_0/watchpoint_up,wp_dev_sel=0,wp_chn_sel=3,wp_grp=0,wp_val=0xffffffffffffffff,wp_mask=0xffffffffffffffff,bynodeid=1,nodeid=16/ --event arm_cmn/watchpoint_down,wp_dev_sel=0,wp_chn_sel=3,wp_grp=0,wp_val=0xffffffffffffffff,wp_mask=0xffffffffffffffff,bynodeid=1,nodeid=16/ -I 1000
The up counter is for memory reads, and the down counter is for memory writes. The formula for calculating the memory bandwidth of all 8 channels:
read bandwidth(GB/s) = (watchpoint_up counter) * 32 * 8 /1000000000
write bandwidth(GB/s) = (watchpoint_down counter) * 32 * 8 /1000000000
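For anyone who wants to script this, here is a minimal sketch of the formula above in Python. The factors 32 and 8 are taken straight from the formula; my reading (an assumption, not confirmed in this thread) is that 32 is the bytes counted per watchpoint hit and 8 scales the one monitored node up to all 8 channels, which only holds if traffic is spread evenly. The function name and the `interval_s` parameter are made up for illustration; `interval_s=1.0` matches the `-I 1000` in the perf command.

```python
def bandwidth_gb_per_s(counter, interval_s=1.0, bytes_per_count=32, channels=8):
    """Convert a watchpoint counter reading into GB/s (decimal, 1e9 bytes).

    counter          -- watchpoint_up (reads) or watchpoint_down (writes) count
    interval_s       -- length of the perf interval in seconds (-I 1000 -> 1.0)
    bytes_per_count  -- assumed bytes per counted event, from the formula
    channels         -- number of memory channels the single reading is scaled to
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return counter * bytes_per_count * channels / interval_s / 1_000_000_000


if __name__ == "__main__":
    # Hypothetical counter value, not a real measurement.
    reads = 100_000_000
    print(f"read bandwidth: {bandwidth_gb_per_s(reads):.1f} GB/s")
```

If you run perf with a different interval, pass it in as `interval_s` so the result stays per-second.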
There is a tool called mbw, which does exactly that, using 3 methods: memcpy, dumb, and block copies. It’s in Debian under the same name; I guess it should be in the other distros as well.
For reference, this is what my M128-30 gives on trixie:
# mbw 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0 Method: MEMCPY Elapsed: 0.10263 MiB: 1024.00000 Copy: 9977.201 MiB/s
1 Method: MEMCPY Elapsed: 0.10275 MiB: 1024.00000 Copy: 9965.452 MiB/s
2 Method: MEMCPY Elapsed: 0.10285 MiB: 1024.00000 Copy: 9956.247 MiB/s
3 Method: MEMCPY Elapsed: 0.10263 MiB: 1024.00000 Copy: 9977.784 MiB/s
4 Method: MEMCPY Elapsed: 0.10259 MiB: 1024.00000 Copy: 9981.188 MiB/s
5 Method: MEMCPY Elapsed: 0.10261 MiB: 1024.00000 Copy: 9979.923 MiB/s
6 Method: MEMCPY Elapsed: 0.10258 MiB: 1024.00000 Copy: 9982.550 MiB/s
7 Method: MEMCPY Elapsed: 0.10259 MiB: 1024.00000 Copy: 9981.577 MiB/s
8 Method: MEMCPY Elapsed: 0.10252 MiB: 1024.00000 Copy: 9988.295 MiB/s
9 Method: MEMCPY Elapsed: 0.10286 MiB: 1024.00000 Copy: 9955.473 MiB/s
AVG Method: MEMCPY Elapsed: 0.10266 MiB: 1024.00000 Copy: 9974.557 MiB/s
0 Method: DUMB Elapsed: 0.08828 MiB: 1024.00000 Copy: 11599.062 MiB/s
1 Method: DUMB Elapsed: 0.08827 MiB: 1024.00000 Copy: 11600.245 MiB/s
2 Method: DUMB Elapsed: 0.08821 MiB: 1024.00000 Copy: 11608.530 MiB/s
3 Method: DUMB Elapsed: 0.08831 MiB: 1024.00000 Copy: 11595.516 MiB/s
4 Method: DUMB Elapsed: 0.08831 MiB: 1024.00000 Copy: 11596.041 MiB/s
5 Method: DUMB Elapsed: 0.08823 MiB: 1024.00000 Copy: 11605.898 MiB/s
6 Method: DUMB Elapsed: 0.08821 MiB: 1024.00000 Copy: 11608.530 MiB/s
7 Method: DUMB Elapsed: 0.08815 MiB: 1024.00000 Copy: 11617.090 MiB/s
8 Method: DUMB Elapsed: 0.08826 MiB: 1024.00000 Copy: 11602.085 MiB/s
9 Method: DUMB Elapsed: 0.08820 MiB: 1024.00000 Copy: 11609.714 MiB/s
AVG Method: DUMB Elapsed: 0.08824 MiB: 1024.00000 Copy: 11604.267 MiB/s
0 Method: MCBLOCK Elapsed: 0.02533 MiB: 1024.00000 Copy: 40418.394 MiB/s
1 Method: MCBLOCK Elapsed: 0.02530 MiB: 1024.00000 Copy: 40475.908 MiB/s
2 Method: MCBLOCK Elapsed: 0.02534 MiB: 1024.00000 Copy: 40404.040 MiB/s
3 Method: MCBLOCK Elapsed: 0.02530 MiB: 1024.00000 Copy: 40477.508 MiB/s
4 Method: MCBLOCK Elapsed: 0.02529 MiB: 1024.00000 Copy: 40487.111 MiB/s
5 Method: MCBLOCK Elapsed: 0.02529 MiB: 1024.00000 Copy: 40493.515 MiB/s
6 Method: MCBLOCK Elapsed: 0.02529 MiB: 1024.00000 Copy: 40495.116 MiB/s
7 Method: MCBLOCK Elapsed: 0.02532 MiB: 1024.00000 Copy: 40440.741 MiB/s
8 Method: MCBLOCK Elapsed: 0.02533 MiB: 1024.00000 Copy: 40431.160 MiB/s
9 Method: MCBLOCK Elapsed: 0.02529 MiB: 1024.00000 Copy: 40485.510 MiB/s
AVG Method: MCBLOCK Elapsed: 0.02531 MiB: 1024.00000 Copy: 40460.875 MiB/s
Edit: now I see that you wanted detailed per-channel memory bandwidth results. No, mbw does not seem to do that, sorry.
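If you want to compare mbw numbers with figures reported in GB/s, a rough sketch like the following can summarize the output. It assumes the line layout shown in the run above (`N Method: NAME Elapsed: ... MiB: ... Copy: ... MiB/s`); the function names are hypothetical, and other mbw versions may print a different format.

```python
import re
from collections import defaultdict

# Matches the per-run lines only; the "AVG" lines are skipped on purpose
# because they do not start with a run number.
RUN_LINE = re.compile(
    r"^\d+\s+Method:\s+(\w+)\s+Elapsed:\s+[\d.]+\s+MiB:\s+[\d.]+\s+Copy:\s+([\d.]+)\s+MiB/s"
)


def summarize_mbw(text):
    """Return {method: mean copy rate in MiB/s} from mbw output text."""
    rates = defaultdict(list)
    for line in text.splitlines():
        match = RUN_LINE.match(line.strip())
        if match:
            rates[match.group(1)].append(float(match.group(2)))
    return {method: sum(vals) / len(vals) for method, vals in rates.items()}


def mib_to_gb_per_s(mib_per_s):
    """Convert MiB/s (2**20 bytes) to decimal GB/s (1e9 bytes)."""
    return mib_per_s * 1024 * 1024 / 1_000_000_000


if __name__ == "__main__":
    import sys

    # Usage: mbw 1024 | python3 this_script.py
    for method, avg in summarize_mbw(sys.stdin.read()).items():
        print(f"{method}: {avg:.3f} MiB/s = {mib_to_gb_per_s(avg):.2f} GB/s")
```

Keep in mind that mbw measures a single-threaded copy, so it is not directly comparable to a whole-system counter reading.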
Came here to share STREAM, but many folks have already chimed in.
In case it helps, it is from the UVA CS department: Index of /stream/FTP/Code