Measuring memory bandwidth

Hey guys,

Do we have any tool like Intel PCM to measure memory bandwidth? I’m digging into offloading large language model (LLM) state to host memory.

Thank you.

There are open-source bandwidth tests, like STREAM, but they won’t show bandwidth in that much detail.

I use STREAM for aggregate bandwidth measurements…

wget -r --no-parent Index of /stream/FTP/Code

I use the arm-cmn watchpoint PMU events to monitor the bandwidth of each memory channel. This requires a kernel with the arm-cmn driver, and the command differs quite a bit between CPUs.

Can you share a little more, like which command you used? I’d really appreciate it.

What’s your platform? Altra or AmpereOne?
This is the command I used for Ampere Altra:
perf stat -C 0 --event arm_cmn_0/watchpoint_up,wp_dev_sel=0,wp_chn_sel=3,wp_grp=0,wp_val=0xffffffffffffffff,wp_mask=0xffffffffffffffff,bynodeid=1,nodeid=16/ --event arm_cmn/watchpoint_down,wp_dev_sel=0,wp_chn_sel=3,wp_grp=0,wp_val=0xffffffffffffffff,wp_mask=0xffffffffffffffff,bynodeid=1,nodeid=16/ -I 1000

The up counter is for memory reads, and the down counter is for memory writes. The formula for calculating the total memory bandwidth across all 8 channels:

read bandwidth(GB/s) = (watchpoint_up counter) * 32 * 8 /1000000000
write bandwidth(GB/s) = (watchpoint_down counter) * 32 * 8 /1000000000
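
For anyone scripting this, here is a minimal Python sketch of the same arithmetic. The function name and parameters are hypothetical; the factors 32 and 8 are taken directly from the formulas above, and the `interval_s` argument is my assumption that each printed count covers one `-I 1000` interval (1 second). It also assumes, as the formula does, that the single monitored node is representative of all 8 channels.

```python
def cmn_bandwidth_gbs(count, interval_s=1.0, bytes_per_count=32, channels=8):
    """Convert one watchpoint counter reading into GB/s.

    count           -- watchpoint_up (reads) or watchpoint_down (writes) value
    interval_s      -- length of the perf interval in seconds (-I 1000 -> 1.0)
    bytes_per_count -- the factor 32 from the formula in the post
    channels        -- the factor 8 from the formula (8 memory channels)
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return count * bytes_per_count * channels / interval_s / 1e9


if __name__ == "__main__":
    # Illustrative counter value only, not a measured result.
    sample_count = 250_000_000
    print(f"{cmn_bandwidth_gbs(sample_count):.1f} GB/s")
```

If you change the `-I` interval, pass the matching `interval_s` so the result stays in GB per second.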

There is a tool called mbw, which does exactly that using three methods: memcpy, dumb, and block copies. It’s in Debian under the same name, and I’d guess it is in other distros as well.
For reference, this is what my M128-30 gives on trixie:

# mbw 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MEMCPY  Elapsed: 0.10263        MiB: 1024.00000 Copy: 9977.201 MiB/s
1       Method: MEMCPY  Elapsed: 0.10275        MiB: 1024.00000 Copy: 9965.452 MiB/s
2       Method: MEMCPY  Elapsed: 0.10285        MiB: 1024.00000 Copy: 9956.247 MiB/s
3       Method: MEMCPY  Elapsed: 0.10263        MiB: 1024.00000 Copy: 9977.784 MiB/s
4       Method: MEMCPY  Elapsed: 0.10259        MiB: 1024.00000 Copy: 9981.188 MiB/s
5       Method: MEMCPY  Elapsed: 0.10261        MiB: 1024.00000 Copy: 9979.923 MiB/s
6       Method: MEMCPY  Elapsed: 0.10258        MiB: 1024.00000 Copy: 9982.550 MiB/s
7       Method: MEMCPY  Elapsed: 0.10259        MiB: 1024.00000 Copy: 9981.577 MiB/s
8       Method: MEMCPY  Elapsed: 0.10252        MiB: 1024.00000 Copy: 9988.295 MiB/s
9       Method: MEMCPY  Elapsed: 0.10286        MiB: 1024.00000 Copy: 9955.473 MiB/s
AVG     Method: MEMCPY  Elapsed: 0.10266        MiB: 1024.00000 Copy: 9974.557 MiB/s
0       Method: DUMB    Elapsed: 0.08828        MiB: 1024.00000 Copy: 11599.062 MiB/s
1       Method: DUMB    Elapsed: 0.08827        MiB: 1024.00000 Copy: 11600.245 MiB/s
2       Method: DUMB    Elapsed: 0.08821        MiB: 1024.00000 Copy: 11608.530 MiB/s
3       Method: DUMB    Elapsed: 0.08831        MiB: 1024.00000 Copy: 11595.516 MiB/s
4       Method: DUMB    Elapsed: 0.08831        MiB: 1024.00000 Copy: 11596.041 MiB/s
5       Method: DUMB    Elapsed: 0.08823        MiB: 1024.00000 Copy: 11605.898 MiB/s
6       Method: DUMB    Elapsed: 0.08821        MiB: 1024.00000 Copy: 11608.530 MiB/s
7       Method: DUMB    Elapsed: 0.08815        MiB: 1024.00000 Copy: 11617.090 MiB/s
8       Method: DUMB    Elapsed: 0.08826        MiB: 1024.00000 Copy: 11602.085 MiB/s
9       Method: DUMB    Elapsed: 0.08820        MiB: 1024.00000 Copy: 11609.714 MiB/s
AVG     Method: DUMB    Elapsed: 0.08824        MiB: 1024.00000 Copy: 11604.267 MiB/s
0       Method: MCBLOCK Elapsed: 0.02533        MiB: 1024.00000 Copy: 40418.394 MiB/s
1       Method: MCBLOCK Elapsed: 0.02530        MiB: 1024.00000 Copy: 40475.908 MiB/s
2       Method: MCBLOCK Elapsed: 0.02534        MiB: 1024.00000 Copy: 40404.040 MiB/s
3       Method: MCBLOCK Elapsed: 0.02530        MiB: 1024.00000 Copy: 40477.508 MiB/s
4       Method: MCBLOCK Elapsed: 0.02529        MiB: 1024.00000 Copy: 40487.111 MiB/s
5       Method: MCBLOCK Elapsed: 0.02529        MiB: 1024.00000 Copy: 40493.515 MiB/s
6       Method: MCBLOCK Elapsed: 0.02529        MiB: 1024.00000 Copy: 40495.116 MiB/s
7       Method: MCBLOCK Elapsed: 0.02532        MiB: 1024.00000 Copy: 40440.741 MiB/s
8       Method: MCBLOCK Elapsed: 0.02533        MiB: 1024.00000 Copy: 40431.160 MiB/s
9       Method: MCBLOCK Elapsed: 0.02529        MiB: 1024.00000 Copy: 40485.510 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.02531        MiB: 1024.00000 Copy: 40460.875 MiB/s

Edit: now I see that you wanted detailed per-channel memory bandwidth results. No, mbw does not seem to do that, sorry.
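
If you want to compare the mbw averages with the GB/s figures discussed earlier in the thread, a small Python sketch like the following could pull out the AVG lines and convert MiB/s to GB/s. The function name and regex are my own and assume the output format shown above; other mbw versions may print something different.

```python
import re

# Matches lines such as:
# AVG     Method: MEMCPY  Elapsed: 0.10266        MiB: 1024.00000 Copy: 9974.557 MiB/s
AVG_LINE = re.compile(
    r"^AVG\s+Method:\s+(\S+)\s+Elapsed:\s+[\d.]+\s+MiB:\s+[\d.]+\s+Copy:\s+([\d.]+)\s+MiB/s"
)


def mbw_averages_gbs(output):
    """Return {method: average copy rate in GB/s} from mbw output text."""
    results = {}
    for line in output.splitlines():
        match = AVG_LINE.match(line.strip())
        if match:
            mib_per_s = float(match.group(2))
            # 1 MiB = 1024 * 1024 bytes; 1 GB = 1e9 bytes
            results[match.group(1)] = mib_per_s * 1024 * 1024 / 1e9
    return results


if __name__ == "__main__":
    sample = """\
9       Method: MEMCPY  Elapsed: 0.10286        MiB: 1024.00000 Copy: 9955.473 MiB/s
AVG     Method: MEMCPY  Elapsed: 0.10266        MiB: 1024.00000 Copy: 9974.557 MiB/s
AVG     Method: DUMB    Elapsed: 0.08824        MiB: 1024.00000 Copy: 11604.267 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.02531        MiB: 1024.00000 Copy: 40460.875 MiB/s
"""
    for method, gbs in mbw_averages_gbs(sample).items():
        print(f"{method}: {gbs:.2f} GB/s")
```

To use it on a real run, something like `mbw 1024 > mbw.txt` followed by passing the file contents to `mbw_averages_gbs` should work.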

Came here to share STREAM, but many folks have already chimed in ;)

In case it helps, it is hosted by the UVA CS department: Index of /stream/FTP/Code