I’m finally testing the Tenstorrent Blackhole on an AmpereOne, and one annoying problem is that memcpy is utterly slow:
This snippet on the AmpereOne produces:
stnp (no prefetch) 9.78 GB/s
stnp + pf@512B 10.33 GB/s
stnp + pf@2KB (current) 10.84 GB/s
libc memcpy 9.62 GB/s
On x86 it is easily 3x faster.
Can you spot any glaring mistake?
It could be that the difference comes from the x86 code using 256-bit stream stores versus builtin_memcpy on AArch64.
Arm’s memcpy approach here may help:
Also, can you make use of the Arm NEON instructions that are built into Ampere for your use case? They would provide some AVX2-like behaviour and use more execution bandwidth: there are two NEON SIMD units per core, so potentially 256 per socket on Altra Max M128 SoCs.
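As a rough illustration of the NEON suggestion, here is a minimal sketch of a copy loop built on the standard `arm_neon.h` load/store intrinsics. The helper name `copy_neon` and the 64-byte unroll are assumptions made for the example, not taken from the original snippet, and it uses ordinary stores rather than `stnp`. On non-AArch64 builds it simply falls back to `memcpy` so that it still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copies n bytes from src to dst.
 * The buffers must not overlap (same contract as memcpy). */
static void copy_neon(void *dst, const void *src, size_t n)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

#if defined(__aarch64__)
    /* Main loop: 64 bytes per iteration as four 128-bit loads,
     * followed by four 128-bit stores. */
    while (n >= 64) {
        uint8x16_t v0 = vld1q_u8(s);
        uint8x16_t v1 = vld1q_u8(s + 16);
        uint8x16_t v2 = vld1q_u8(s + 32);
        uint8x16_t v3 = vld1q_u8(s + 48);
        vst1q_u8(d, v0);
        vst1q_u8(d + 16, v1);
        vst1q_u8(d + 32, v2);
        vst1q_u8(d + 48, v3);
        s += 64;
        d += 64;
        n -= 64;
    }
#endif

    /* Tail bytes (or the whole copy on non-AArch64 builds) go through memcpy. */
    if (n > 0) {
        memcpy(d, s, n);
    }
}
```

A compiler at `-O2` or higher will often generate very similar code from a plain `memcpy`, so a hand-written loop like this is mainly useful as a baseline for comparison.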
I’ll try using NEON directly and see how it compares.
So far it isn’t better than what the compiler comes up with.
The benchmark and the hardware are both actually performing as designed. The AmpereOne processor sacrifices single-thread memory performance in order to have such small, low-power cores. The picture changes when you scale up thread count and memory footprint. Also note that this test is entirely L1-cache resident and repeats the exact same copy many times to generate a performance number. In real applications it’s quite rare for the data you want to memcpy to already be sitting in L1, and when a memcpy does operate on L1-resident data it generally does so only once.
The performance gap vs. x86 will narrow (and eventually disappear) as you increase the number of parallel threads doing the memcpy, and also increase the data size to something greater than the 64 KB L1 cache.
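To see the effect of working-set size described above, a small timing helper along these lines could be used. This is only a sketch: the name `copy_bandwidth_gbs` and its parameters are hypothetical, the optimization barrier is GCC/Clang-specific inline asm, and `clock_gettime` assumes a POSIX system. It covers a single thread only.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: times `reps` memcpy calls of `bytes` bytes each and
 * returns the throughput in GB/s, or a negative value if allocation fails. */
static double copy_bandwidth_gbs(size_t bytes, int reps)
{
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so their pages are mapped before timing starts. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, bytes);
        /* GCC/Clang-specific barrier so the copy is not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    free(src);
    free(dst);

    if (secs <= 0.0) {
        return 0.0; /* timer too coarse for this amount of work */
    }
    return (double)bytes * (double)reps / secs / 1e9;
}
```

Calling it with sizes ranging from well under 64 KB up to several hundred megabytes, with `reps` reduced as the size grows, should show where the numbers drop off as the copy falls out of L1. Checking the thread-scaling part of the claim would additionally require running one instance per thread and summing the results.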
Memory-intensive LLM/AI workloads actually do perform quite well on AmpereOne, but it’s necessary to scale out to more parallel threads/CPUs to get there. Using more of these low-cost CPUs will increase the aggregate data bandwidth available to your VM.
I’m fixing the Tenstorrent stack so I can use my Blackhole with my AmpereOne. So far it matches the x86 numbers in certain benchmarks but not in others, and I’m trying little by little to cover the gaps. One of the low-hanging fruits was the optimized memcpys.
Thank you for confirming I’m not doing anything too stupid in that impl.
@lu_zero why do you pair the Blackhole with AmpereOne instead of an x86_64 CPU?
Because I have both pieces of hardware and potentially it could be a good match once some issues get ironed out.