We recently got access to a bare-metal Ampere server from Oracle Cloud. We are using it to do research on the performance of C/C++ compilers. We are comparing the results across AMD, Ampere, and Intel servers.
However, while the results on the x86 machines are fairly stable, the results on the Ampere machine vary widely: run times can differ by 25% between runs, while on x86 the variation is in the low single-digit percent range.
The Ampere machine has Ubuntu installed.
We’ve already tried the usual things, like setting core affinity, fiddling with CPU governor settings (which seem to be unavailable), etc.
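For context, this is roughly what we mean by setting core affinity (a minimal Python sketch; the core IDs and the benchmark command are placeholders, not our actual setup):

```python
import os
import subprocess

# Placeholder core set: pin to a few cores on one socket so the scheduler
# cannot migrate the benchmark between or during runs.
PINNED_CORES = {0, 1, 2, 3}

# Apply the affinity mask to the current process (pid 0); child processes
# launched afterwards inherit it on Linux.
os.sched_setaffinity(0, PINNED_CORES)

# Placeholder benchmark command, run under the restricted mask.
subprocess.run(["./run-benchmark.sh"], check=True)
```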
Does anyone have any recommendations on how to make performance results more stable on Ampere platforms?
I’m curious what the test was, and what kernel was used. All of my testing to date has been on OL with the UEK kernel, which has almost always given better performance than RHEL or SUSE.
Was it CPU time that was variable? Or was it storage bound? Happy to lend you all an hour to take a look.
One such example is the compress-zstd-1.6.0 benchmark from Phoronix.
For the first test in the benchmark, i.e. Compression Level: 3 - Compression Speed, on ARM we get a mean of 1689.6 MB/s with a relative standard deviation of 4.2%. On x86 we get a mean of 1301.1 MB/s with a relative standard deviation of 0.4%.
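By RSD we mean the sample standard deviation divided by the mean of the per-run throughputs; a minimal sketch of the calculation (the run values below are made up for illustration, not our measurements):

```python
import statistics

# Hypothetical per-run throughputs in MB/s, only to illustrate the calculation.
runs_mb_per_s = [1620.4, 1755.2, 1698.1, 1684.7]

mean = statistics.mean(runs_mb_per_s)
stdev = statistics.stdev(runs_mb_per_s)        # sample standard deviation
rsd_percent = 100.0 * stdev / mean

print(f"mean = {mean:.1f} MB/s, RSD = {rsd_percent:.1f}%")
```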
I attached the result files for both ARM [1] and x86 [2] so you can get an idea of the environments we use. To compile the benchmark you will need stock clang 16; use the following instructions:
You’re running in a dual-socket scenario? I noticed the CPU for the Arm test showed 160 cores - I assume that’s two Altras? Is there any cross-socket communication?
@lucianp, can you try a quick test with the 76-core A1 VM in OCI? That’s basically a virtualized subset of the machine you ran the test on. The results will likely be worse than when using all 160 cores, but it would be interesting to observe the variance.
@naren it’s not feasible for us to run a test with the 76-core A1 VM in OCI. What we can do is use only a single NUMA node of the current server, if that’s the direction you’re pointing towards.
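If we go that route, this is roughly how we’d restrict a run to one node (a sketch that assumes node 0 and reads the node’s CPU list from sysfs; the benchmark command is a placeholder):

```python
import os
import subprocess

def parse_cpulist(text):
    """Parse a sysfs cpulist string such as '0-39,80-119' into a set of CPU IDs."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# CPUs belonging to NUMA node 0 (assumption: node 0 is the one we'd use).
with open("/sys/devices/system/node/node0/cpulist") as f:
    node0_cpus = parse_cpulist(f.read())

os.sched_setaffinity(0, node0_cpus)                 # inherited by the child below
subprocess.run(["./run-benchmark.sh"], check=True)  # placeholder benchmark command
```

Note this only restricts CPU placement; if we also want memory allocated from the same node, running under numactl with --cpunodebind and --membind would be the more complete option.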
The thing is that even for single-threaded benchmarks we see high variance in the results. I attached the results for ngspice-1.0.0 [1, 2]. For the second test, on ARM we have a relative standard deviation (RSD) of 7%, while on x86 the RSD for both tests is under 1%.