Unstable benchmark results

We recently got access to a bare-metal Ampere server from Oracle Cloud. We are using it to do research on the performance of C/C++ compilers. We are comparing the results across AMD, Ampere, and Intel servers.

However, while the results on the x86 machines are fairly stable, we are seeing wildly varying results on the Ampere machine: run-time variations of up to 25% between runs, versus low single-digit percentages on x86.

The Ampere machine has Ubuntu installed.

We’ve already tried the usual things, such as setting core affinity and fiddling with the CPU governor settings (which appear to be unavailable on this machine), etc.
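For context, the affinity and governor checks we tried looked roughly like this (the core range and the pinned command are just examples, not our actual workload):

```shell
# Check whether the cpufreq governor knobs are exposed at all
# (on our Ampere box this file does not exist)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null \
  || echo "cpufreq interface not available"

# Pin a run to a fixed set of cores on one socket (example core range)
taskset -c 0-7 sh -c 'echo "pinned run goes here"'  # replace with the real benchmark command
```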

Does anyone have any recommendation on how to make performance results more stable on Ampere platforms?

Thank you!

@Erik Have you seen such a test variation in some of the tests you have done?

@nlopes I know it isn’t the same, but this will take you to Erik’s blogs about OCI while you wait for a response.

New blog: Arm vs X64 Oracle Database performance!

And welcome to the community! And thanks for the question.


I’m curious what the test was, and what kernel was used? All of my testing to date has been on Oracle Linux with the UEK kernel, which has almost always given better performance than RHEL or SUSE in my experience.

Was it CPU time that was variable, or was the test storage-bound? Happy to lend you an hour to take a look.


What benchmarks are you running? I would love to dig in! I have heard of some software bugs that caused performance variance.

One such example is the compress-zstd-1.6.0 benchmark from Phoronix.

For the first test in the benchmark, i.e. Compression Level: 3 - Compression Speed, on ARM we get a mean of 1689.6 MB/s with a relative standard deviation of 4.2%. On x86 we get a mean of 1301.1 MB/s with a relative standard deviation of 0.4%.
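For reference, the relative standard deviation quoted here is just the sample standard deviation divided by the mean. A quick sketch, using hypothetical per-run throughputs rather than our actual raw data:

```python
import statistics

def relative_stddev(samples):
    """Relative standard deviation (RSD) as a percentage: stdev / mean * 100."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0

# Hypothetical per-run throughputs (MB/s) for an unstable ARM run
arm_runs = [1620.0, 1750.0, 1689.6, 1598.0, 1790.0]
print(f"RSD: {relative_stddev(arm_runs):.1f}%")
```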

I attached the result files for both ARM [1] and x86 [2] so you can get an idea of the environments we use. To compile the benchmark you will need a stock Clang 16; use the following commands:

> CC=/path/to/clang-16 CXX=/path/to/clang++-16 phoronix-test-suite debug-install compress-zstd-1.6.0
> phoronix-test-suite batch-run compress-zstd-1.6.0
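As an aside, if more samples per result would help when comparing variance, the Phoronix Test Suite honors the FORCE_TIMES_TO_RUN environment variable to override the repetition count (guarded here in case the suite is not installed):

```shell
# Force 10 repetitions of each test instead of the default dynamic count
if command -v phoronix-test-suite >/dev/null 2>&1; then
  FORCE_TIMES_TO_RUN=10 phoronix-test-suite batch-run compress-zstd-1.6.0
else
  echo "phoronix-test-suite not installed"
fi
```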


[1] compress-zstd-1.6.0-baseline-arm.pdf - Google Drive
[2] compress-zstd-1.6.0-baseline-x86.pdf - Google Drive


You’re running in a dual socket scenario? I noticed your CPU for the Arm test was 160 cores - I assume that’s two Altras? Is there any cross-socket communication?


Yes, we use all the cores for our benchmarks. We have two NUMA nodes with 80 cores each. I’m attaching the output of lscpu:

lucian@arm:~$ lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 160
  On-line CPU(s) list:  0-159
Vendor ID:              ARM
  Model name:           Neoverse-N1
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 80
    Socket(s):          2
    Stepping:           r3p1
    Frequency boost:    disabled
    CPU max MHz:        3000.0000
    CPU min MHz:        1000.0000
    BogoMIPS:           50.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):    
  L1d:                  10 MiB (160 instances)
  L1i:                  10 MiB (160 instances)
  L2:                   160 MiB (160 instances)
NUMA:
  NUMA node(s):         2
  NUMA node0 CPU(s):    0-79
  NUMA node1 CPU(s):    80-159
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; CSV2, BHB
  Srbds:                Not affected

@lucianp, can you try a quick test with the 76-core A1 VM in OCI? That’s basically a virtualized subset of the machine you ran the test on. The results will likely be worse than with all 160 cores, but it would be interesting to observe the variance.

@naren it’s not feasible for us to try a test with the 76-core A1 VM in OCI. What we can do is use only one NUMA node of the current server, if that’s the direction you’re pointing towards.
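A single-node run could look something like this (assuming numactl is available; the test name is the zstd one from earlier in the thread):

```shell
# Bind both the CPUs and the memory allocations to NUMA node 0 (cores 0-79),
# so no cross-socket traffic is involved in the run
if command -v numactl >/dev/null 2>&1; then
  numactl --cpunodebind=0 --membind=0 phoronix-test-suite batch-run compress-zstd-1.6.0
else
  echo "numactl not installed"
fi
```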

The thing is that we see high variance even for single-threaded benchmarks. I attached the results for ngspice-1.0.0 [1,2]. For the second test, on ARM, we get a relative standard deviation (RSD) of 7%, while on x86 the RSD is under 1% for both tests.

[1] ngspice-1.0.0-baseline-arm.pdf - Google Drive
[2] ngspice-1.0.0-baseline-x86.pdf - Google Drive

Please note that if you wish to reproduce the ngspice results, you need to delete the --enable-openmp option from the install.sh script here [1].

[1] Ngspice v1.0.0 Test [ngspice] - OpenBenchmarking.org

Can I please see the steps you used to download and run the test?


> CC=/path/to/clang-16 CXX=/path/to/clang++-16 phoronix-test-suite debug-install compress-zstd-1.6.0
> phoronix-test-suite batch-run compress-zstd-1.6.0

These are the steps. I did my experiments with Phoronix v10.8.4.


I did some testing, and the results look more stable after the machine has been running for a little while. I wonder if there is a thermal issue? Ideas?
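If it is thermal, a warm-up phase before the measured runs might confirm it. A sketch (the warm-up command is a stand-in for the real workload, and the hwmon sensor paths may not exist on every machine):

```shell
# Warm the machine up and discard the first results, so the timed runs
# start from a steady thermal state
for i in 1 2 3; do
  sleep 1   # stand-in for a discarded warm-up run of the benchmark
done

# Dump any exposed thermal sensors between runs, to correlate temperature
# with the observed run-time variance
for f in /sys/class/hwmon/hwmon*/temp*_input; do
  [ -e "$f" ] && echo "$f: $(cat "$f") millidegrees C" || true
done
```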