Could Someone Give Me Advice for Optimizing Performance on Ampere Altra Systems?

Hello there,

I am new to working with Ampere Altra systems and have been exploring their potential for high-performance computing workloads. The architecture is impressive, and I am excited to fully utilize its capabilities.

My primary focus is on optimizing performance for a variety of applications, including data analytics, machine learning, and cloud-native services. I have read through some of the official documentation and have experimented with a few configurations, but I am interested in hearing from those who have more hands-on experience with these systems.

Are there any particular compiler flags or settings that you have found to significantly boost performance on Altra systems? :thinking: I have been using GCC primarily, but I am open to other compilers if they offer better results.

Given the high core count of Ampere Altra processors, what are the best practices for managing memory and minimizing latency? Any tips on NUMA configurations or other memory-related optimizations would be appreciated.

How do these systems perform with virtualized environments or containerized workloads? Are there specific settings or tools that can help optimize performance in these contexts? :thinking:

If you are using Altra systems for specialized workloads, such as AI/ML, HPC, or databases, what specific tuning or configurations have you found effective? :thinking:

Also, I have gone through this post, which definitely helped me out a lot: https://community.amperecomputing.com/t/help-needed-ampere-128-core-vs-ampere-96-core-jeff-geerling-performance-testing-golang/

I am also interested in any benchmarks or case studies that might provide insights into maximizing performance.

Thank you in advance for your help. :innocent:


You can take a look at this case study

Another case study from Netflix with the same problem (cache thrashing) but on Intel CPUs.

Hi @Sennorita ,

I don’t know your workloads, but our company specializes in software optimization in general, and in SIMD optimizations in particular, in C, C++ and Rust. On the topic of optimization on Arm in general, I have written a few Learning Path materials for Arm, which you can find gathered here along with other useful info:

Also, I’ve just completed an Arm SIMD on Rust guide, which is currently under review but should be live shortly; I’ll post a link when it’s done.


@konstantinos it is live. I was going to post something about it.

@Aaron I don’t see it anywhere yet :slight_smile:

@Aaron here it is: Learn how to write SIMD code on Arm using Rust | Arm Learning Paths I think it’s my best LP yet. :slight_smile:


Hi Sennorita!

In general, performance tuning is kind of black magic… but we do have a detailed guide: Performance Analysis Methodology for Optimizing Ampere Processors

There are a few “table stakes” things:

  1. For C: compiler flags specifying the architecture: GCC Guide for Ampere Processors - at a high level, on Ampere Altra, use -mcpu=neoverse-n1
  2. For Java: use the right JVM ergonomics (the best default is probably ParallelGC with max and min heap size at about 75% of available memory for your VM in cloud workloads) and a more recent JVM (Java 21 or later is about 30% faster than Java 8 with no code changes). Some information here may help: Unlocking Java performance on Ampere Altra
  3. For NUMA-related topics: it’s really best to run applications on a single NUMA node only, and to ensure that you put memory, network devices, and disk on the same NUMA node as your applications. The Scylla bug blog post that another poster pointed you to is an excellent resource on the performance tooling used to diagnose NUMA boundary issues. This performance brief walks you through all the performance tooling you will need, including how to diagnose NUMA issues: The First 10 Questions to Answer While Running on Ampere Altra-Based Instances
  4. For application-specific tuning: it really is application dependent. In general, if your application is memory-bound, you will want more memory available and should try to keep as much of your data in memory as possible: that means warming caches at start time and using HugePages where possible. You can also see performance improve by using larger kernel memory pages (Ampere supports the typical 4kB pages, but also 16kB and 64kB pages); bigger memory pages mean less frequent page faults and higher performance for certain workloads.
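To make items 1 and 2 concrete, here is a minimal command-line sketch. The binary name, jar name, and heap sizes are illustrative assumptions, not values taken from the guides above:

```shell
# C on Ampere Altra: target the Neoverse N1 core directly
gcc -O3 -mcpu=neoverse-n1 -o myapp myapp.c   # "myapp" is a placeholder

# Java: ParallelGC with min and max heap pinned to the same value,
# roughly 75% of memory (24g assumes a hypothetical 32 GB cloud VM)
java -XX:+UseParallelGC -Xms24g -Xmx24g -jar myapp.jar
```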
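For items 3 and 4, a sketch of the basic NUMA and huge-page commands, assuming node 0 and a NIC named eth0 (both placeholders for whatever your topology actually shows):

```shell
# Inspect the NUMA topology before pinning anything
numactl --hardware

# Bind the application's CPUs and memory allocations to a single node
numactl --cpunodebind=0 --membind=0 ./myapp

# Check which NUMA node a network device sits on (-1 means no affinity)
cat /sys/class/net/eth0/device/numa_node

# Reserve explicit huge pages (2 MB each on a 4kB-page kernel; the count is illustrative)
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
```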

There’s way too much to go into in one post, but I have been gathering information about architecture-related performance differences between x86 and Ampere, and if you search for my threads here you will find some more examples. I also have a backlog that I have been meaning to write up.

In general, the best place for this kind of thing is our workload briefs, which go into the settings used in our benchmarking, and our tuning guides, which describe tooling and methodology. Also, our developer stories in the case studies section of the website often go into low-level detail on performance tuning. For example, the NETINT case study describes accelerating video decoding using NEON intrinsics and improving throughput by using DMA instead of the IOMMU, and the Momento case study describes how thread context switches were hurting performance, so the team pinned cores for network I/O (to prevent kernel IRQs from interrupting the main threads) and for the main threads returning cached values, tripling requests per second at SLO.
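A rough sketch of the kind of core pinning described above, assuming a hypothetical machine where cores 0-3 take network IRQs and the remaining cores run the application (IRQ numbers, the eth0 device name, and the core ranges are all placeholders that vary per system):

```shell
# Stop irqbalance so manual affinity settings stick
sudo systemctl stop irqbalance

# Steer every eth0 interrupt onto cores 0-3
for irq in $(awk -F: '/eth0/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
  echo 0-3 | sudo tee /proc/irq/$irq/smp_affinity_list
done

# Keep the application's worker threads off the IRQ cores
taskset -c 4-63 ./myapp
```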

Hope this helps!
Dave.