Platform-related JVM settings?

Hi all,

Calling all Java developers! @Erik I’m looking at you :wink: I’ve been wondering when the underlying platform behaviour can have an effect on application performance - anecdotally, I have heard that there can be performance differences between Ampere and x86 instances (also in virtualized vs bare metal environments), and that many of those performance differences can be mitigated or overcome by setting different defaults for the host and the JVM (things like core pinning, changing GC settings, heap size, I/O block size, …).

I would like to put together a guide to JVM tuning for given workloads, with recommendations of when to use what garbage collector, figuring out heap sizes, and which system settings can affect performance of the JVM - with a focus on mitigating architectural differences. Anyone interested in collaborating on this?

Dave.

2 Likes

I would love to see that; thanks for this initiative, @dneary. :+1: Within the Jenkins project, we’re increasingly using aarch64 machines in Azure Cloud, but haven’t yet felt the need to fine-tune the JVMs for this platform.
We were already seeing a 5% decrease in our bill.
If tuning the virtual machines could also improve performance, count me in!

3 Likes

From following Bruno Borges, I already know a few bits and pieces:

  • 1-CPU instances: Parallel GC is fine; aim for a heap size of around 75% of available memory
  • 2 CPUs or more: explicitly set Parallel GC or, for large memory requirements, G1, CMS, or Z (depending on the size of the heap)
  • When using Kubernetes with replicas, fewer, bigger instances will perform better than more, smaller instances
  • If you cap CPU usage at 1000 millicores, you are likely to throttle your CPU. Use 2000 millicores as a minimum, or the actual cores available for larger instances, to reduce CPU waits, increase throughput, and decrease tail latencies.

There are many more tuning options; these are generic “getting started” settings, well worth applying - but they are not platform dependent. I would love to have a better understanding of where platform design differences can impact performance with default settings.
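To make those starting points concrete, here is a minimal sketch of what they might look like for a containerized service - the image name, jar path, and resource sizes are placeholders, not recommendations:

```
# Hypothetical 2-vCPU, 2 GiB container running a Java service
docker run --cpus=2 --memory=2g my-java-app:latest \
    java \
    -XX:+UseParallelGC \
    -XX:MaxRAMPercentage=75.0 \
    '-Xlog:gc*:file=/tmp/gc.log' \
    -jar /app/app.jar

# -XX:+UseParallelGC        : choose the GC explicitly instead of relying on ergonomics
# -XX:MaxRAMPercentage=75.0 : cap the heap at ~75% of the container's memory limit
# -Xlog:gc*                 : unified GC logging (JDK 9+) so you can verify the effect
```

On Kubernetes, the equivalent is setting resource limits on the pod and passing the same flags through the container command or JAVA_TOOL_OPTIONS.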

Also, there is already an “Unlocking Java performance” tuning guide: Unlocking Java Performance Tuning Guide - but, at least to me, it doesn’t seem to be very workload-focused - I imagine that different tuning options might be appropriate if you are, say, using Kubernetes with 16-core VMs in your node pool.

OK - after a lot of reading and experimenting, here’s a summary of what I have learned so far.

  • While there are occasional differences at the platform level, most optimization starts at the application level and is common across x86 and Arm64 - only rarely will you find major platform-specific issues
  • There are a small number of exceptions - specifically, the performance hit for communicating across NUMA boundaries on Arm can be high. If in doubt, rerun your performance testing in a 1P (single-socket) configuration to see if the issue can be reproduced
  • At the highest level, start by seeing where your application spends its time. Is it bottlenecked on network or disk I/O, on memory accesses, or on CPU? If you are not maxing out your CPU, your problem is unlikely to be related to the architecture
  • Garbage collector and heap size decisions:
    • Avoid the JVM’s default ergonomics for standalone cloud workloads - they were designed for shared server workloads and workstations
    • If you see a lot of CPU time spent waiting for garbage collection runs to finish, start with your garbage collector - are you using the optimal GC and heap size for your application? SerialGC is the default in single-core containers with small heap sizes, but ParallelGC may be better even there, because single-core workloads can still have thread pools
    • For large heaps and more than one core, use G1, Z, or Shenandoah, depending on the workload
    • For cloud container workloads, set the maximum heap size with -XX:MaxRAMPercentage to 75% of the container’s intended memory allocation
  • For container workloads, use Kubernetes resource limits for the CPU and memory allocated to containers - fewer, larger instances will work better than many small replicas. You can still tell the JVM to use more processors with -XX:ActiveProcessorCount=2 even when a container’s CPU limit is 1000m, so that you don’t exhaust your time allotment before the end of a scheduling period (because of threading, Java applications can consume more than 1000m of time slices even when running on a single CPU) - see the sketch after this list
  • Outside of general JVM settings: profiling tools! Start with Java Flight Recorder, gprofng, and perf with flame graphs (sketch below)
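To illustrate the processor-count point above, here is a rough sketch, using docker run as a stand-in for a Kubernetes CPU limit - the image name and sizes are illustrative:

```
# JVM limited to 1 CPU's worth of time, but told it has 2 processors available,
# so GC, JIT, and common thread pools are not sized down to a single thread
docker run --cpus=1 --memory=1g my-java-app:latest \
    java \
    -XX:ActiveProcessorCount=2 \
    -XX:MaxRAMPercentage=75.0 \
    -jar /app/app.jar

# Verify what the JVM thinks it has available
java -XX:ActiveProcessorCount=2 -XX:+PrintFlagsFinal -version | grep ActiveProcessorCount
```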

What happens next will depend a lot on what you see here! If you are maxing out your CPU, the next place to look is where you are losing time. There are lots of places where you can lose time, and here’s where you get into the Dark Magic of system performance tuning (PDF link to “Performance Analysis Methodology for Optimizing Altra Family CPUs”). Brendan Gregg again has great advice on this.
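As a starting point for that kind of investigation, here is a sketch of the profiling tools mentioned above - the PID and durations are placeholders:

```
# Java Flight Recorder: record 60 seconds on a running JVM (PID 12345 is hypothetical)
jcmd 12345 JFR.start duration=60s filename=/tmp/app.jfr

# ...or enable recording at startup:
#   java -XX:StartFlightRecording=duration=60s,filename=/tmp/app.jfr -jar app.jar

# CPU profile with perf, rendered as a flame graph (assumes Brendan Gregg's
# FlameGraph scripts are on the PATH; resolving Java frames may also need
# -XX:+PreserveFramePointer and a perf map agent)
perf record -F 99 -g -p 12345 -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```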

Here are just two examples of optimizations of this kind:

  1. One of the places where performance hits can happen is context switches. Momento tripled throughput at SLO by pinning critical threads to specific cores to minimize context switching for long-running threads, and by pinning network I/O handling to different cores. This works because network TX and RX interrupts preempt userspace and trigger a context switch - pinning them away from the application cores avoids context-switching overhead for critical application threads and optimizes network I/O (a rough sketch of the mechanics follows after this list).
  2. NUMA issues can be sneaky buggers - ScyllaDB developer Michał Chojnowski went deep to find that writes to the reference “nil_root” (which does nothing - it is a special global sentinel node used by C++ tree algorithms to prevent illegal memory writes) sometimes landed on a different NUMA node from the rest of the program - and when they did, they caused cache misses that added considerable overhead. Oracle has a great guide on optimizing workloads on Ampere instances, by the way.
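For reference, here is a rough sketch of the mechanics behind the first example, plus a quick NUMA experiment for the second - the core numbers, IRQ number, and node number are hypothetical and machine-specific (check /proc/interrupts and numactl --hardware on your own system):

```
# Pin the JVM to cores 2-7, keeping cores 0-1 free for interrupt handling
taskset -c 2-7 java -jar /app/app.jar &

# Steer the NIC's RX/TX interrupt (IRQ 65, hypothetical) onto cores 0-1 (CPU mask 0x3)
echo 3 > /proc/irq/65/smp_affinity

# NUMA experiment: confine the JVM to a single node to see whether a
# cross-node slow-down still reproduces
numactl --cpunodebind=0 --membind=0 java -jar /app/app.jar
```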

In both cases, the issue was not inherently related to the architecture, although it showed up as a slow-down on Ampere relative to expected performance. The lesson is: first identify what your bottleneck is, then see where you are spending time if the bottleneck is CPU related, and then dig in to identify the remedy. It may end up being JVM tuning or garbage collection parameters, but that might not suffice.

2 Likes

Having said that, there are some JVM settings that can deliver better performance than the defaults for workloads like SPECjbb. Specifically, Arm have improved throughput by up to 80% by using large pages, increasing the initial and maximum heap size, using ParallelGC and configuring it to use more GC threads, and tweaking various memory configuration options in the kernel. But, as I said above, results like these are workload dependent, and probably not good defaults for all Java applications. It turns out that Java performance tuning is the definition of YMMV (Your Mileage May Vary)!
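For context, a throughput-oriented configuration in that spirit might look something like the sketch below - the huge page count, heap size, and GC thread count are illustrative, not the values Arm used:

```
# Reserve explicit huge pages in the kernel (count is illustrative)
sysctl -w vm.nr_hugepages=2048

# Throughput-oriented JVM: large pages, ParallelGC with extra GC threads, fixed heap
java \
    -XX:+UseLargePages \
    -XX:+UseParallelGC \
    -XX:ParallelGCThreads=16 \
    -Xms8g -Xmx8g \
    -jar benchmark.jar
```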

2 Likes

Alafia AI got a tremendous performance improvement moving to 64K pages. Ampere engineers helped.

2 Likes

Large memory pages can be a huge performance improvement for a lot of use-cases! They can also reveal bugs in code that assumes 4K kernel memory pages.
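As a quick sanity check when moving between 4K-page and 64K-page kernels, you can ask the system for the page size rather than assuming it - a trivial sketch:

```
# Prints 4096 on a 4K-page kernel, 65536 on a 64K-page kernel
getconf PAGESIZE

# Native code should query sysconf(_SC_PAGESIZE) at runtime instead of hard-coding 4096
```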

Which is exactly what we discovered en route to those huge performance improvements 🤣

1 Like