Looking for additional ideas: Optimizing Java applications for the cloud

Hi all,

I have another technical article in draft that I would love community input on - specifically, what should be in (and potentially not in) scope. I would like to document all of the ways you can impact the performance of Java applications running in the cloud (either in VMs or managed by Kubernetes). I have a few starting points - many of them, as you would expect, are not Arm64-specific - but of course I would love to highlight issues that will (positively) impact performance on Ampere especially.

So far, the scope of the article is:

  1. JVM configuration:
    • Use a newer Java - Java 21 is over 30% faster out of the box than Java 8 on Arm64, just because of the optimization work (intrinsics, NEON implementations of core routines, general optimizations) that has been done in the meantime
    • Set the maximum heap size to 75-85% of available memory - cloud applications are typically not sharing the instance with other applications, and if you set the max heap lower than that, you are paying for memory you never use
    • Choose the right GC - Java ergonomics may well pick SerialGC (the absolute worst choice in environments where you have multiple cores) because of either Kubernetes config or limited awareness of available memory - if in doubt, G1GC is a great default choice. For latency-sensitive and memory-intensive applications, ZGC or Shenandoah might be better choices
    • For memory-intensive workloads, using Transparent Huge Pages and a larger kernel page size can give you larger chunks of contiguous memory, and using -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages will ensure that the memory is pre-warmed for use (costs some start-up time in exchange for reduced memory latency at runtime)
    • Try trading off start-up time for faster runtime by disabling tiered compilation, so hot methods are compiled directly by the optimizing C2 compiler: -XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M (see the example invocation after this list)
  2. Manage resources in Kubernetes
    • In the JVM configuration, regardless of how many millicores you are assigning to the workload, set -XX:ActiveProcessorCount to a value greater than 2 - Kubernetes resource allocations are quotas for time across all available cores, not a restriction to run on a single core
    • Explicitly set memory and CPU allowances in the Kubernetes pod configuration - otherwise the JVM ergonomics may detect more or fewer resources than are actually available - in older JVMs, a container detected all of the host memory as being available to it, even if many containers were running!
    • Use Arm64 instances to reduce cost if you have the option! In Kubernetes, this is as simple as ensuring that your Java application is available in an Arm64 container image, then setting "nodeAffinity" to prefer running on Arm64 instances if they are available (see the example pod spec after this list). If you are running on a managed Kubernetes like OKE, the A1 instance type, powered by Ampere, offers considerable price and performance benefits over x86 equivalents.
    • Basically, echo the advice in "What's New in Java 17 - container awareness"
  3. Configure your OS to use what you're paying for
    • Be aware that tail latencies and throughput are often in competition - configuring a system for predictable response times will impact median response time, while maximizing throughput may result in higher tail latencies and longer queue times
    • Use tuned to set the CPU performance governor appropriately - the OS is tuned for "balanced" power-to-performance by default. If you care about performance, optimize your system for it (see the tuned-adm commands after this list)
    • The network-throughput profile increases buffer sizes for both the OS and the network to maximize throughput, and turns off power management features that power down idle devices
    • The network-latency profile reduces buffer sizes to ensure that traffic is handled as quickly as possible - but it will impact throughput
    • tuned-adm supports dozens of profiles out of the box, and supports creating new ones that encapsulate kernel and hardware configuration to optimize for specific use cases - see the docs for more
    • Try a kernel with a 64K page size if your application is memory-intensive
    • Set a longer jiffy by lowering the kernel tick frequency to reduce context switches and interrupts - try compiling your kernel with CONFIG_HZ_100 for build servers
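
To make some of this concrete, a few sketches follow. First, what the JVM flags from section 1 might look like together on the command line (the heap percentage and GC choice are just the suggestions from above; my-app.jar is a placeholder):

```
# ~80% of RAM for the heap, G1, pre-touched transparent huge pages,
# and C2-only compilation with a fixed-size code cache
java -XX:MaxRAMPercentage=80.0 \
     -XX:+UseG1GC \
     -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages \
     -XX:-TieredCompilation \
     -XX:InitialCodeCacheSize=64M -XX:ReservedCodeCacheSize=64M \
     -jar my-app.jar
```

For section 2, a minimal pod spec with explicit requests/limits, the processor-count hint, and a preference for Arm64 nodes (the image name and resource numbers are hypothetical; kubernetes.io/arch is the standard node label):

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-java-app
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values: ["arm64"]
  containers:
  - name: app
    image: example.com/my-java-app:latest  # multi-arch image with an arm64 variant
    env:
    - name: JAVA_TOOL_OPTIONS
      value: "-XX:ActiveProcessorCount=4"  # decoupled from the millicore quota below
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "2Gi"
EOF
```

And for section 3, switching tuned profiles is a one-liner:

```
tuned-adm list                     # profiles shipped with tuned
tuned-adm profile network-latency  # or network-throughput, throughput-performance, ...
tuned-adm active                   # confirm the active profile
getconf PAGESIZE                   # 65536 on a 64K page-size kernel, 4096 otherwise
```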

I was going to include something about optimizing memory access modes using VarHandle to avoid unnecessary barrier instructions in multithreaded environments, but frankly I don't understand it well enough to explain it!

Beyond this set of advice, are there any big things I'm missing? I feel like Java code out of the box is pretty close to performance parity with x86 at this point.

Is there anything you would add?

Thanks,
Dave.


Hey @dneary, these are great optimizations, and there is a lot covered here. I have a couple of other notes to add, sharing some insights from running a set of OpenSearch clusters. OpenSearch is a fork of Elasticsearch, so I suspect these tunings could apply to optimizing both services.

Infrastructure: it is important to reduce the number of hops from database → server when running a client/server topology. This is fairly intuitive, but easy to overlook, and the result can be lower-than-expected performance. A hypothetical example would be comparing two platforms that are not similarly located on a network and comparing their respective throughput and latency. A simple ping or traceroute test can validate that the network topology meets expectations (see the example below).
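
For example, a quick sanity check from the application host (db.example.internal is a hypothetical hostname):

```
# Round-trip latency from the application host to the database host
ping -c 10 db.example.internal

# Hop-by-hop path - more hops than expected usually means traffic is
# taking a longer route than you assumed
traceroute db.example.internal
```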

SLC as L3: this is a setting that has to be configured from the BIOS at boot time, and it changes how the kernel schedules threads to utilize system memory more effectively. When running OpenSearch clusters, I have observed a ~10% increase in CPU utilization for most query operations, resulting in up to a 30% improvement in throughput (ops/s) and p99 latency (ms).

Hope this helps other engineers better optimize their applications. Happy testing! :slight_smile:

Interesting! What is the difference between SLC and L3 cache? In my mind they are more or less interchangeable, but 10% is a big difference! I see SLC as a shared pool of cache, and I've been interested to see how MPAM in AmpereOne (what we call "Memory Quality of Service") can be used to essentially tell processes not to use (or to limit their use of) SLC. I'm curious what the "Use SLC as L3 cache" setting changes in behavior that could explain their performance difference.

Tagging @hrw or @bexcran - do either of you know?

Over on LinkedIn, Ben Evans, author of the book "Optimizing Cloud Native Java", reached out - an article he wrote for developers.redhat.com is chock full of great advice about running Java in the cloud! Best practices for Java in single-core containers | Red Hat Developer

He also pointed me at the wonderful resource Containerize your Java Applications - Java on Azure | Microsoft Learn

They are interchangeable. The 'SLC as L3 Cache' option tells the OS about the presence of the SLC so it can schedule things differently. When it's disabled, the OS doesn't know it exists and it's handled transparently by the CPU.
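
One way to see the difference from the OS side (a sketch; lscpu -C needs a reasonably recent util-linux):

```
# With "SLC as L3 Cache" enabled, the kernel should report an L3 cache
lscpu -C

# Or read the cache topology sysfs exposes for a given CPU
grep . /sys/devices/system/cpu/cpu0/cache/index*/level \
       /sys/devices/system/cpu/cpu0/cache/index*/size
```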


I found one online resource saying that this results in performance improvements specifically in 1P systems in monolithic mode - is that right? Does that specific configuration avoid some SLC overhead that you would have in 2P systems or a different NUMA config?

'SLC as L3 Cache' can only be enabled in 1P, monolithic mode. Since it's memory-side, in other configurations it can't be attached to an ACPI processor node.

From https://support.hpe.com/hpesc/public/docDisplay?docId=sd00003788en_us&page=GUID-9892A1F6-EDAB-4544-90A1-DD8816E08078.html:

Use SLC as L3 cache to enable or disable using the SLC as L3 Cache and improve system performance in 1P systems. The SLC is not a traditional processor-side L3 or L4 cache. The SLC is a memory-side cache. For 1P systems in monolithic ANC mode, the SLC functions as a traditional 16 MB L3 cache.

Yup! That was the resource I found. Thanks.


Tagging some additional references (noting them here for myself):


You could suggest some methods that help with monitoring or profiling Java applications, like Grafana Pyroscope.

We cannot optimize what we cannot measure.
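
For example, with Grafana Pyroscope's Java agent (the application name and server address are placeholders - check the Pyroscope docs for the current configuration):

```
# Attach the Pyroscope agent for continuous profiling of the JVM
export PYROSCOPE_APPLICATION_NAME=my-java-app
export PYROSCOPE_SERVER_ADDRESS=http://pyroscope:4040
java -javaagent:pyroscope.jar -jar my-app.jar
```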

I think this is the reason this option is not enabled by default in the BIOS.
