Looking for additional ideas: Optimizing Java applications for the cloud

Hi all,

I have another technical article in draft that I would love community input on - specifically, what should be in (and potentially not in) scope. I would like to document all of the ways you can impact the performance of Java applications running in the cloud (either in VMs or managed by Kubernetes). I have a few starting points - many of them, as you would expect, are not Arm64-specific - but of course I would love to highlight issues that will (positively) impact performance on Ampere especially.

So far, the scope of the article is:

  1. JVM configuration:
    • Use a newer Java - Java 21 is over 30% faster out of the box than Java 8 on Arm64, just because of the optimization work (intrinsics, NEON implementations of core routines, general optimizations) that has been done in the meantime
    • Set the maximum heap size to 75-85% of available memory - cloud applications are typically not sharing the instance with other applications, and if you set the max heap lower than that, you are paying for memory you never use
    • Choose the right GC - Java ergonomics may well pick SerialGC (the absolute worst choice in environments where you have multiple cores) because of either Kubernetes config or limited awareness of available memory - if in doubt, G1GC is a great default choice. For latency-sensitive and memory-intensive applications, ZGC or Shenandoah might be better choices
    • For memory-intensive workloads, using Transparent Huge Pages and a larger kernel page size can give you larger chunks of contiguous memory, and using -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages will ensure that the memory is pre-warmed for use (costs some start-up time in exchange for reduced memory latency at runtime)
    • Try trading off start-up time for faster runtime by disabling tiered compilation, so hot methods are compiled directly by the optimizing C2 compiler: -XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M (see the example invocation after this list)
  2. Manage resources in Kubernetes
    • In the JVM configuration, regardless of how many millicores you are assigning to the workload, set -XX:ActiveProcessorCount to a value greater than 2 - Kubernetes resource allocations are quotas for time across all available cores, not a restriction to run on a single core
    • Explicitly set memory and CPU allowances in the Kubernetes pod configuration - otherwise the JVM ergonomics may detect more or fewer resources than are actually available - in older JVMs, a container detected all of the host memory as being available to it, even if many containers were running!
    • Use Arm64 instances to reduce cost if you have the option! In Kubernetes, this is as simple as ensuring that your Java application is available in an Arm64 container image, then setting "nodeAffinity" to prefer running on Arm64 instances if they are available (see the example pod spec after this list). If you are running on a managed Kubernetes like OKE, the A1 instance type, powered by Ampere, offers considerable price and performance benefits over x86 equivalents.
    • Basically, echo the advice in "What's New in Java 17 - container awareness"
  3. Configure your OS to use what you're paying for
    • Be aware that tail latencies and throughput are often in competition - configuring a system for predictable response times will impact median response time, while maximizing throughput may result in higher tail latencies and longer queue times
    • Use tuned to set the CPU performance governor appropriately - the OS is tuned for "balanced" power-to-performance by default. If you care about performance, optimize your system for it (see the tuned-adm commands after this list)
    • The network-throughput profile increases buffer sizes for both the OS and the network to maximize throughput, and turns off power management features that power down idle devices
    • The network-latency profile reduces buffer sizes to ensure that traffic is handled as quickly as possible - but it will impact throughput
    • tuned-adm supports dozens of profiles out of the box, and supports creating new ones that encapsulate kernel and hardware configuration to optimize for specific use cases - see the docs for more
    • Try a kernel with a 64K page size if your application is memory-intensive
    • Set a longer jiffy by lowering the kernel tick frequency to reduce context switches and interrupts - try compiling your kernel with CONFIG_HZ_100 for build servers
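
To make some of this concrete, a few sketches follow. First, what the JVM flags from section 1 might look like together on the command line (the heap percentage and GC choice are just the suggestions from above; my-app.jar is a placeholder):

```
# ~80% of RAM for the heap, G1, pre-touched transparent huge pages,
# and C2-only compilation with a fixed-size code cache
java -XX:MaxRAMPercentage=80.0 \
     -XX:+UseG1GC \
     -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages \
     -XX:-TieredCompilation \
     -XX:InitialCodeCacheSize=64M -XX:ReservedCodeCacheSize=64M \
     -jar my-app.jar
```

For section 2, a minimal pod spec with explicit requests/limits, the processor-count hint, and a preference for Arm64 nodes (the image name and resource numbers are hypothetical; kubernetes.io/arch is the standard node label):

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-java-app
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values: ["arm64"]
  containers:
  - name: app
    image: example.com/my-java-app:latest  # multi-arch image with an arm64 variant
    env:
    - name: JAVA_TOOL_OPTIONS
      value: "-XX:ActiveProcessorCount=4"  # decoupled from the millicore quota below
    resources:
      requests:
        cpu: "500m"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "2Gi"
EOF
```

And for section 3, switching tuned profiles is a one-liner:

```
tuned-adm list                     # profiles shipped with tuned
tuned-adm profile network-latency  # or network-throughput, throughput-performance, ...
tuned-adm active                   # confirm the active profile
getconf PAGESIZE                   # 65536 on a 64K page-size kernel, 4096 otherwise
```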

I was going to include something about optimizing memory access modes using VarHandle to avoid unnecessary barrier instructions in multithreaded environments, but frankly I don't understand it well enough to explain it!

Beyond this set of advice, are there any big things I'm missing? I feel like Java code out of the box is pretty close to performance parity with x86 at this point.

Is there anything you would add?

Thanks,
Dave.


Hey @dneary, these are great optimizations, and there is a lot covered here. I have a couple of other notes to add, sharing some insights from running a set of OpenSearch clusters. OpenSearch is a fork of Elasticsearch, so I suspect these tunings could apply to optimizing both services.

Infrastructure: it is important to reduce the number of hops from database → server when running a client/server topology. This is fairly intuitive, but easy to overlook, and the result can be lower-than-expected performance. A hypothetical example would be comparing two platforms that are not similarly located on a network and comparing their respective throughput and latency. A simple ping or traceroute test can validate that the network topology meets expectations (see the example below).
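
For example, a quick sanity check from the application host (db.example.internal is a hypothetical hostname):

```
# Round-trip latency from the application host to the database host
ping -c 10 db.example.internal

# Hop-by-hop path - more hops than expected usually means traffic is
# taking a longer route than you assumed
traceroute db.example.internal
```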

SLC as L3: this is a setting that has to be configured from the BIOS at boot time, and it changes how the kernel schedules threads to utilize system memory more effectively. When running OpenSearch clusters, I have observed a ~10% increase in CPU utilization for most query operations, resulting in up to a 30% improvement in throughput (ops/s) and p99 latency (ms).

Hope this helps other engineers better optimize their applications. Happy testing! :slight_smile:

Interesting! What is the difference between SLC and L3 cache? In my mind they are more or less interchangeable, but 10% is a big difference! I see SLC as a shared pool of cache, and I've been interested to see how MPAM in AmpereOne (what we call "Memory Quality of Service") can be used to essentially tell processes not to use (or to limit their use of) SLC. I'm curious what the "Use SLC as L3 cache" setting changes in behavior that could explain their performance difference.

Tagging @hrw or @bexcran - do either of you know?

Over on LinkedIn, Ben Evans, author of the book "Optimizing Cloud Native Java", reached out - an article he wrote for developers.redhat.com is chock full of great advice about running Java in the cloud! Best practices for Java in single-core containers | Red Hat Developer

He also pointed me at the wonderful resource Containerize your Java Applications - Java on Azure | Microsoft Learn

They are interchangeable. The 'SLC as L3 Cache' option tells the OS about the presence of the SLC so it can schedule things differently. When it's disabled, the OS doesn't know it exists and it's handled transparently by the CPU.
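
One way to see the difference from the OS side (a sketch; lscpu -C needs a reasonably recent util-linux):

```
# With "SLC as L3 Cache" enabled, the kernel should report an L3 cache
lscpu -C

# Or read the cache topology sysfs exposes for a given CPU
grep . /sys/devices/system/cpu/cpu0/cache/index*/level \
       /sys/devices/system/cpu/cpu0/cache/index*/size
```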


I found one online resource saying that this results in performance improvements specifically in 1P systems in monolithic mode - is that right? Does that specific configuration avoid some SLC overhead that you would have in 2P systems or a different NUMA config?

'SLC as L3 Cache' can only be enabled in 1P, monolithic mode. Since it's memory-side, in other configurations it can't be attached to an ACPI processor node.

From https://support.hpe.com/hpesc/public/docDisplay?docId=sd00003788en_us&page=GUID-9892A1F6-EDAB-4544-90A1-DD8816E08078.html:

Use SLC as L3 cache to enable or disable using the SLC as L3 Cache and improve system performance in 1P systems. The SLC is not a traditional processor-side L3 or L4 cache. The SLC is a memory-side cache. For 1P systems in monolithic ANC mode, the SLC functions as a traditional 16 MB L3 cache.

Yup! That was the resource I found. Thanks.


Tagging some additional references (noting them here for myself):


You could suggest some methods that help with monitoring or profiling Java applications, like Grafana Pyroscope.

We cannot optimize what we cannot measure.
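
For example, with Grafana Pyroscope's Java agent (the application name and server address are placeholders - check the Pyroscope docs for the current configuration):

```
# Attach the Pyroscope agent for continuous profiling of the JVM
export PYROSCOPE_APPLICATION_NAME=my-java-app
export PYROSCOPE_SERVER_ADDRESS=http://pyroscope:4040
java -javaagent:pyroscope.jar -jar my-app.jar
```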

I think this is the reason this option is not enabled by default in the BIOS.
