Hi all,
I have another technical article in draft that I would love community input on - specifically, what should be in (and potentially out of) scope. I would like to document all of the ways in which you can impact the performance of Java applications running in the cloud (either in VMs or managed by Kubernetes). I have a few starting points - many of them, as you would expect, are not Arm64 specific - but of course I would love to highlight issues that will (positively) impact performance on Ampere especially.
So far, the scope of the article is:
- JVM configuration:
- Use a newer Java - Java 21 is over 30% faster out of the box than Java 8 on Arm64, just because of the optimization work (intrinsics, NEON implementations of core routines, general optimization) that has been done in the meantime
- Set max heap size to 75-85% of available memory - a cloud instance typically runs a single application, so setting the max heap lower than that just wastes memory you are paying for
- Choose the right GC - Java ergonomics may well pick SerialGC (the absolute worst choice in environments where you have multiple cores) because of either Kubernetes configuration or limited awareness of available memory - if in doubt, G1GC is a great default choice. For latency-sensitive and memory-intensive applications, ZGC or Shenandoah might be better choices
- For memory-intensive workloads, Transparent Huge Pages and a larger kernel page size can give you larger chunks of contiguous memory, and
  -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages
  will ensure that the memory is touched up front (costs start-up time in exchange for reduced memory latency in use)
- Try trading off start-up time for faster runtime by compiling hot methods directly with the top-tier compiler using
  -XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M
  (a combined command line example follows this list)
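To make this concrete, a combined command line for a single-application cloud VM might look something like this - the flag values and the jar name are illustrative, not a one-size-fits-all recommendation:

  java -XX:MaxRAMPercentage=80.0 \
       -XX:+UseG1GC \
       -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages \
       -jar myapp.jar

(-XX:MaxRAMPercentage, available since Java 10, sizes the heap as a percentage of detected memory, which is friendlier than a hard-coded -Xmx in containers.)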
- Manage resources in Kubernetes
- In the JVM configuration, regardless of how many millicores you are assigning to the workload, set
  -XX:ActiveProcessorCount=<n>
  to a value greater than 2 - Kubernetes resource allocations are quotas for time across all available cores, not a restriction to run on a single core
- Explicitly set memory and CPU allowances in the Kubernetes pod configuration - otherwise JVM ergonomics may detect more or fewer resources than are actually available - before the JVM became container-aware, it detected all of the host's memory as being available to it, even if many containers were running!
- Use Arm64 instances to reduce cost if you have the option! In Kubernetes, this is as simple as ensuring that your Java application is available in an Arm64 container image, then setting "nodeAffinity" to prefer running on Arm64 instances if they are available (see the pod spec sketch after this list). If you are running on a managed Kubernetes like OKE, the A1 instance type, powered by Ampere, offers considerable price and performance benefits over x86 equivalents.
- Basically echo the advice in "What's New in Java 17 - container awareness"
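Here is a minimal pod spec sketch showing both ideas - explicit resource requests/limits and a preference for Arm64 nodes. All names and values are illustrative, not a recommendation:

  apiVersion: v1
  kind: Pod
  metadata:
    name: myapp
  spec:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: ["arm64"]
    containers:
    - name: myapp
      image: example.com/myapp:latest   # hypothetical multi-arch image
      resources:
        requests:
          cpu: "2"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "2Gi"
      env:
      - name: JAVA_TOOL_OPTIONS         # picked up automatically by the JVM
        value: "-XX:ActiveProcessorCount=4"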
- Configure your OS to use what you're paying for
- Be aware that tail latencies and throughput are often in competition - configuring a system for predictable response times will impact median response time, while maximizing throughput may result in higher tail latencies and longer queue times
- Use tuned to set the CPU performance governor appropriately - the OS is tuned for "balanced" power to performance by default. If you care about performance, optimize your system for it (example commands follow this list):
  - network-throughput tunes buffer sizes for both the OS and network to maximize throughput (increased buffer sizes) and turns off power management features that power down devices when idle
  - network-latency reduces buffer sizes to ensure that traffic is handled as quickly as possible - but will impact throughput
  - tuned-adm supports dozens of profiles out of the box, and supports creating new ones that encapsulate kernel and hardware configuration to optimize for specific use-cases - see the docs for more
- Try a 64K kernel page size if your application is memory-intensive
- Setting a longer jiffy by lowering the kernel tick frequency reduces context switches and interrupts. Try compiling your kernel with CONFIG_HZ_100 for build servers
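For the tuned part, applying a profile is a one-liner (assuming the tuned daemon is installed and running):

  tuned-adm list                        # show available profiles
  tuned-adm profile network-throughput  # apply a throughput-oriented profile
  tuned-adm active                      # confirm which profile is active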
I was going to include something about optimizing memory access modes using VarHandle to avoid unnecessary barrier instructions in multithreaded environments, but frankly I don't understand it well enough to explain it!
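In case someone can check my understanding: the idea, as far as I can tell, is that VarHandle access modes like getAcquire/setRelease request weaker ordering than a volatile read/write, which can let the JIT emit cheaper instructions on Arm64. A minimal sketch (class and field names are made up for illustration, and corrections are very welcome):

  import java.lang.invoke.MethodHandles;
  import java.lang.invoke.VarHandle;

  public class Flag {
      private int ready; // accessed only through the VarHandle below

      private static final VarHandle READY;
      static {
          try {
              READY = MethodHandles.lookup()
                      .findVarHandle(Flag.class, "ready", int.class);
          } catch (ReflectiveOperationException e) {
              throw new ExceptionInInitializerError(e);
          }
      }

      // Publisher thread: a release store publishes everything written
      // before it, without requesting full sequential consistency.
      public void publish() {
          READY.setRelease(this, 1);
      }

      // Consumer thread: an acquire load pairs with the release store.
      public boolean isReady() {
          return (int) READY.getAcquire(this) == 1;
      }
  }

Whether this actually saves barrier instructions on a given JDK/CPU combination is exactly the part I would want an expert to confirm.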
Beyond this set of advice, are there any big things I'm missing? I feel like Java code out of the box is pretty close to parity in performance with x86 at this point.
Is there anything you would add?
Thanks,
Dave.