Arm64 performance and Arm memory model (barriers)

This is part me sharing some stuff I learned recently, part questioning my understanding.

So there’s a class of performance fixes in the JVM that I’ve come across recently: removing unnecessary barrier instructions. I find the whole thing fascinating, but I also have questions.

What are barrier instructions?

There are three types of barrier instructions:

  1. Instruction Synchronization Barrier (ISB) flushes the pipeline, so that every instruction after the barrier is fetched and decoded under the current system state - for example, after a change to the MMU configuration
  2. Data Memory Barrier (DMB) ensures that memory accesses before the barrier are observed before memory accesses after it, without stalling non-memory instructions
  3. Data Synchronization Barrier (DSB) is stronger than a DMB: it waits for all outstanding memory accesses before the barrier to complete, and no instruction after it executes until then

I might be a little off with those definitions - but basically, these create something like what is called a sequence point in C. The processor is allowed to execute some memory accesses out of program order (say two writes to different memory locations, or several reads with no writes in between), and a barrier forces ordering where that matters. This flexibility comes from Arm’s weakly ordered (“soft”) memory model, whereas x86 guarantees much stricter ordering of memory reads and writes.

The problem, I think, is that compilers are sometimes a little overzealous with barrier instructions. Each barrier can stall the pipeline while pending memory accesses catch up, so redundant ones slow a program down compared to other architectures, especially when the barriers are unnecessary.

But, if I understand correctly (and I guess this is the question): it seems like the “soft” memory model was intended to speed up execution by handing a little more flexibility to the processor - but barrier instructions by design create sync points that can slow execution down.

So - do we have a situation where a processor adopted a memory model which was intended to make it faster, but now compilers are using barriers too much, making things slower overall - is that right?


Dave, is this the same as or different from “speculative execution”, the chip feature that caused so many security issues on other architectures?

I know I’ve seen issues in the Go issue tracker about improving barriers for performance reasons, with big speedups realized in some places. Not surprised that the JVM has the same characteristics.


I think this is exemplary (but I’m not 100% sure)

I believe this is different, as described here: Memory ordering. This allows an instruction that reads from memory to execute before an earlier instruction that writes to a different memory address, even though that instruction came first in program order, if the CPU evaluates that this would be more efficient.

As I understand it (which, to be fair, might not be very well), a single core already keeps its own thread’s results correct even when it reorders memory accesses - so within one core I’m not sure what harm the DMB or DSB barriers are preventing. The danger seems to be in what other cores and devices observe. That article I linked does point to some limitations - and maybe the barrier instructions are ways for a compiler to say “another observer depends on this ordering, so don’t touch” or “we really need this last access to finish before you continue”. Would love to know more.

@bexcran Am I even in the right neighborhood to understand what’s going on here?


I’ve come across a few articles that helped a little, but I’m still scratching the surface.

  1. Investigate best AArch64 instructions for memory fences (OpenJ9)
  2. Memory barriers in ARM64
  3. Difference between DSB, DMB and ISB Instructions

Edit: Ooh - this presentation (so far) is doing a good job of explaining weak ordering, when to use barriers (the C compiler doesn’t insert them for ordinary code, by the way - but they’re all over the JVM source code), and how to reduce your use of them safely: Arm’s Weakly-Ordered Memory Model and Barrier Requirements
