Trade-offs between L1 I-cache size and branch misprediction recovery

Hi all, I wanted to share something I learned from the Hot Chips presentation by Matt Erler that Patrick Kennedy covered on Serve The Home - along with a question I have about one detail.

Take a look at the “AmpereOne core pipeline: Fetch, Decode, and Rename” slide below, and notice two details:

  1. AmpereOne has a branch misprediction recovery time of 10 cycles (which, at 3.2 GHz, is about 3ns) - which is exceptional in the industry
  2. AmpereOne cores have 16 KB of L1 Instruction cache (I-cache) per core - which is less than competing cores (Ampere Altra and Altra Max had a more typical 64 KB I-cache per core)

As I understand it: branch prediction is the processor “guessing” what the most likely next few instructions are and executing them speculatively (so that if the guess is right, the result is available immediately). This Wikipedia page describes how the instruction pipeline has a few stages (fetch, decode, execute, write-back) that can operate in parallel.
[Figure: 4-stage instruction pipeline diagram (Pipeline,_4_stage.svg, Wikipedia)]

At conditional branches, the processor can guess which path is more likely and start fetching and decoding instructions from that location before the condition has been evaluated. A successful prediction keeps the CPU chugging along; a failed prediction means some already-completed work gets discarded, and the pipeline needs to reset, fetch from the correct path, and “catch up” to where it was before. That cost is the branch misprediction recovery time. What I am unclear on is how the smaller L1 I-cache helps shorten that recovery time.
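To make the idea concrete, here is a hedged sketch in C. It is an illustration of the concept, not anything from the presentation: a data-dependent branch gives the predictor something to guess (and to mispredict), while an equivalent branchless form typically compiles to a conditional select, leaving nothing to mispredict.

```c
#include <assert.h>

/* Illustrative example: a data-dependent conditional branch. If `x` is
 * usually below `hi`, the predictor learns to guess "not taken" and the
 * pipeline keeps streaming speculatively fetched instructions; when `x`
 * crosses the threshold, the guess is wrong and the core pays the
 * misprediction recovery penalty (~10 cycles on AmpereOne, per the talk). */
int clamp_branchy(int x, int hi)
{
    if (x > hi)       /* conditional branch: predicted, executed speculatively */
        return hi;    /* "taken" path */
    return x;         /* "not taken" (fall-through) path */
}

/* A branchless alternative: compilers usually lower this ternary to a
 * conditional-move/select instruction (csel on ARM, cmov on x86), so
 * there is no branch to predict or to recover from. */
int clamp_branchless(int x, int hi)
{
    return x > hi ? hi : x;
}
```

Both functions compute the same result; the difference only shows up in how the pipeline behaves when the condition is unpredictable.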

In addition, I wondered whether the smaller L1 I-cache would cause a performance hit: with a smaller cache, you would expect more I-cache misses (each of which forces the cache line to be fetched from L2). But what Matt shared is that for compute-heavy cloud workloads, the miss rate is only a little higher with a 16 KB vs. a 64 KB L1 I-cache. And by focusing on high bandwidth and low latency between the L1 and L2 caches, the core ends up faster overall than it would be with a larger L1 I-cache.
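The trade-off in that paragraph can be sketched with the standard average-memory-access-time formula, AMAT = hit_time + miss_rate × miss_penalty. All of the cycle counts and miss rates below are made-up assumptions for illustration, not Ampere's published figures; the point is only that a slightly higher miss rate can be a net win if the smaller cache has a lower hit latency and a faster path to L2.

```c
#include <assert.h>

/* Average memory access time, in cycles:
 *   AMAT = L1 hit time + L1 miss rate * L2 access latency
 * (Simplified single-level-miss model; ignores L2 misses to DRAM.) */
double amat(double l1_hit_cycles, double l1_miss_rate, double l2_latency_cycles)
{
    return l1_hit_cycles + l1_miss_rate * l2_latency_cycles;
}

/* Hypothetical numbers, purely for illustration:
 *   big L1:   64 KB, 3-cycle hit, 2% miss rate, 14-cycle L2
 *   small L1: 16 KB, 2-cycle hit, 3% miss rate, 10-cycle (faster) L2 path
 * The small configuration wins despite the higher miss rate. */
```

Under these assumed numbers, `amat(2.0, 0.03, 10.0)` comes out lower than `amat(3.0, 0.02, 14.0)`, which is the shape of the argument Matt made.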

The questions I have from this are:

  • How does a smaller L1 I-cache help shorten branch misprediction recovery time?
  • Why doesn’t this impact cache hit rate more than it does?
  • Is this something user-space developers need to be aware of? Maybe compiler developers?

By the way, in some recent tests of mine with Nginx + Lua, the Q64-22 had a much lower cache miss rate than the Ryzen 5965WX, and the Q64-22’s L1/L2 caches are double the size of the 5965WX’s.


GCC has a builtin (__builtin_expect) that can be used to tell it which branch is more likely to be taken, so it can optimize the layout of the code. In the Linux kernel it’s wrapped in the likely/unlikely macros.
On x86 there are also branch hint prefixes which tell the CPU itself which branch is more likely to be taken, but I’m not aware of anything similar for ARM.
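For anyone who hasn’t seen them, here is a minimal sketch of the kernel-style wrappers and how they get used. The `process` function is a hypothetical example of mine, not kernel code; in the kernel the real macros live in include/linux/compiler.h.

```c
#include <stddef.h>
#include <assert.h>

/* Kernel-style wrappers around GCC's builtin. The !!(x) normalizes any
 * truthy value to exactly 1 so it can be compared against the expected
 * constant. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: GCC can lay out the error path out of line,
 * keeping the hot path as a straight fall-through that the fetch unit
 * streams without taken branches. */
int process(const char *buf, size_t len)
{
    if (unlikely(buf == NULL))   /* cold error path */
        return -1;
    return (int)len;             /* hot path, laid out fall-through */
}
```

The hints only affect code layout and static prediction; a modern dynamic branch predictor will still learn the actual behavior at runtime.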

From Other Builtins (Using the GNU Compiler Collection (GCC))

Built-in Function: long __builtin_expect (long exp, long c)

You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.

The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c. For example:

   if (__builtin_expect (x, 0))
     foo ();

indicates that we do not expect to call foo, since we expect x to be zero. Since you are limited to integral expressions for exp, you should use constructions such as

   if (__builtin_expect (ptr != NULL, 1))
     foo (*ptr);

when testing pointer or floating-point values.

For the purposes of branch prediction optimizations, the probability that a __builtin_expect expression is true is controlled by GCC’s builtin-expect-probability parameter, which defaults to 90%.

You can also use __builtin_expect_with_probability to explicitly assign a probability value to individual expressions. If the built-in is used in a loop construct, the provided probability will influence the expected number of iterations made by loop optimizations.

Built-in Function: long __builtin_expect_with_probability (long exp, long c, double probability)

This function has the same semantics as __builtin_expect, but the caller provides the expected probability that exp == c. The last argument, probability, is a floating-point value in the range 0.0 to 1.0, inclusive. The probability argument must be a constant floating-point expression.
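To round out the quoted docs, here is a small usage sketch of __builtin_expect_with_probability (available in GCC 9 and later). The function and the 0.01 probability are my own illustrative assumptions: the point is that you can assert a stronger belief than the 90% default that plain __builtin_expect implies.

```c
#include <assert.h>

/* Hypothetical example: we claim the divide-by-zero check succeeds only
 * 1% of the time, i.e. we are 99% sure of taking the division path.
 * GCC uses this probability when laying out the branches and when
 * estimating block frequencies for optimization. */
int checked_div(int a, int b)
{
    if (__builtin_expect_with_probability(b == 0, 1, 0.01))
        return 0;    /* cold path: almost never taken */
    return a / b;    /* hot path */
}
```

As with __builtin_expect, profile-guided feedback (-fprofile-arcs / -fprofile-use) is generally preferable when you can collect it, since measured probabilities beat guessed ones.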