64K Memory page sizes - any experiences to share?

Hi all,

I’m writing a fairly brief overview of the differences between 4K and 64K kernel page sizes - what a bigger kernel page size means, what types of applications benefit from larger contiguous chunks of memory, and how to use larger page sizes on various distributions.

I’d love to hear feedback from anyone who has used larger page sizes, and whether they saw significant performance benefits for things like running VMs, databases, or AI inference workloads when running with 64K kernels - and whether there were potential downsides (like, say, software assuming that kernel page sizes were 4K) that you encountered.

I would also love to know how to configure compute nodes in hosted Kubernetes offerings to boot a custom kernel (or do some operating system configuration with something like cloud-init), and how to tag those nodes for specific K8s workloads, if anyone knows how to do that.

I’ll link to the tutorial when it’s published, and hopefully some nuggets from this topic will end up there!

Thanks,
Dave.

6 Likes

Sharing some of the code tidbits - thank you @Karsten for the following suggested perf recipe to see whether your TLB miss cost would merit changing the kernel page size:

# perf list | grep end_tlb  
stall_backend_tlb
stall_frontend_tlb

With kernel support confirmed, the pipeline stalls due to TLB misses can be measured:

# perf stat -e instructions,cycles,stall_frontend_tlb,stall_backend_tlb ./a.out 
time for 12344321 * 100M nops: 3.7 s 
 
Performance counter stats for './a.out': 

12,648,071,049 instructions # 1.14 insn per cycle  
11,109,161,102 cycles  
1,482,795,078 stall_frontend_tlb  
1,334,751 stall_backend_tlb  
 
3.706937365 seconds time elapsed 
3.629966000 seconds user 
0.000995000 seconds sys 

The ratio (stall_frontend_tlb + stall_backend_tlb)/cycles is an upper bound on the fraction of runtime that could be saved by using larger memory pages.
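For example, plugging in the numbers from the sample run above gives an upper bound of roughly 13% of cycles:

$ awk 'BEGIN { printf "%.1f%%\n", 100 * (1482795078 + 1334751) / 11109161102 }'
13.4%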

I also share how to install and boot 64K kernels for RHEL, Oracle Linux, and Ubuntu (I’d love to share instructions for SLE too, but could not find a resource on it).

To use a kernel with 64KB pages on Red Hat Enterprise Linux 9 (a quick sanity check is sketched after the steps):

  1. Install the kernel-64k package: dnf -y install kernel-64k

  2. To make the 64K kernel the default at boot time:

    k=$(echo /boot/vmlinuz*64k)
    grubby --set-default=$k \
    --update-kernel=$k \
    --args="crashkernel=2G-:640M"

To boot a 64KB kernel on Ubuntu 22.04:

  1. Install from the arm64+largemem ISO, which contains the 64K kernel by default, or:

  2. Install the linux-generic-64k package (sudo apt install linux-generic-64k), which will add a 64K kernel option to the boot menu

  3. You can set the 64K kernel as the default boot option by updating the grub2 boot menu with the command:
    echo "GRUB_FLAVOUR_ORDER=generic-64k" | sudo tee /etc/default/grub.d/local-order.cfg

For 64KB pages on Oracle Linux:

  1. Install the kernel-uek64k package:
    sudo dnf install -y kernel-uek64k

  2. Set the 64K kernel as the default at boot time:
    sudo grubby --set-default=$(echo /boot/vmlinuz*64k)

  3. After rebooting the system, you can verify that you are running the 64K kernel using getconf.

Finally, to confirm that you’re running a 64K kernel, you can run getconf PAGESIZE:

$ getconf PAGESIZE 
65536
2 Likes

I’ve noticed Clang performs faster: I managed to go from 30 minutes to 22 minutes in LLVM compile times on a Q64-22. I don’t know if it’d be useful, but feel free to reference my blog post on 16k vs 4k page sizes. Theoretically, it should scale up fine, and a lot of the concepts I brought up still apply regardless of the actual size. Maybe I should do a refreshed post on 64k page sizes.

On NixOS, all I had to do was apply this to my configuration:

  boot.kernelPatches = [
    {
      name = "perf";
      patch = null;
      extraConfig = ''
        ARM64_64K_PAGES y
        HZ_100 y
      '';
    }
  ];
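From there (assuming an otherwise standard setup), rebuilding and rebooting should get you onto the 64K kernel, which you can confirm with getconf as described above:

  sudo nixos-rebuild boot    # build the patched kernel and make it the default for the next boot
  sudo reboot
  getconf PAGESIZE           # should report 65536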
2 Likes

64k pages help to the extent that you need truly random access to large amounts of data. This is not always the case: most workloads have a significant sequential component. But repeated sequential access by enough readers starts to look random, so it’s usually at least somewhat true. Some back of the envelope math (there’s a quick awk sketch of it after the list):

  • On an Ampere Altra, you have 64k of L1 data per core, and a page table entry is 8 bytes. Therefore, if your (truly random) working set is 32M, the page table entries for your working set have exhausted L1d, and you can expect that every fresh page access is going to go out to L2.
  • At 64k pages, a 32M working set needs only 4k of L1 for its PTEs. Your working set would need to grow to 512M to exhaust L1.
  • Coincidentally, on an Altra, you have 1M of L2, which means that at 4k pages a 512M working set’s PTEs would exhaust L2, and you could expect that every fresh page access would go all the way out to RAM just to load the page table entry (and then probably again for the page itself).
  • At 64k pages your working set would need to hit 8G before you would burn all of your L2 on PTEs.
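A quick sketch of that arithmetic (the cache sizes and 8-byte PTE are the Altra numbers from the list above):

awk 'BEGIN {
  l1d = 64*1024; l2 = 1024*1024; pte = 8                  # Altra: 64K L1d per core, 1M L2, 8-byte PTEs
  printf "4K pages:  PTEs fill L1d once the working set hits %d MB\n", (l1d/pte)*4*1024/(1024*1024)
  printf "4K pages:  PTEs fill L2 once the working set hits %d MB\n",  (l2/pte)*4*1024/(1024*1024)
  printf "64K pages: PTEs fill L1d once the working set hits %d MB\n", (l1d/pte)*64*1024/(1024*1024)
  printf "64K pages: PTEs fill L2 once the working set hits %d GB\n",  (l2/pte)*64*1024/(1024*1024*1024)
}'
4K pages:  PTEs fill L1d once the working set hits 32 MB
4K pages:  PTEs fill L2 once the working set hits 512 MB
64K pages: PTEs fill L1d once the working set hits 512 MB
64K pages: PTEs fill L2 once the working set hits 8 GB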

This gets especially brutal if your random-access workload involves HMM access to a PCIe device. At 4k you can spend >90% of your time servicing page faults, and at 64k you can set record-setting benchmarks.

The downside is that on machines with “small” amounts of memory, 64k starts to get claustrophobic. 4G is a nice round number, and at 4G you have 64k pages to work with. A few dozen processes, each with a dozen-plus memory maps just for their own text and their dependent shared libraries, means each process can have maybe a hundred MB of private data before you cannot keep every process wired in RAM. And the “machine” here doesn’t need to be bare metal; it could be the cgroup limit for a container or the reservation for a virtual machine. That last point can be especially fun if you’re trying to map host resources from a 4G guest that needs to be 4k in order to have more pages and you request an alignment the host can’t satisfy…

In my opinion 16k would be the best general-purpose choice if more aarch64 systems supported it: it mitigates the page-thrashing problem while still scaling down to small environments. Since they don’t, I would generally prefer 64k for almost any system with >8G RAM, and almost regardless of workload once you hit 32G.

3 Likes

Yeah, I wonder how much it helps with LLMs. LLMs can be quite large, so it certainly could decrease load times by a good amount. You’re still limited by the speed of the storage device and the filesystem, though.

Personally, I’ve seen it help with C++ or large C code bases since some source files can get quite large. Not sure how much it helps with parsing and lexing.

Why only 4K and 64K? There is 16K in between…

4K is useful/needed if you want to handle x86(-64) emulation. The Asahi folks handle that with a micro-VM instance to get it working on a 16K system (something like that).

HPC folks usually demand a 64K page size because for their workloads it makes enough of a difference. But I cannot quote numbers.

I certainly talk about 16K as a page size, but as most distros only package 4K by default and 64K as an option, that would require a kernel rebuild for most people - and as you say, you see the biggest differences at the extremes.

1 Like

Thanks for the link to your blog post! Very useful information.

1 Like

4k offers the most compatibility because paging commonly came about with the i386 processor, which worked off a 4k page size and supported nothing else. 40 years later, most people still use a 4k page size.

My post and @nwnk’s sort of go into this. The idea is that memory is provided in blocks called pages, and theoretically you get more performance by allocating fewer pages. By increasing the page size, you fit proportionally more bytes into each page without allocating more pages. So something that is 8GB in size takes roughly 2 million 4k pages, but only about 131,072 64k pages. Because the kernel is responsible for setting up the page mapping and allocating the pages, it takes time for the kernel to hand each page to userspace. The kernel also has to communicate this with the MMU so the pages are correctly invalidated and mapped for address translation.

1 Like

Yes, and this is where Ampere really shines. Even when a kernel rebuild is required, most systems take a long time to build a kernel; on Ampere, I’ve been able to compile a kernel in under 15 minutes. That is still with a lot of things that I will likely never use. If I were to strip down the kernel, it would be a lot smaller and faster to build. The Asahi Linux kernel, which runs on my laptop, is a stripped-down kernel and takes about 8 minutes to build.

Looking at the LLVM build times I got with a 64k page size, it is an 8 minute difference. The time it takes an Ampere system to rebuild a kernel is well worth the potential performance gains.

1 Like

With a few tweaks, our Linux team got a full kernel compile on a 128-core Ampere Altra Max system down to 5’15"! If I recall correctly, the biggest changes were:

  • Set the kernel jiffy interval as large as possible (HZ=100)
  • Using a 64K page size was fastest, but 16K provided a little less variability (I didn’t really understand what the issue was with HZ=100 and 64K pages, but there was one)
  • A good fan reduced the average temperature and shaved some more time
  • Booting the system with NUMA mode mono, not quad or hemi, shaved a few more seconds
  • make -j128 was optimal - any more than 128 and you start to see more context switching

Some lessons learned in the process included:

  • Make sure ccache is off if you want to get real “from scratch” numbers
  • Use ministat and multiple runs to detect and reduce statistical variance (see the sketch after this list)
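For context, here is a rough sketch of the kind of measurement loop involved (defconfig, GNU time, and five runs are my assumptions, not necessarily what the team used):

    export CCACHE_DISABLE=1                 # lesson 1: no ccache if you want real "from scratch" numbers
    for i in 1 2 3 4 5; do
        make mrproper && make defconfig
        /usr/bin/time -f '%e' -a -o times.txt make -j128 > /dev/null
    done
    ministat times.txt                      # lesson 2: judge the spread across runs, not a single number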

And I believe that this particular kernel developer was very happy with the result.

1 Like

It might be good to find out what the exact issue is.

Huh, what do the NUMA nodes look like then? I use hemi and I get 4 nodes with 32 cores per node. Performance seems good with hemi.

My understanding of the NUMA BIOS setting is that you can choose one (mono), two (hemi), or four (quad) NUMA nodes per socket. So with hemi you should have two. lscpu should show how many NUMA nodes there are, and which cores are in which nodes. Do you have a two-socket Q64 or a single M128?
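For reference, on a single-socket 128-core part in hemisphere mode you would expect something like this (illustrative output, not captured from a real system):

$ lscpu | grep -i numa
NUMA node(s):        2
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127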

1 Like

Single socket M128-26.

Interesting! And you’re sure you’re in hemi mode, not quad?

Yes, I set it to hemi.

@jan suggests that it is beneficial for some LLM cases, and shouldn’t harm the others, so you can’t go wrong with 64k.

2 Likes

It would be interesting to know if HZ=1000 and some/mostly nohz_full CPUs might perform closer to HZ=100. I’ve long suspected that the dynamic tick code could be smarter such that HZ=100 would be rarely necessary.
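For anyone wanting to experiment, the relevant boot parameters look something like the line below (the CPU ranges are arbitrary examples of mine; HZ itself is a build-time option, not a boot parameter):

nohz_full=8-127 rcu_nocbs=8-127    # keep CPUs 0-7 ticking for housekeeping, run the rest (mostly) tickless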

1 Like

Using Ubuntu 22.04 / 24.04 on a 128-30, I have noticed that:

  1. Idle memory usage was 6-10GB higher
  2. Python, PyTorch, and Pandas were performing at least 30% faster
  3. The workstation was very stable in every task
  4. Compiling with Clang, NVCC, ninja, and bazel was ~20% faster
  5. Pip install of packages was slightly faster

If you have any specific questions about it, please feel free to ask.

2 Likes
