64K Memory page sizes - any experiences to share?

I saw a ~10% speedup running the single-threaded version of the gzip test from Cloudflare’s cf_benchmark (github.com/cloudflare/cf_benchmark) on Ubuntu 24.04, comparing the 6.8 64K kernel against the 4K kernel, and bigger speedups running the multi-threaded version.

I found that using jemalloc provided a speedup as well; it can be tested like this:

Build the latest jemalloc version:
git clone https://github.com/jemalloc/jemalloc.git; cd jemalloc; ./autogen.sh; make -j; make install
# run using LD_PRELOAD to force loading jemalloc in place of the system malloc
export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
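
A couple of quick sanity checks I’d add (not part of the original recipe) to confirm the preload is actually taking effect:

# jemalloc should appear in a preloaded process's memory map
LD_PRELOAD=/usr/local/lib/libjemalloc.so.2 sh -c 'grep jemalloc /proc/$$/maps'
# jemalloc reads the MALLOC_CONF environment variable; stats_print:true dumps allocator stats at exit
LD_PRELOAD=/usr/local/lib/libjemalloc.so.2 MALLOC_CONF=stats_print:true ls >/dev/null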

6 Likes

jemalloc is an interesting case! By default, until recently, it had hard-coded 4K page alignment into the code. They recently changed the project to support 16K and 64K page size by default. I’m not aware of any list of projects susceptible to issues with larger page sizes, or any tools you can run across a dependency tree to see if there are alignment issues, but I would love to know if one exists!
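
For what it’s worth, one rough heuristic (a sketch, not a proper dependency-tree tool) is to check the alignment of ELF LOAD segments: an executable or shared object linked with a max page size smaller than the running kernel’s page size may fail to load. It only catches link-time alignment, not page-size assumptions baked into the code itself (like jemalloc’s), and /usr/bin/gzip is just an arbitrary example target:

# Print LOAD segment alignment for a binary and the libraries it links against;
# anything smaller than `getconf PAGESIZE` is suspect on that kernel.
for f in /usr/bin/gzip $(ldd /usr/bin/gzip | awk '/\//{print $(NF-1)}'); do
    echo "== $f"
    readelf -lW "$f" | awk '/LOAD/{print $NF}' | sort -u
done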

1 Like

Have you run the tests? In my experience, I’ve seen it break with a 64k page size. In Nix, we notably have NixOS/nixpkgs issue #348660 (Build failure: pkgsLLVM.jemalloc) open for this problem but haven’t spent the time to fix it.

The fix is to build the jemalloc package for NixOS with ./configure --with-lg-page=16. If the patch I submitted is accepted, this will be the default from now on.
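
Outside of Nix, the equivalent manual build is roughly this (a sketch; --with-lg-page takes the base-2 log of the page size, so 16 means jemalloc assumes 64 KiB pages, which as I understand it must be at least as large as the kernel’s page size):

./autogen.sh
./configure --with-lg-page=16   # 2^16 = 64 KiB
make -j
make check                      # run the test suite; the failures discussed below show up here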

1 Like

Yeah, that’s what we currently do. However, the tests fail, which is what I’m wondering about: jemalloc/jemalloc issue #2408 (5.3.0 segfault in test/unit/psset on aarch64).

1 Like

Oh sorry, I spoke too soon.

1 Like

Looks like jemalloc doesn’t want to support unexpectedly large page sizes :smiley:

/*
 * Used to validate that the hugepage size is not unexpectedly high.  The huge
 * page features (HPA, metadata_thp) are primarily designed with a 2M THP size
 * in mind.  Much larger sizes are not tested and likely to cause issues such as
 * bad fragmentation or simply broken.
 */
#define HUGEPAGE_MAX_EXPECTED_SIZE ((size_t)(16U << 20))
hpa_hugepage_size_exceeds_limit(void) {
        return HUGEPAGE > HUGEPAGE_MAX_EXPECTED_SIZE;
}
        /* As mentioned in pages.h, do not support If HUGEPAGE is too large. */
        if (hpa_hugepage_size_exceeds_limit()) {
                return false;
        }

I also found an email from the jemalloc author:

I think the most promising approach is to leave jemalloc’s notion of page size at 4 KiB, set the chunk size to be at least as large as the huge page size, and disable dirty page purging. This allows the huge pages to be carved up with 4 KiB granularity for small/large allocations, and assures that chunks comprise distinct sets of huge pages. Dirty page purging would be at best a waste of time in this set up, probably with no effect.

2 Likes

I may have discovered another piece of software which breaks with a 64k page size. Originally, I thought it was an issue with my toolchain being LLVM-based. However, GCC toolchain builds fail as well. The package is cryptsetup, and one of its unit tests fails. In particular, compat-test fails with this error:

CASE: Image in file tests (root capabilities not required)
[1] format
Device wipe error, offset 4096.
Cannot wipe header on device luks-test.
FAILED backtrace:
239 ./compat-test

Building the package is pretty much standard autotools, and the tests can be run with make check. Alternatively, Nix is the easiest way to reproduce it: nix build --extra-experimental-features 'nix-command flakes' github:NixOS/nixpkgs#cryptsetup --rebuild.
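
For anyone reproducing it outside of Nix, the flow is roughly the standard autotools loop (a sketch; the upstream URL is from memory, so double-check it):

getconf PAGESIZE      # confirm the kernel is actually using 64K (65536) pages first
git clone https://gitlab.com/cryptsetup/cryptsetup.git
cd cryptsetup
./autogen.sh && ./configure
make -j
make check            # compat-test runs as part of this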

Since I see this test failure on an Ampere Altra Max M128-26 with a 64k page size, I’d like to ask others with Ampere hardware, running both with and without the 64k page size, whether they can reproduce it.

1 Like

Ouch! Unfortunately that doesn’t help when the kernel granule is larger than 4K and your minimum page allocation is 4K (of course jemalloc could probably allocate 4K chunks within those larger pages rather than say “page size not supported”, but I imagine that’s more work to support).

On a plain vanilla Rocky Linux installation:

  • Linux BigAarch64 5.14.0-503.38.1.el9_5.aarch64 #1 SMP PREEMPT_DYNAMIC Wed Apr 16 14:18:07 EDT 2025 aarch64 aarch64 aarch64 GNU/Linux
  • 2 Q80-30 processor system with 128GB DDR4 RAM and 4.2 TB nvme disk
  • Linux stable v6.12.28 full build after make mrproper
  • time make -j$(nproc) 2>errs.txt

|real|4m35.197s|
|user|360m5.658s|
|sys|127m32.236s|

I’m happy with that sort of performance. Since building and testing kernels is a big part of my work, I don’t see any reason to switch to a larger page size. But obviously YMMV depending on what work is being done.

3 Likes


Nice! I am curious whether you are building completely from scratch, whether it would be faster with a 64K kernel and HZ=100, and whether ccache was on for your 4’35" build time.

Dave.

I’ll give that a try and get back to you. I have to reboot the machine to get 64K pages, it’ll be an interesting test.

Oh, yes, it is after a ‘make mrproper’ - I presume that is what you mean by building from scratch.

1 Like

My colleague was seeing much shorter build times than he expected, and eventually tracked it down to ccache being enabled by default on Fedora - so the C compiler was caching results. I don’t know too much about it, though.
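
For anyone who wants to check whether the same thing is happening on their machine, a few quick commands (Fedora paths shown; other distros may differ):

type -a gcc               # if /usr/lib64/ccache/gcc is listed first, ccache is wrapping the compiler
ccache -s                 # show cache hit/miss statistics
export CCACHE_DISABLE=1   # bypass the cache for the current shell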

OK, I have some results.

CONFIG_HZ_100=y

[g.v.rose@BigAarch64 linux-6.12.28]$ uname -a
Linux BigAarch64 5.14.0-503.40.1.el9_5.aarch64+64k #1 SMP PREEMPT_DYNAMIC Wed Apr 30 16:08:38 EDT 2025 aarch64 aarch64 aarch64 GNU/Linux
[g.v.rose@BigAarch64 linux-6.12.28]$ make mrproper
[g.v.rose@BigAarch64 linux-6.12.28]$ mv ../tmp-config .config
[g.v.rose@BigAarch64 linux-6.12.28]$ make olddefconfig
[g.v.rose@BigAarch64 linux-6.12.28]$ time make -j$(nproc) 2>errs.txt

|real|3m55.455s|
|user|404m42.988s|
|sys|37m13.343s|

That is with ccache on, which it usually would be for obvious reasons.

An improvement of approximately 40 seconds. Nice!

Let’s turn ccache off.
[g.v.rose@BigAarch64 linux-6.12.28]$ export CCACHE_DISABLE=1
[g.v.rose@BigAarch64 linux-6.12.28]$ make mrproper
[g.v.rose@BigAarch64 linux-6.12.28]$ mv ../tmp-config .config
[g.v.rose@BigAarch64 linux-6.12.28]$ make olddefconfig
[g.v.rose@BigAarch64 linux-6.12.28]$ time make -j$(nproc) 2>errs.txt
|real|3m55.252s|
|user|406m0.810s|
|sys|37m15.422s|

Disabling ccache seems to be a no-op - pretty much the same result.

btop reports about 41 GiB of memory usage during the kernel build, which seems a bit high. I forgot to record memory usage with 4K pages, so I’ll switch back to 4K pages, rerun the test, and get back with memory usage in that scenario.
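
If it helps for the re-run, one rough way to log system memory use over the course of a build (just a sketch):

# Sample /proc/meminfo every 10 seconds while the build runs, then stop the sampler
while sleep 10; do grep -E 'MemTotal|MemAvailable' /proc/meminfo; done > mem.log &
time make -j$(nproc) 2>errs.txt
kill %1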

Interesting stuff tho - thanks for the fun times!

2 Likes

My 64K page size article just dropped! Unfortunately not all of the nuggets from this thread made it in - I’m already planning a follow-up!

Understanding memory page sizes on Arm64

2 Likes

Great article! Thank you.

To follow up, I retried with a 4K page size and found that btop reports about 20 GiB of memory usage, roughly half that of the 64K page size scenario. That’s to be expected - no surprises there.

What I found interesting was the difference in sys time - approximately 37m in the 64K page size test vs. 127m in the 4K page size test.

Now that’s something you can sink your teeth into! Less sys time = more user time, and for many applications that’s going to be very important.

Memory is cheap these days; users getting more time for their application is a big deal, especially in HPC.

1 Like

That’s awesome - would it be possible to add instructions for NixOS? heh

1 Like

If you can share what they would be, happy to share them somewhere!

1 Like

The instructions are:

Edit the system configuration; the default path is /etc/nixos/configuration.nix. Add these lines (lib needs to be available in the module arguments, e.g. a header of { config, pkgs, lib, ... }:):

boot.kernelPatches = [
  {
    name = "page-size";
    patch = null;
    extraStructuredConfig.ARM64_64K_PAGES = lib.kernel.yes;
  }
];

Then run sudo nixos-rebuild switch; the system will rebuild the kernel, and you just need to reboot and make sure the new NixOS generation entry is selected in the bootloader.
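
After rebooting into the new generation, the page size can be confirmed with:

getconf PAGESIZE    # should report 65536 with ARM64_64K_PAGES enabled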

Replying to myself here - 16K page size results. I had to build this kernel myself so I went ahead and used a recent longterm stable kernel, 6.12.28.

[g.v.rose@BigAarch64 linux-6.12.28]$ getconf PAGESIZE
16384

Linux BigAarch64 6.12.28-stable #2 SMP PREEMPT_DYNAMIC Sat May 10 11:12:24 PDT 2025 aarch64 aarch64 aarch64 GNU/Linux

Memory usage was only 1 or 2 GiB over 4K pagesize as reported by btop - that’s good.

It’s faster than the 64K page size by 10 seconds or so, and CPU sys time was still much better than with the 4K page size.

|real|3m44.161s|
|user|374m59.935s|
|sys|47m28.384s|

Impressive! It’s not quite an apples-to-apples comparison because I used the 6.12.28 kernel instead of the 5.14.0 RL 9 kernel as in the previous tests. But I’ll take it! I think I’ll run this kernel for a while.

1 Like