Qwen3.5-35B-A3B benchmarks on AmpereOne

I started testing a bit and I’m wondering if I’m doing something wrong:

llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 96 -p 2048 -n 256 -r 3

| model | size | params | backend | threads | test | t/s |
| -------------------------- | --------: | ------: | ------- | ------: | -----: | ------------: |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | pp2048 | 125.21 ± 0.23 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | tg256 | 7.34 ± 0.00 |

Try reducing the number of threads to around 32. You will hit a memory bottleneck if you use too many threads.
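To find the sweet spot, llama-bench can sweep several thread counts in one run: `-t` accepts a comma-separated list. A sketch, reusing the model path from the post above:

```shell
# Sweep thread counts in a single llama-bench run; each value in the
# comma-separated -t list produces its own pp2048/tg256 result row.
llama-bench \
  -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -t 16,32,48,64,96 \
  -p 2048 -n 256 -r 3
```

On bandwidth-bound machines the tg numbers typically plateau or regress well before the core count is exhausted.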


This is my result using only 32 threads. While multiple factors influence CPU inference performance, memory bandwidth remains the most significant bottleneck.
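A rough way to sanity-check whether token generation is bandwidth-bound: each generated token has to stream roughly the active weights through memory once, so effective bandwidth divided by active bytes per token gives an upper bound on tg t/s. A sketch, where the ~3B active parameters, ~4.5 bits/weight for Q4_K, and the 100 GB/s bandwidth figure are all assumptions for illustration, not measurements from this thread:

```python
# Upper bound on token-generation speed if memory bandwidth is the limit.
# active_params: weights touched per token (~3e9 for an A3B MoE, assumed)
# bits_per_weight: ~4.5 for Q4_K quantization (assumed)
# bandwidth_gb_s: effective memory bandwidth; substitute a measured value
def tg_upper_bound(active_params, bits_per_weight, bandwidth_gb_s):
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Example: 3B active weights at 4.5 bits, 100 GB/s effective bandwidth.
print(round(tg_upper_bound(3e9, 4.5, 100), 1))  # prints 59.3
```

Working it backwards: at ~7.5 t/s with ~1.7 GB of active weights per token, the effective bandwidth being used is only on the order of 13 GB/s, which would suggest a lot of headroom is being left on the table.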

You can also try compiling llama.cpp with a BLAS backend.
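A build sketch for that, assuming OpenBLAS is installed; the vendor flag should match whatever BLAS implementation you actually have:

```shell
# Configure llama.cpp with a BLAS backend (OpenBLAS here), then build.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```

Note that BLAS mainly affects prompt processing (large batched matmuls); single-token generation is usually unchanged.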


What result were you expecting? I’m not familiar with how Qwen works under the covers, I’d love to know whether this is particularly bad.

Dave.

❯ llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 32 -p 2048 -n 256 -r 1

| model | size | params | backend | threads | test | t/s |
| -------------------------- | --------: | ------: | ------- | ------: | -----: | -----------: |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 89.67 ± 0.00 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.78 ± 0.00 |

build: b68a83e64 (8138)

The tg256 result is not great; getting to 10+ t/s would make it bearable.

@lu_zero How many memory sticks did you use? I would suggest populating all DDR memory slots to maximize the memory bandwidth available to the inference process.
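One way to check how many slots are actually populated without opening the case is via the DMI tables (requires root); empty slots report "No Module Installed", so counting `Size:` lines with a number gives the populated count:

```shell
# Count populated DIMM slots; empty slots show "Size: No Module Installed"
# in the type-17 (Memory Device) DMI entries.
sudo dmidecode -t memory | grep -c "Size: [0-9]"
```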

That’s what I did. BLAS actually reduces the speed further, down to ~20 t/s for pp2048, and makes no difference for tg256.
