lu_zero
February 25, 2026, 7:58pm
I started testing a bit and I’m wondering if I’m doing something wrong:
```
llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 96 -p 2048 -n 256 -r 3
```

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | pp2048 | 125.21 ± 0.23 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | tg256 | 7.34 ± 0.00 |
quocbao
February 28, 2026, 8:58am
Try reducing the number of threads to around 32. You will hit a memory bottleneck if you use too many threads.
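The intuition behind the thread advice can be sketched with a simple roofline estimate: token generation is memory-bound, so once the available bandwidth is saturated, extra threads only add contention. The numbers below (active parameter count, bits per weight, sustained bandwidth) are illustrative assumptions, not measurements from this system:

```python
# Rough roofline sketch for decode speed on a memory-bound MoE model.
# Assumption: each generated token must stream the active expert weights
# from RAM once, so tokens/s <= bandwidth / active_bytes_per_token.

def max_tokens_per_sec(bandwidth_gib_s: float, active_weights_gib: float) -> float:
    """Upper bound on tokens/s if every token reads the active weights once."""
    return bandwidth_gib_s / active_weights_gib

# Hypothetical figures: ~3B active params ("A3B") at ~4.5 bits/weight for Q4_K.
active_gib = 3.0e9 * 4.5 / 8 / 2**30   # ~1.57 GiB touched per token

# Hypothetical sustained bandwidth of 25 GiB/s for a single socket.
print(round(max_tokens_per_sec(25.0, active_gib), 1))
```

Under these assumed numbers the ceiling is ~16 t/s; adding threads beyond the point where bandwidth is saturated cannot raise it.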
quocbao
February 28, 2026, 9:09am
This is my result using only 32 threads. While multiple factors influence CPU inference performance, memory bandwidth remains the most significant bottleneck.
You can also try compiling llama.cpp with a BLAS backend.
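For reference, a BLAS-enabled build would look roughly like this, using llama.cpp's CMake options with OpenBLAS picked here as one example vendor (adjust the vendor to whatever BLAS library is installed on your system):

```shell
# Configure llama.cpp with a BLAS backend (OpenBLAS chosen as an example).
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```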
dneary
March 2, 2026, 4:53pm
What result were you expecting? I’m not familiar with how Qwen works under the covers; I’d love to know whether this is particularly bad.
Dave.
```
❯ llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 32 -p 2048 -n 256 -r 1
```

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 89.67 ± 0.00 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.78 ± 0.00 |
build: b68a83e64 (8138)
The tg256 number is not great; getting to 10+ t/s would make it bearable.
@lu_zero How many memory sticks did you use? I would suggest populating all DDR memory slots to maximize the memory bandwidth available to the inference process.
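To see why populating every slot matters, theoretical peak DRAM bandwidth scales linearly with the number of active channels. A small sketch, using illustrative DDR5-4800 figures (your platform's channel count and transfer rate may differ):

```python
# Theoretical peak DRAM bandwidth:
#   channels * transfer rate (MT/s) * 8 bytes per 64-bit transfer.
# Illustrative numbers only; check your actual channel count and DIMM speed.

def peak_bandwidth_gb_s(channels: int, mt_per_s: int) -> float:
    """Peak bandwidth in GB/s for a given channel count and transfer rate."""
    return channels * mt_per_s * 8 / 1000

print(peak_bandwidth_gb_s(2, 4800))  # dual-channel DDR5-4800 -> 76.8 GB/s
print(peak_bandwidth_gb_s(8, 4800))  # eight channels (server socket) -> 307.2 GB/s
```

An eight-channel server socket has 4x the peak of a typical dual-channel desktop, which translates fairly directly into tg throughput for a memory-bound workload.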
That’s what I did. BLAS reduces the speed further, down to ~20 t/s for pp2048, and makes no difference for tg256.