lu_zero
February 25, 2026, 7:58pm
1
I started testing a bit and I’m wondering if I’m doing something wrong:
llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 96 -p 2048 -n 256 -r 3
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | pp2048 | 125.21 ± 0.23 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | tg256 | 7.34 ± 0.00 |
quocbao
February 28, 2026, 8:58am
2
Try reducing the number of threads to around 32. You will hit a memory bottleneck if you use too many threads.
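One way to find the sweet spot is to sweep thread counts with llama-bench (a minimal sketch, reusing the model path from the post above):

```bash
# Throughput usually rises until memory bandwidth saturates, then flattens or drops.
MODEL=.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
for t in 8 16 24 32 48 64 96; do
  llama-bench -m "$MODEL" -t "$t" -p 2048 -n 256 -r 3
done
```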
1 Like
quocbao
February 28, 2026, 9:09am
3
This is my result using only 32 threads. While multiple factors influence CPU inference performance, memory bandwidth remains the most significant bottleneck.
You can also try compiling llama.cpp with a BLAS backend.
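For reference, a BLAS-enabled build would look roughly like this (a sketch assuming OpenBLAS is installed; recent llama.cpp exposes this through the GGML_BLAS CMake options):

```bash
# Configure and build llama.cpp with OpenBLAS for prompt-processing GEMMs.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```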
1 Like
dneary
March 2, 2026, 4:53pm
4
What result were you expecting? I’m not familiar with how Qwen works under the covers; I’d love to know whether this is particularly bad.
Dave.
❯ llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 32 -p 2048 -n 256 -r 1
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 89.67 ± 0.00 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.78 ± 0.00 |
build: b68a83e64 (8138)
The tg256 result is not great; getting to 10+ t/s would make it bearable.
@lu_zero How many memory sticks are you using? I would suggest populating all DDR memory slots to maximize the memory bandwidth available to the inference process.
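A quick way to check how many slots are actually populated (dmidecode typically needs root):

```bash
# Populated DIMMs report a size; empty slots show "No Module Installed".
sudo dmidecode -t memory | grep -E 'Locator|Size|Speed'
```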
That’s what I did. BLAS reduces the speed further, down to ~20 t/s for pp2048, and makes no difference for tg256.
1 Like
dneary
March 12, 2026, 9:38pm
9
Thanks @quocbao, @lu_zero. @bhakti, @jan, can either of you advise Lucas on what he should be seeing and whether there is anything he can do to improve performance?
Dave.
1 Like
@lu_zero Have you tried different ANC modes and used numactl to bind llama-bench to a specific NUMA node? I think the key issue here is finding the right balance between the number of CPU cores and the memory locality available to those cores.
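For example (a minimal sketch; the node numbers depend on the ANC mode, and `your-model.gguf` stands in for the GGUF path used above):

```bash
# Show the NUMA topology (node count, CPUs, and memory per node).
numactl --hardware

# Bind both threads and allocations to node 0, then benchmark.
numactl --cpunodebind=0 --membind=0 \
  llama-bench -m your-model.gguf -t 20 -p 2048 -n 256 -r 3
```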
1 Like
This is my test with ANC mode set to Hemisphere (two NUMA domains). By reducing threads from 32 to 20, I doubled input throughput. More tests are needed to find the best result.
Note that I use only 4 of the 8 memory slots on my Q80-30.
1 Like
Sadly, the tg256 is still very low, but the efficiency boost is huge!
❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 32
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 92.36 ± 10.54 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.04 ± 0.12 |
build: b68a83e64 (8138)
❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 40
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 40 | pp2048 | 92.58 ± 17.54 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 40 | tg256 | 7.45 ± 0.27 |
build: b68a83e64 (8138)
❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 80
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 80 | pp2048 | 165.51 ± 1.63 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 80 | tg256 | 6.99 ± 0.13 |
I tried setting it to Quadrant, and those are the results.
You have to use numactl to bind the llama-bench process to a specific NUMA node.
In the end, based on your benchmark, the command to serve the model should look like this (assuming you are running the Q80-30 in Quadrant mode):
# -tb 80: use 80 threads for the prefill (batch) phase
# -t 20: use only 20 threads for the decode phase
# --prio 3: highest scheduling priority
numactl --cpunodebind=0 --membind=0 \
llama-server \
-m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
-tb 80 \
-t 20 \
--prio 3
Use curl to test the response time:
time curl --location 'http://127.0.0.1:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Type your prompt here"
}
]
}
],
"chat_template_kwargs": {
"enable_thinking": false
}
}' -s | jq -r '.'
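To print just the generated text instead of the full JSON (assuming the standard OpenAI-compatible response shape):

```bash
# Extract only the assistant's reply from the chat completion response.
curl -s 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{"messages": [{"role": "user", "content": "Type your prompt here"}]}' \
  | jq -r '.choices[0].message.content'
```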
The system is a 128-core AmpereOne; I configured it in Quadrant mode.
Some updates from the latest and greatest llama.cpp (including the turboquant branches):
numactl --cpunodebind=0 --membind=0 llama-bench -p 2048 -n 256 -r 3 -t 32 -hf mudler/Qwen3.6-35B-A3B-APEX-GGUF
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q5_K - Medium | 23.85 GiB | 34.66 B | CPU | 32 | pp2048 | 130.84 ± 0.07 |
| qwen35moe 35B.A3B Q5_K - Medium | 23.85 GiB | 34.66 B | CPU | 32 | tg256 | 15.55 ± 0.03 |
1 Like
@lu_zero Great. You can also take a look at this one for further optimization:
(llama.cpp issue, opened 01 Apr 26 UTC; labels: enhancement, server, roadmap)
## Overview
We should now be able to add support for disaggregated prefill and decode to `llama-server`. For an explanation of what this does, see https://www.perplexity.ai/hub/blog/disaggregated-prefill-and-decode or other articles. In simple terms, prompt processing is performed on dedicated devices/machines and then the decoding step is consolidated in a dedicated device/machine. The main benefit from this is that the prompt processing tasks no longer interfere with the low-latency decoding of the sequences and can be automatically distributed to a large set of machines in the network.
The main applications are in serving many users, but the functionality also has some interesting single-user use cases. For example, owners of a Mac Studio (fast decode) + DGX Spark (fast prefill) will benefit from this, as they would be able to get the best of both worlds.
## Implementation plan
We should already have all the necessary components to support this functionality. I think the initial implementation should be completely achieved with changes to just `llama-server`.
Currently, we initialize a single `llama_model` + `llama_context` for both prefill and decode. We have to generalize this logic and construct more than one `llama_model`. Each new `llama_model` instance will be dedicated for "prefill" tasks and will use a custom list of devices. The list of `llama_model` and the respective devices are configured by the user.
The crucial point here is that the devices can be RPC devices. I.e., remote machines used for prefill will be represented as RPC devices with respective `llama_context`.
Next, we need to support splitting completions tasks into 2 stages: prefill task and generate/decode task. This separation will allow us to queue the prefill tasks to the prefill contexts. When a prefill task is ready, we submit the respective decode task to the main decode context (i.e. the one that we currently have).
When a prefill task is completed, we will store the computed memory (i.e. `llama_state_get_data()`) + prompt in the host memory prompt cache. This way the "decode" task can pick it up and continue decoding. Note that this step of storing the memory buffer to host memory will be performed over the network (for RPC devices), so fast network connections would be quite useful for these disaggregated setups.
I have also tested multi-GPU inference between an Ampere server with a Mellanox ConnectX-4 NIC and an Asus GX10. You can try this approach if you have many CPU-only machines.
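For what it's worth, the llama.cpp RPC backend is the piece that enables this kind of multi-machine setup today (a rough sketch; `worker1`/`worker2` and the port are placeholders for your own hosts):

```bash
# On each worker machine: expose the local backend over TCP.
rpc-server --host 0.0.0.0 --port 50052

# On the driver machine: treat the workers as offload devices.
# RPC devices behave like GPUs, so -ngl controls how many layers they take.
llama-server \
  -m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --rpc worker1:50052,worker2:50052 \
  -ngl 99
```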
1 Like
Once I have enough time, I'll experiment with Tenstorrent Blackhole x AmpereOne hybrid approaches. If you consider it interesting, I can post some quick benchmarks with their default settings.