lu_zero
February 25, 2026, 7:58pm
1
I started testing a bit and I’m wondering if I’m doing something wrong:
llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 96 -p 2048 -n 256 -r 3
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | pp2048 | 125.21 ± 0.23 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | tg256 | 7.34 ± 0.00 |
quocbao
February 28, 2026, 8:58am
2
Try reducing the number of threads to around 32. You will hit a memory bottleneck if you use too many threads.
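One way to find the sweet spot is to sweep thread counts with llama-bench (a minimal sketch, reusing the model path from the post above):

```bash
# Throughput usually rises until memory bandwidth saturates, then flattens or drops.
MODEL=.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
for t in 8 16 24 32 48 64 96; do
  llama-bench -m "$MODEL" -t "$t" -p 2048 -n 256 -r 3
done
```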
1 Like
quocbao
February 28, 2026, 9:09am
3
This is my result using only 32 threads. While multiple factors influence CPU inference performance, memory bandwidth remains the most significant bottleneck.
You can also try compiling llama.cpp with a BLAS backend.
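For reference, a BLAS-enabled build would look roughly like this (a sketch assuming OpenBLAS is installed; recent llama.cpp exposes this through the GGML_BLAS CMake options):

```bash
# Configure and build llama.cpp with OpenBLAS for prompt-processing GEMMs.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```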
1 Like
dneary
March 2, 2026, 4:53pm
4
What result were you expecting? I’m not familiar with how Qwen works under the covers; I’d love to know whether this is particularly bad.
Dave.
❯ llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 32 -p 2048 -n 256 -r 1
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 89.67 ± 0.00 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.78 ± 0.00 |
build: b68a83e64 (8138)
The tg256 result is not great; getting to 10+ t/s would make it bearable.
@lu_zero How many memory sticks are you using? I would suggest populating all DDR memory slots to maximize the memory bandwidth available to the inference process.
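A quick way to check how many slots are actually populated (dmidecode typically needs root):

```bash
# Populated DIMMs report a size; empty slots show "No Module Installed".
sudo dmidecode -t memory | grep -E 'Locator|Size|Speed'
```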
That’s what I did. BLAS reduces the speed further, down to ~20 t/s for pp2048, and makes no difference for tg256.
1 Like
dneary
March 12, 2026, 9:38pm
9
Thanks @quocbao, @lu_zero. @bhakti, @jan, can either of you advise Lucas on what he should be seeing and whether there is anything he can do to improve performance?
Dave.
1 Like
@lu_zero Have you tried different ANC modes and used numactl to bind llama-bench to a specific NUMA node? I think the key issue here is finding the right balance between the number of CPU cores and the memory locality available to those cores.
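For example (a minimal sketch; the node numbers depend on the ANC mode, and `your-model.gguf` stands in for the GGUF path used above):

```bash
# Show the NUMA topology (node count, CPUs, and memory per node).
numactl --hardware

# Bind both threads and allocations to node 0, then benchmark.
numactl --cpunodebind=0 --membind=0 \
  llama-bench -m your-model.gguf -t 20 -p 2048 -n 256 -r 3
```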
1 Like
This is my test with ANC mode set to Hemisphere (two NUMA domains). By reducing threads from 32 to 20, I doubled input throughput. More tests are needed to find the best result.
Note that I use only 4 of the 8 memory slots on my Q80-30.
1 Like
Sadly, the tg256 is still very low, but the efficiency boost is huge!
❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 32
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 92.36 ± 10.54 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.04 ± 0.12 |
build: b68a83e64 (8138)
❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 40
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 40 | pp2048 | 92.58 ± 17.54 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 40 | tg256 | 7.45 ± 0.27 |
build: b68a83e64 (8138)
❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 80
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 80 | pp2048 | 165.51 ± 1.63 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 80 | tg256 | 6.99 ± 0.13 |
I tried setting it to Quadrant, and those are the results.
You have to use numactl to bind the llama-bench process to a specific NUMA node.
In the end, based on your benchmark, the command to serve the model should look like this (assuming you are running the Q80-30 in Quadrant mode):
# -tb 80: use 80 threads for the prefill (batch) phase
# -t 20: use only 20 threads for the decode phase
# --prio 3: highest scheduling priority
numactl --cpunodebind=0 --membind=0 \
llama-server \
-m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
-tb 80 \
-t 20 \
--prio 3
Use curl to test the response time:
time curl --location 'http://127.0.0.1:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Type your prompt here"
}
]
}
],
"chat_template_kwargs": {
"enable_thinking": false
}
}' -s | jq -r '.'
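To print just the generated text instead of the full JSON (assuming the standard OpenAI-compatible response shape):

```bash
# Extract only the assistant's reply from the chat completion response.
curl -s 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{"messages": [{"role": "user", "content": "Type your prompt here"}]}' \
  | jq -r '.choices[0].message.content'
```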
The system is a 128-core AmpereOne; I configured it in Quadrant mode.
Some updates from the latest and greatest llama.cpp (including the turboquant branches):
numactl --cpunodebind=0 --membind=0 llama-bench -p 2048 -n 256 -r 3 -t 32 -hf mudler/Qwen3.6-35B-A3B-APEX-GGUF
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q5_K - Medium | 23.85 GiB | 34.66 B | CPU | 32 | pp2048 | 130.84 ± 0.07 |
| qwen35moe 35B.A3B Q5_K - Medium | 23.85 GiB | 34.66 B | CPU | 32 | tg256 | 15.55 ± 0.03 |
1 Like
@lu_zero Great. You can also take a look at this one for further optimization:
(llama.cpp issue, opened 01 Apr 26 UTC; labels: enhancement, server, roadmap)
## Overview
We should now be able to add support for disaggregated prefill and decode to `llama-server`. For an explanation of what this does, see https://www.perplexity.ai/hub/blog/disaggregated-prefill-and-decode or other articles. In simple terms, prompt processing is performed on dedicated devices/machines and then the decoding step is consolidated in a dedicated device/machine. The main benefit from this is that the prompt processing tasks no longer interfere with the low-latency decoding of the sequences and can be automatically distributed to a large set of machines in the network.
The main applications are in serving many users, but the functionality also has some interesting single-user use cases. For example, owners of a Mac Studio (fast decode) + DGX Spark (fast prefill) will benefit from this, as they would be able to get the best of both worlds.
## Implementation plan
We should already have all the necessary components to support this functionality. I think the initial implementation should be completely achieved with changes to just `llama-server`.
Currently, we initialize a single `llama_model` + `llama_context` for both prefill and decode. We have to generalize this logic and construct more than one `llama_model`. Each new `llama_model` instance will be dedicated for "prefill" tasks and will use a custom list of devices. The list of `llama_model` and the respective devices are configured by the user.
The crucial point here is that the devices can be RPC devices. I.e., remote machines used for prefill will be represented as RPC devices with respective `llama_context`.
Next, we need to support splitting completions tasks into 2 stages: prefill task and generate/decode task. This separation will allow us to queue the prefill tasks to the prefill contexts. When a prefill task is ready, we submit the respective decode task to the main decode context (i.e. the one that we currently have).
When a prefill task is completed, we will store the computed memory (i.e. `llama_state_get_data()`) + prompt in the host memory prompt cache. This way the "decode" task can pick it up and continue decoding. Note that this step of storing the memory buffer to host memory will be performed over the network (for RPC devices), so fast network connections would be quite useful for these disaggregated setups.
I have also tested multi-GPU inference between an Ampere server with a Mellanox ConnectX-4 NIC and an Asus GX10. You can try this approach if you have many CPU-only machines.
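For what it's worth, the llama.cpp RPC backend is the piece that enables this kind of multi-machine setup today (a rough sketch; `worker1`/`worker2` and the port are placeholders for your own hosts):

```bash
# On each worker machine: expose the local backend over TCP.
rpc-server --host 0.0.0.0 --port 50052

# On the driver machine: treat the workers as offload devices.
# RPC devices behave like GPUs, so -ngl controls how many layers they take.
llama-server \
  -m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --rpc worker1:50052,worker2:50052 \
  -ngl 99
```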
1 Like
Once I have enough time, I'll experiment with Tenstorrent Blackhole x AmpereOne hybrid approaches. If you consider it interesting, I can post some quick benchmarks with their default settings.