Qwen3.5-35B-A3B benchmarks on AmpereOne

I started testing a bit and I’m wondering if I’m doing something wrong:

llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 96 -p 2048 -n 256 -r 3

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | pp2048 | 125.21 ± 0.23 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 96 | tg256 | 7.34 ± 0.00 |

Try reducing the number of threads to around 32. You will hit a memory bottleneck if you use too many threads.
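llama-bench accepts a comma-separated list for -t, so you can sweep several thread counts in one run:

llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 16,24,32,48,64 -p 2048 -n 256 -r 3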


This is my result using only 32 threads. While multiple factors influence CPU inference performance, memory bandwidth remains the most significant bottleneck.
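As a rough sanity check on that claim: decode speed on a bandwidth-bound model is capped at roughly effective bandwidth divided by bytes read per token. A back-of-envelope (all three numbers below are assumptions, not measurements):

awk 'BEGIN {
  bw = 100e9      # assumed effective memory bandwidth in B/s: measure your own
  active = 3e9    # ~3B active parameters per token for an A3B MoE (assumption)
  bpp = 0.6       # ~4.8 bits per weight for Q4_K-class quants (approximation)
  printf "decode ceiling ~ %.1f t/s\n", bw / (active * bpp)
}'

If your measured tg is far below that ceiling, something other than raw bandwidth (NUMA locality, thread contention) is likely in the way.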

You can also try compiling llama.cpp with a BLAS backend.
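For example (assuming a recent llama.cpp checkout with the CMake build and OpenBLAS installed):

cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j

Keep in mind BLAS mainly helps prompt processing; token generation is typically unaffected.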


What result were you expecting? I’m not familiar with how Qwen works under the covers, so I’d love to know whether this is particularly bad.

Dave.

❯ llama-bench -m .cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -t 32 -p 2048 -n 256 -r 1

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 89.67 ± 0.00 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.78 ± 0.00 |

build: b68a83e64 (8138)

The tg256 is not great; getting to 10+ t/s would make it bearable.

@lu_zero How many memory sticks did you use? I would suggest populating all DDR memory slots to maximize the memory bandwidth available to the inference process.
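For example, something like this lists the populated DIMM slots and their speeds (needs root; exact field names vary by firmware):

sudo dmidecode -t memory | grep -E 'Size|Speed'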

That’s what I did. BLAS reduces the speed further, down to ~20 t/s for pp2048, and makes no difference for tg256.


Thanks @quocbao, @lu_zero. @bhakti, @jan, can either of you advise Lucas on what he should be seeing and whether there is anything he can do to improve performance?

Dave.


@lu_zero Have you tried different ANC modes and used numactl to bind llama-bench to a specific NUMA node? I think the key issue here is finding the right balance between the number of CPU cores and the memory locality available to those cores.
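To see how many NUMA nodes your current ANC mode exposes, and how much memory sits behind each one:

numactl --hardware
lscpu | grep -i numa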


This is my test with the ANC mode set to Hemisphere (two NUMA domains). By reducing threads from 32 to 20, I have doubled input throughput. More tests are needed to find the best result.

Note that I use just 4 out of the 8 memory slots on my Q80-30.


The tg256 is still very low, sadly, but the efficiency boost is huge!

❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 32
| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 92.36 ± 10.54 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 7.04 ± 0.12 |

build: b68a83e64 (8138)

❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 40
| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 40 | pp2048 | 92.58 ± 17.54 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 40 | tg256 | 7.45 ± 0.27 |

build: b68a83e64 (8138)

❯ llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 80
| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 80 | pp2048 | 165.51 ± 1.63 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 80 | tg256 | 6.99 ± 0.13 |

I tried setting it to Quadrant, and those are the results.

You have to use numactl to bind the llama-bench process to a specific NUMA node.

In the end, based on your benchmark, the command to serve the model should look like this (assuming you are running a Q80-30 in Quadrant mode):

# -tb: use 80 threads for the prefill (batch) phase
# -t: use only 20 threads for the decode phase
# --prio 3: highest thread priority
# -c replaces the llama-bench-only -p/-n/-r flags; --port matches the curl example below
numactl --cpunodebind=0 --membind=0 \
llama-server \
   -m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
   -c 2048 \
   --port 8000 \
   -tb 80 \
   -t 20 \
   --prio 3

Use curl to test the response time:

time curl --location 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Type your prompt here"
          }
        ]
      }
    ],
    "chat_template_kwargs": {
      "enable_thinking": false
    }
  }' -s | jq -r '.choices[0].message.content'

The system is a 128-core AmpereOne; I configured it in Quadrant mode.

numactl --cpunodebind=0 --membind=0 llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 2048 -n 256 -r 3 -t 32

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | pp2048 | 104.65 ± 0.02 |
| qwen35moe ?B Q4_K - Medium | 18.32 GiB | 34.66 B | CPU | 32 | tg256 | 9.05 ± 0.00 |

Some updates from the latest and greatest llama.cpp (including the turboquant branches):

numactl --cpunodebind=0 --membind=0 llama-bench -p 2048 -n 256 -r 3 -t 32 -hf mudler/Qwen3.6-35B-A3B-APEX-GGUF

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| qwen35moe 35B.A3B Q5_K - Medium | 23.85 GiB | 34.66 B | CPU | 32 | pp2048 | 130.84 ± 0.07 |
| qwen35moe 35B.A3B Q5_K - Medium | 23.85 GiB | 34.66 B | CPU | 32 | tg256 | 15.55 ± 0.03 |

@lu_zero Great. You can also take a look at this one for further optimization :smiley:

I have also tested multi-GPU inference between an Ampere server with a Mellanox ConnectX-4 NIC and an Asus GX10. You can try this approach if you have many CPU-only machines.
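If you want to try something similar with llama.cpp, its RPC backend (tools/rpc in the repo) is one way to split a model across machines. A minimal sketch, assuming both sides were built with -DGGML_RPC=ON; the hostnames and model path are placeholders:

# on each worker machine: expose the local backend over TCP
rpc-server -p 50052

# on the driving machine: offload layers to the workers
llama-cli -m model.gguf --rpc worker1:50052,worker2:50052 -ngl 99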


Once I have enough time I’ll experiment with Tenstorrent Blackhole x AmpereOne hybrid approaches. If that sounds interesting, I can post some quick benchmarks on their default sets.