Fitting devstral-2, olmo-3 on AmpereOne/AltraMax

There are a number of completely or nearly completely open models that are very interesting, but they are only documented to run well on GPUs.

Could somebody with access to current Ampere hardware try them, so there is an additional point of comparison?

Can you clarify which specific models and performance metrics (latency, throughput, correctness, …) you are interested in?

I guess we can start with:

  • which runtimes can run them, e.g. llama.cpp, vLLM, candle-vllm, mistral-rs
    • whether they would fit in 2 GB per core or need 4 GB per core (a rough estimator is sketched below)
  • whether they are fast enough compared to the suggested GPU setup (see the throughput probe at the end)
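
For the per-core memory question, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the parameter counts (~24B for devstral-2, ~32B for olmo-3 are guesses), the effective bits-per-weight of the llama.cpp quantization formats are approximate, and the flat KV-cache/runtime overhead is made up. The authoritative number is the size of the actual GGUF file plus the KV cache for your context length.

```python
# Rough memory-fit estimator for GGUF-quantized models on a many-core
# CPU box. All constants below are assumptions; check real file sizes.

BITS_PER_WEIGHT = {   # approximate effective bits for llama.cpp quants
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}

def weights_gib(params_billions: float, quant: str) -> float:
    """Approximate size of the weight tensors in GiB."""
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 2**30

def fits(params_billions: float, quant: str, cores: int,
         gib_per_core: float, kv_overhead_gib: float = 8.0) -> bool:
    """Do weights plus an assumed KV-cache/runtime overhead fit the budget?"""
    budget = cores * gib_per_core
    return weights_gib(params_billions, quant) + kv_overhead_gib <= budget

# Hypothetical parameter counts; substitute the real ones.
for name, params in [("devstral-2 (assumed ~24B)", 24),
                     ("olmo-3 (assumed ~32B)", 32)]:
    for quant in ("Q4_K_M", "Q8_0", "F16"):
        print(f"{name} @ {quant}: ~{weights_gib(params, quant):.0f} GiB weights, "
              f"fits 128 cores x 2 GiB: {fits(params, quant, 128, 2.0)}")
```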
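And for the speed question, a minimal throughput probe, assuming the llama-cpp-python bindings as the runtime (any of the runtimes above could be measured the same way). The model filename is hypothetical and the thread count is a placeholder to tune per machine:

```python
# Minimal tokens/second probe via llama-cpp-python.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./devstral-2-q4_k_m.gguf",  # hypothetical filename
    n_ctx=4096,
    n_threads=128,  # e.g. all AltraMax cores; tune per machine
)

prompt = "Write a function that reverses a linked list."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```

Dividing weight size by core count only tells you whether a model fits; decode speed on CPU is typically memory-bandwidth bound, so a measured tokens/s figure like the one above is what should be compared against the GPU numbers.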