Generative AI: why CPU inference is growing

Victor Jakubiuk (@victorj44), Ampere’s Head of AI, was interviewed by LeMagIT of France after the keynotes at KubeCon. The article is fairly long and goes into some detail about when to use GPUs versus general-purpose CPUs.

It is in French, but Google did a pretty good job of translating it.


I can confirm llamafile working on Ampere Q64-30. It’s a cross-architecture, single-file executable with a choice of models. It’s all getting too easy now! I’ll be looking at how LlamaIndex could be used next.
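For anyone curious how you talk to a running llamafile: it embeds the llama.cpp HTTP server, which by default listens on localhost:8080 and accepts completion requests at `/completion`. Here is a minimal Python sketch of building such a request (the port, endpoint, and prompt are assumptions — check your llamafile’s `--help` output):

```python
import json
from urllib import request

# Assumed default address of a locally running llamafile's embedded server.
LLAMAFILE_URL = "http://localhost:8080/completion"

def build_request(prompt: str, n_predict: int = 64) -> request.Request:
    """Build a POST request for the llama.cpp-style /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return request.Request(
        LLAMAFILE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Why might CPU inference make sense? Answer briefly:")
# With a llamafile actually running, you would then do:
#   with request.urlopen(req) as resp:
#       print(json.loads(resp.read())["content"])
```

The nice part is there is nothing architecture-specific here at all — the same client code works whether the llamafile is running on x86 or on an Ampere Arm box.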


Learning about AI inference has been fun recently for me!

There are a bunch of trade-offs when using GPU instances for inference:

  • First, if you’re doing something like a recommendation engine connected to an eCommerce site, your traffic will be “bursty” - you will have busy and quiet periods - and you will need to over-provision relative to peak demand. According to one person I spoke to in Paris, their GPUs usually run at about 30% of capacity because their workloads do not fill the GPUs; they are heavily over-provisioned.
  • Next, there are a bunch of “pipeline preparation” steps you can do - the CNCF project Alluxio and other data orchestration projects aim to stage data so that GPUs can consume it more effectively.
  • Third, the accuracy of the model may not require 32-bit floating-point calculations - often, inference can produce perfectly acceptable results with FP16 or BF16 instead, which are computationally much less intensive.
  • Finally: GPUs are typically not multi-tenant (although Nvidia is working on GPU sharing mechanisms). When you get a GPU instance, you get the whole GPU - they’re expensive! For the money you spend on an under-utilized GPU instance, you can provision Ampere cores (or any other CPU cores) dynamically, follow changes in demand with less over-provisioning headroom, and handle the same traffic with a few cores, for much less money.
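The precision point in the third bullet is easy to make concrete: Python’s `struct` module supports IEEE 754 half precision (format `"e"`), so you can see both the storage saving and the precision actually given up. (This sketch shows FP16 only; BF16 has a different bit layout and isn’t in `struct`.)

```python
import struct

value = 0.1

# 32-bit float: 4 bytes per parameter.
fp32_bytes = struct.pack("f", value)
# 16-bit half precision: 2 bytes per parameter -- half the memory and
# memory bandwidth, which is often what inference is actually bound by.
fp16_bytes = struct.pack("e", value)

# Round-tripping through FP16 shows the precision that is lost.
fp16_value = struct.unpack("e", fp16_bytes)[0]

print(len(fp32_bytes), len(fp16_bytes))  # 4 2
print(fp16_value)                        # ~0.09997..., close enough for many models
```

The round-tripped value differs from 0.1 only in the fourth decimal place - for a lot of inference workloads, errors of that size simply don’t change the output.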

I would love to learn more! But understanding the cost dynamics around GPUs has really confirmed to me that unless you are running a batch-type workload, or a workload with predictable load, GPU instances will probably be more expensive, at the same SLO, than scaling out general-purpose Ampere cores (or some other general-purpose CPU).
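As a back-of-the-envelope sketch of that cost argument - all prices here are hypothetical placeholders, and the ~30% utilization figure is the one from the conversation above:

```python
# Hypothetical hourly prices -- substitute your cloud provider's real rates.
GPU_INSTANCE_PER_HOUR = 4.00   # one fixed-size GPU instance
CPU_CORE_PER_HOUR = 0.04       # one general-purpose (e.g. Ampere) core

def gpu_cost(hours: float) -> float:
    """A dedicated GPU instance bills for every hour, used or idle."""
    return GPU_INSTANCE_PER_HOUR * hours

def cpu_cost(core_hours_needed: float, headroom: float = 0.2) -> float:
    """Autoscaled cores track demand, plus a small over-provisioning buffer."""
    return CPU_CORE_PER_HOUR * core_hours_needed * (1 + headroom)

# A bursty week: the GPU sits at ~30% utilization, but you pay for all of it.
hours_in_week = 168
gpu = gpu_cost(hours_in_week)

# Suppose the same traffic averages out to ~1000 CPU core-hours for the week,
# scaled up and down with demand (peak cores high, off-peak cores low).
cpu = cpu_cost(core_hours_needed=1000)

print(f"GPU: ${gpu:.2f}")  # $672.00 -- idle capacity included
print(f"CPU: ${cpu:.2f}")  # $48.00  -- pay roughly for what you use
```

The absolute numbers are made up, but the shape of the argument isn’t: the GPU bill is driven by provisioned hours, while the autoscaled-CPU bill is driven by consumed core-hours plus headroom - so the burstier and less utilized the workload, the wider the gap.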

Incidentally, for anyone who has not seen these videos yet, 3blue1brown has a primer series on machine learning and has released two videos explaining the underlying mechanisms of GPT and other LLMs at a high level, to demystify the “magic” behind them:

The entire video series is spectacular!
