Generative AI: why CPU inference is growing

Victor Jakubiuk (@victorj44), Ampere’s Head of AI, was interviewed by LeMagIT of France after the keynotes at KubeCon. The article is fairly long and goes into some detail about when to use GPUs versus general-purpose CPUs.

It is in French, but Google did a pretty good job of translating it.


I can confirm llamafile works on an Ampere Q64-30: a cross-architecture, single-file executable with a choice of models. It’s all getting too easy now! I’ll be looking at how LlamaIndex could be used next. https://github.com/Mozilla-Ocho/llamafile
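
Not from the post above, but for anyone trying the same thing: once a llamafile is running, it serves a local llama.cpp server with an OpenAI-compatible endpoint, by default on port 8080. Here is a minimal sketch in Python that assumes that default port; the model name is just a placeholder, since the server answers with whichever model the llamafile bundles.

```python
# Minimal sketch: query a locally running llamafile over its OpenAI-compatible API.
# Assumes the llamafile server is listening on its default port (8080).
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # placeholder; the server uses its bundled model
        "messages": [
            {"role": "user", "content": "Why is CPU inference growing?"}
        ],
        "temperature": 0.7,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```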


Learning about AI inference has been fun for me recently!

There are a bunch of trade-offs when using GPU instances for inference:

  • First, if you’re doing something like a recommendation engine connected to an eCommerce site, your traffic will be “bursty” - you will have busy and quiet periods - and you will need to over-provision relative to current demand. According to one person I spoke to in Paris, their GPUs usually run at about 30% of capacity because their workloads do not fill them; they are heavily over-provisioned.
  • Next, there are a number of “pipeline preparation” steps you can take - the CNCF project Alluxio and other data orchestration projects aim to prepare data so that a GPU can consume it more efficiently.
  • Third, the accuracy of the model may not require 32-bit floating point calculations - often, inference gives perfectly acceptable results with FP16 or BF16 instead, which are computationally much less intensive (see the sketch after this list).
  • Finally: GPUs are typically not multi-tenant (although Nvidia is working on GPU-sharing mechanisms). When you get a GPU instance, you get the whole GPU - and they’re expensive! For the money you spend on an under-utilized GPU instance, you can provision Ampere cores (or any other CPU cores) dynamically, follow changes in demand with less over-provisioning head-room, and handle the same traffic with a few cores for much less money.

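The third point is easy to try for yourself. Here is a minimal sketch with PyTorch (not from the thread; the layer sizes and input are made up for illustration) that runs the same layers in FP32 and then BF16 on the CPU and compares the outputs:

```python
# Minimal sketch: reduced-precision inference on CPU with PyTorch.
# The model and sizes are stand-ins; the point is that running the same
# computation in bfloat16 instead of float32 roughly halves memory traffic
# and is usually "close enough" for inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
model.eval()

x = torch.randn(8, 1024)

with torch.inference_mode():
    out_fp32 = model(x)                      # 32-bit baseline
    model_bf16 = model.to(torch.bfloat16)    # cast weights to BF16
    out_bf16 = model_bf16(x.to(torch.bfloat16))

# The difference between the two is the accuracy trade-off described above.
print("max abs difference:", (out_fp32 - out_bf16.float()).abs().max().item())
```
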
I would love to learn more! But understanding the cost dynamics around GPUs has really confirmed for me that, unless you are running a batch-type workload or one with a more predictable load, GPU instances will probably be more expensive, at the same SLO, than scaling out general-purpose Ampere cores (or some other general-purpose CPU).

Incidentally, for anyone who has not seen these videos yet: 3blue1brown has a primer series on machine learning and has released two videos explaining the underlying mechanisms of GPT and other LLMs at a high level, to demystify the “magic” behind them:

The entire video series is spectacular!
