I remember early in my career, working for the Retail group of a large company, when I started learning about what the “best” price for a product was. The answer was always, “It depends. What is your goal? Sell more? Sell more of a different product (many products are linked together)? Help out your supplier?” And there are many, many other trade-offs. IoT has the same issue with real time. Do you need the temperature once a minute? Once a second? What about speed? You can get the speed of a vehicle 2,000 times a second. But do you need it? Can the mobile network handle that much data? And do you want to pay for it?
I think of all of this when I see all of the hype in AI/ML that is currently happening. “We need super-fast GPUs to do AI.” Do you need them for everything? They are a lot more expensive and use a lot more energy.
Tony Rigoni wrote a blog post about this and he gives you three things to think about when deploying AI:
1. Deploy only the amount of compute you need to meet the performance requirements of your application, and use general-purpose rather than specialized processors as broadly as possible to maintain flexibility for future compute needs.
2. Switch CPU-only AI inferencing from legacy x86 processors to Cloud Native Processors. With that performance, you may be able to deploy CPU-only for a wider range of AI workloads than you could with legacy x86 processors.
3. Combine GPUs with power-efficient processors for heavier AI training or LLM inferencing workloads.
Basically, use the more efficient processors where you can and save specialized processors for when you need them.
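One way to read those three points is as a simple routing rule. Here is a purely illustrative Python sketch; the workload names, thresholds, and instance labels are my own invention and are not from Tony's post:

```python
# Toy heuristic mirroring the three recommendations above.
# All workload names and labels are invented for illustration only.
def pick_compute(workload: str, latency_sensitive: bool = False) -> str:
    """Prefer general-purpose CPUs; reserve specialized hardware for the heavy stuff."""
    if workload in {"training", "llm-inference"} and latency_sensitive:
        # Recommendation 3: pair GPUs with power-efficient host CPUs.
        return "GPU + power-efficient CPU host"
    if workload == "inference":
        # Recommendation 2: CPU-only inferencing on Cloud Native Processors.
        return "Cloud Native (Arm) CPU, CPU-only"
    # Recommendation 1: right-size with general-purpose compute by default.
    return "general-purpose CPU"


for w in ("inference", "training", "batch-analytics"):
    print(w, "->", pick_compute(w, latency_sensitive=True))
```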
Check out his full post here:
And BTW, while there might not be a “best price”, there is a worst price: $1.06, followed by anything else ending in .06. You will sell a lot more at 99 cents, and you won’t sell less at $1.09.
GPUs, and specifically NVIDIA GPUs, are indeed the only viable option for most AI training workloads, which is troubling for users given supply shortages amid huge demand and the fact that a monopoly gets to dictate prices.
The current virality of AI unfortunately means the discourse is often conducted without reliance on hard facts. While it’s impossible to state exactly what proportion of AI inference is run on particular types of hardware, various independent consulting agencies have estimated that roughly 55-65% of all AI deployments run CPU-only, with the rest split between GPUs, FPGAs, and ASICs.
CPUs provide sufficient performance for the majority of AI inference workloads, are the most cost-efficient solution, and, as general-purpose compute, can handle other workloads beyond AI. Cloud customers already report underutilization of compute resources as the main issue with their deployments; underutilizing expensive GPU instances dedicated virtually exclusively to AI tasks exacerbates that problem and drives many to switch to on-premise deployments, contributing to the first year ever in which the growth rate of cloud workloads slowed.
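To make the CPU-only case concrete, here is a minimal sketch of CPU-only inference with post-training dynamic quantization in PyTorch. The model, batch size, and thread count are illustrative assumptions, not anything measured for this post:

```python
import torch
import torch.nn as nn

# Stand-in model: a small classifier head of the sort often served behind an API.
model = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Quantize the Linear layers to int8 weights; on commodity CPUs this typically
# shrinks the model and speeds up the matrix multiplies.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Cap the thread count so the instance can still handle non-AI work alongside this.
torch.set_num_threads(4)

with torch.inference_mode():
    batch = torch.randn(32, 768)   # fake batch of embeddings
    scores = quantized(batch)      # runs entirely on the CPU
    print(scores.argmax(dim=1))
```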
For the inference workloads that do require the higher parallel computing power currently available only from GPUs, model optimization and advances in alternative hardware are the most likely route to ensuring that all developers can access the resources they need to deploy their AI applications. For the moment, this issue particularly affects generative AI workloads, which demand very high compute for any latency-sensitive deployment. However, model optimization efforts have already resulted in the newest iteration of the LLaMa model showing promising performance on CPU-only hardware, and CPU-only deployments of such optimized models have for a while now outperformed their GPU counterparts.
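As one example of what that looks like in practice, the sketch below runs a quantized LLaMA-family model CPU-only via the llama-cpp-python bindings. The model path, quantization level, and thread count are assumptions; any GGUF checkpoint you have locally can be swapped in:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-7b.Q4_K_M.gguf",  # hypothetical local path to a quantized model
    n_ctx=2048,      # context window
    n_threads=8,     # match the physical cores of the CPU instance
)

out = llm(
    "In one sentence, why can CPU-only inference be cost-effective?",
    max_tokens=64,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```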