Moved Our Inference Workload to Ampere and the Power Efficiency Numbers Are Hard to Argue With

SaraWill · June 12, 2026, 3:15am

I want to share this as an honest account rather than a promotional post. I was genuinely skeptical going into this migration, and I think the skepticism was reasonable given how much Nvidia has dominated the conversation around AI inference hardware for the past few years.

We run a modest but growing machine learning inference setup for a B2B SaaS product that does document classification and entity extraction for enterprise clients. For the first two years we ran everything on x86-based instances. Performance was acceptable, but the power consumption and associated costs were becoming harder to justify as we scaled.

About four months ago we decided to test migrating a portion of our inference workload onto Ampere Altra-based instances, after a colleague at another company mentioned that the performance-per-watt difference was significant enough to meaningfully affect their infrastructure budget.

The migration itself was less painful than I expected. Most of our PyTorch-based inference code ran without modification, and the ARM compatibility concerns I had going in turned out to be largely a non-issue for our specific stack. The container images needed rebuilding for ARM64, which took a day of work, but nothing beyond that was a serious obstacle.
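
For anyone rebuilding images the same way, a quick check that the rebuilt container really runs on ARM64 rather than under emulation can save some confusion. This is a minimal sketch using only the Python standard library; the helper name `is_arm64` and the set of accepted machine strings are assumptions for illustration, not part of any Ampere or PyTorch tooling.

```python
import platform

# Machine strings commonly reported for 64-bit ARM (assumed list, extend as needed).
ARM64_NAMES = {"aarch64", "arm64"}


def is_arm64(machine=None):
    """Return True if the given (or current) machine string names a 64-bit ARM CPU."""
    name = (machine or platform.machine()).lower()
    return name in ARM64_NAMES


if __name__ == "__main__":
    # Run inside the rebuilt container as a startup sanity check.
    print(f"machine={platform.machine()} arm64={is_arm64()}")
```

Calling `is_arm64()` at service startup and logging the result is usually enough to catch an image that was accidentally built or pulled for the wrong architecture.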

Where things got interesting was in the actual power consumption metrics after running both environments in parallel for six weeks. The efficiency difference was meaningful enough that we are now actively planning to migrate the remaining workload over the next quarter.
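
For readers who want to run a similar side-by-side comparison, one common way to express efficiency is inferences per joule, that is, throughput divided by average power draw. The sketch below assumes you already have a request count, a wall-clock duration, and an average wattage for each environment; the function names and the numbers in the example are hypothetical placeholders, not measurements from this migration.

```python
def inferences_per_joule(inferences, seconds, avg_watts):
    """Throughput (inferences/s) divided by average power (W) gives inferences per joule."""
    if seconds <= 0 or avg_watts <= 0:
        raise ValueError("seconds and avg_watts must be positive")
    throughput = inferences / seconds
    return throughput / avg_watts


def efficiency_ratio(candidate, baseline):
    """How many times more inferences per joule the candidate delivers than the baseline."""
    return inferences_per_joule(**candidate) / inferences_per_joule(**baseline)


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute your own measurements.
    baseline = {"inferences": 1000, "seconds": 10, "avg_watts": 50}
    candidate = {"inferences": 1000, "seconds": 10, "avg_watts": 25}
    print(f"efficiency ratio: {efficiency_ratio(candidate, baseline):.2f}x")
```

Averaging power over the full test window, rather than sampling peak draw, keeps idle periods in the picture, which matters for workloads with bursty traffic.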

The one area where I have questions for the community is memory bandwidth utilization. Our models are not particularly large, but they are memory-access heavy by nature, and I want to make sure we are configuring the instances correctly to take full advantage of what the platform offers rather than leaving performance on the table.
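
As a starting point for anyone comparing configurations, a rough before-and-after bandwidth number can help show whether a change had any effect. The sketch below is a crude single-threaded copy test in plain Python, with a hypothetical helper name `copy_bandwidth_gbps`. It will not saturate a multi-channel memory system and is no substitute for a proper benchmark such as STREAM, so treat the result only as a relative indicator on the same instance.

```python
import time


def copy_bandwidth_gbps(size_mb=256, repeats=5):
    """Best-of-N throughput of a single-threaded in-memory copy, in GB/s."""
    if size_mb < 1 or repeats < 1:
        raise ValueError("size_mb and repeats must be at least 1")
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one read pass plus one write pass over the buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        del dst
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    moved = 2 * len(src)  # count bytes read and bytes written
    return moved / best / 1e9


if __name__ == "__main__":
    print(f"approx copy bandwidth: {copy_bandwidth_gbps():.1f} GB/s")
```

Taking the best of several runs reduces noise from other tenants and from page faults on the first touch of the destination buffer.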

Has anyone done detailed memory bandwidth optimization work on Ampere Altra instances for inference workloads and found configuration changes that made a measurable difference? I have also been looking at on-premises Ampere-based hardware options on platforms like eBay, Amazona and Etech Devices for a potential hybrid setup down the road.

vikingforties · June 12, 2026, 2:16pm

Hi Sara, just checking you’re making use of the AIO software acceleration layer as well?
