I want to share this as an honest account rather than a promotional post. I was genuinely skeptical going into this migration, and I think that skepticism was reasonable given how thoroughly Nvidia has dominated the conversation around AI inference hardware for the past few years.
We run a modest but growing machine learning inference setup for a B2B SaaS product that does document classification and entity extraction for enterprise clients. For the first two years we ran everything on x86-based instances. Performance was acceptable, but the power consumption and associated costs were becoming harder to justify as we scaled.
About four months ago we decided to test migrating a portion of our inference workload onto Ampere Altra-based instances, after a colleague at another company mentioned that the performance-per-watt difference was large enough to meaningfully affect their infrastructure budget.
The migration itself was less painful than I expected. Most of our PyTorch-based inference code ran without modification, and the ARM compatibility concerns I had going in turned out to be largely a non-issue for our specific stack. The container images needed rebuilding for ARM64, which took a day of work, but nothing beyond that was a serious obstacle.
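For anyone doing the same rebuild, a quick sanity check along these lines can confirm that a container is actually running a native ARM64 image rather than an x86 image under emulation. This is a minimal sketch using only the Python standard library, not the exact check we ran; the helper name `is_arm64` is just illustrative.

```python
import platform

# Architecture strings commonly reported for 64-bit ARM
# ("aarch64" on Linux, "arm64" on macOS).
ARM64_NAMES = {"aarch64", "arm64"}


def is_arm64(machine=None):
    """Return True if the given (or current) machine string is 64-bit ARM.

    `machine` is an optional override so the check can be exercised
    on any host; by default it uses platform.machine().
    """
    name = machine if machine is not None else platform.machine()
    return name.lower() in ARM64_NAMES


if __name__ == "__main__":
    # Run inside the rebuilt container to see what it reports.
    print(f"machine={platform.machine()} arm64={is_arm64()}")
```

Running it as a script inside the image prints the reported architecture, which is an easy thing to wire into a CI smoke test.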
Where things got interesting was in the actual power consumption metrics after running both environments in parallel for six weeks. The efficiency difference was meaningful enough that we are now actively planning to migrate the remaining workload over the next quarter.
The one area where I have questions for the community is memory bandwidth utilization. Our models are not particularly large, but they are memory-access heavy by nature, and I want to make sure we are configuring the instances correctly to take full advantage of what the platform offers rather than leaving performance on the table.
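To make comparisons easier, here is a rough sketch of the kind of baseline probe I have in mind: it times a large in-memory copy and reports an approximate bandwidth figure. It is standard-library Python only, the function name and default sizes are arbitrary choices of mine, and because it runs on a single thread it will not saturate all memory channels, so treat the result as a relative baseline between configurations and not as a peak number.

```python
import time


def measure_copy_bandwidth(size_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate memory copy bandwidth in GB/s (single thread, rough).

    Copies a buffer of `size_bytes` several times and keeps the fastest
    run. Each copy reads and writes `size_bytes`, so traffic is counted
    as 2 * size_bytes.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: a plain memory copy
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from the timer
    return (2 * size_bytes) / best / 1e9


if __name__ == "__main__":
    print(f"approx copy bandwidth: {measure_copy_bandwidth():.1f} GB/s")
```

Use a buffer much larger than the last-level cache so the copy actually goes to DRAM, and run it before and after any configuration change to see whether the number moves.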
Has anyone done detailed memory bandwidth optimization work on Ampere Altra instances for inference workloads and found configuration changes that made a measurable difference? We have also been looking at on-premise Ampere-based hardware options on platforms like eBay, Amazona and Etech Devices for a potential hybrid setup down the road.