I have been experimenting with ONNX Runtime in C and C++, as well as NCNN, using a YOLOv12 image model trained in PyTorch.
One thing I am noticing is that the out-of-the-box optimizations don't quite work for inference on Ampere CPUs. For example, say you have a 1920x1080 image and your model is trained on 640x640 images. You can either:
A) Scale your input down and letterbox it, losing detection resolution
B) Enable dynamic input tensors and letterbox the larger input tensor to a 1:1 aspect ratio
C) Slice the larger image into 640x640 windows and run inference on them either in a loop or as a batch
Focusing on option C…
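For the slicing itself, something like this is enough (a sketch using OpenCV's cv::Mat; it assumes the frame is at least 640 px on each side, and edge tiles are clamped to the border so they overlap their neighbors instead of being padded):

```cpp
#include <algorithm>
#include <vector>
#include <opencv2/core.hpp>

// Slice a frame into tile x tile windows. Tiles that would run past the
// right/bottom edge are clamped back inside the frame, so they overlap the
// previous tile rather than requiring padding.
std::vector<cv::Mat> slice_into_tiles(const cv::Mat& frame, int tile = 640) {
    std::vector<cv::Mat> tiles;
    for (int y = 0; y < frame.rows; y += tile) {
        for (int x = 0; x < frame.cols; x += tile) {
            const int x0 = std::min(x, frame.cols - tile);
            const int y0 = std::min(y, frame.rows - tile);
            tiles.push_back(frame(cv::Rect(x0, y0, tile, tile)).clone());
        }
    }
    return tiles;
}
```

Detections from each tile then need their coordinates shifted by that tile's origin before merging them back into full-frame coordinates.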
Ampere CPUs have many cores.
I am noticing that ONNX Runtime's SetIntraOpNumThreads API doesn't scale indefinitely.
Batching seems to rely heavily on SIMD rather than on threading.
If I batch a group of 640x640 tiles making up a larger image, I get about a 20% performance boost over feeding them in one at a time.
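For reference, the batched path just packs the tiles into a single [N, 3, 640, 640] tensor. A sketch, assuming the model was exported with a dynamic batch dimension and the tiles are already preprocessed into CHW float buffers:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cstring>
#include <vector>

// Pack N preprocessed 640x640 tiles (CHW float, already normalized) into one
// [N, 3, 640, 640] input tensor. The caller owns the batch buffer, which must
// outlive the returned Ort::Value.
Ort::Value make_batched_input(const std::vector<std::vector<float>>& tiles_chw,
                              std::vector<float>& batch) {
    const size_t per_tile = 3 * 640 * 640;
    const int64_t shape[4] = {static_cast<int64_t>(tiles_chw.size()), 3, 640, 640};

    batch.resize(tiles_chw.size() * per_tile);
    for (size_t i = 0; i < tiles_chw.size(); ++i)
        std::memcpy(batch.data() + i * per_tile, tiles_chw[i].data(),
                    per_tile * sizeof(float));

    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    return Ort::Value::CreateTensor<float>(mem, batch.data(), batch.size(), shape, 4);
}
```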
If I instead kick off 12 threads outside of SetIntraOpNumThreads, actually calling Run on separate threads, each with its own input and output tensors, I get a 1200% performance boost and better resource utilization.
So far I am up to 90 fps at size 640 with an fp32 YOLO model exported to ONNX, around 10 MB. I haven't fully optimized it yet or tried quantization or other models.
The Arm Compute Library seems to crash if you try to thread inference manually, despite ONNX Runtime saying this is supported. The default CPU math kernels, however, seem nearly as optimized as ACL.
Is this anyone else's experience, that for the best performance you more or less have to do custom threading around the APIs?
The short of it is: with OpenMP, add #pragma omp parallel for in front of a loop that calls ONNX Runtime's Run, with OMP_NUM_THREADS > 1 (it won't crash if it's set to 1). Each thread must have its own input and output tensors.
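The pattern looks roughly like this. It's a sketch: the input/output names are placeholders for whatever the exported model actually uses (query them from the session), the tiles are assumed to be preprocessed CHW float buffers, and only the Ort::Session is shared across threads:

```cpp
#include <onnxruntime_cxx_api.h>
#include <omp.h>
#include <vector>

// One shared session, one Run() call per tile, one private input/output
// tensor pair per loop iteration. Run() itself is documented as thread-safe.
void infer_tiles(Ort::Session& session, std::vector<std::vector<float>>& tiles_chw) {
    const int64_t shape[4] = {1, 3, 640, 640};
    const char* input_names[]  = {"images"};   // placeholder, check your model
    const char* output_names[] = {"output0"};  // placeholder, check your model
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

    // Needs OMP_NUM_THREADS > 1 to actually fan out across cores.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(tiles_chw.size()); ++i) {
        Ort::Value input = Ort::Value::CreateTensor<float>(
            mem, tiles_chw[i].data(), tiles_chw[i].size(), shape, 4);
        auto outputs = session.Run(Ort::RunOptions{nullptr},
                                   input_names, &input, 1,
                                   output_names, 1);
        // ... decode detections from outputs[0], offset by this tile's origin ...
    }
}
```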
I have not written a test case, as there is a lot of boilerplate involved, and the problem went away just by avoiding the Arm Compute Library. ONNX Runtime says ACL is experimental anyway. I am not sure whether the Docker build from Ampere uses it, but that Docker image is pretty old. This is against ONNX Runtime built from git HEAD.
I will post the entire code on GitHub in several days; I am cleaning it up first.
Running multiple instances of inference in parallel against the same ONNX model (YOLOv12 in this case) seems to cause a crash inside ACL's library. But it's not impacting me, as I just disabled ACL.
I'm more curious about CPU inference on the Ampere platform and any special customizations to get the best performance. Processing multiple frames of inference in parallel seems to be the biggest one by an order of magnitude, and very specifically NOT via bigger batches but by actually threading multiple calls to Run, one batch at a time. As a general statement that's probably not doable outside of custom code, but it results in inference speed boosts of close to 1000%.
It doesn't seem like using ACL is properly documented, but to do so, #include <onnxruntime/acl_provider_factory.h> and append the ACL execution provider to the session options when you are setting up the session.
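Something along these lines (a sketch; the second argument to the append call has varied across ONNX Runtime versions, an arena flag in older builds and a fast-math flag in newer ones, so check the acl_provider_factory.h you actually build against):

```cpp
#include <onnxruntime_cxx_api.h>
#include <onnxruntime/acl_provider_factory.h>

// Register the ACL execution provider on the session options before the
// session is created. See the header for the exact meaning of the flag
// argument in your ONNX Runtime version.
Ort::SessionOptions make_acl_session_options() {
    Ort::SessionOptions opts;
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(opts, 1));
    return opts;
}
```

Then create the session with those options as usual, e.g. Ort::Session session(env, "model.onnx", make_acl_session_options());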
Can you provide more detailed instructions? I haven't used OnnxRuntime before. Do they have an example that would demonstrate the behavior you're seeing?
> for the best performance …
I often find that running multiple processes in parallel gets the best throughput on Ampere processors, if the workload is something that can be split across processes. For instance, video processing has much better throughput that way than with threading on every processor I've tested, and I'd expect image processing would as well, as long as you're not running out of memory, since multi-processing has more memory overhead.
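To illustrate the shape of it (nothing Ampere-specific here, and run_inference_on_shard is just a placeholder for whatever per-process pipeline you would run): fork one worker per shard and let each load its own model and work through its own slice of the input.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Placeholder for the per-process pipeline: load the model, then loop over
// this worker's share of the frames.
void run_inference_on_shard(int shard, int num_shards) {
    std::printf("worker %d of %d started\n", shard, num_shards);
}

int main() {
    const int num_workers = 12;  // e.g. one per group of cores
    for (int i = 0; i < num_workers; ++i) {
        if (fork() == 0) {            // child: handle this shard and exit
            run_inference_on_shard(i, num_workers);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}      // parent: wait for all workers
    return 0;
}
```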