I have been experimenting with ONNX Runtime in C and C++, as well as NCNN, using a YOLOv12 image model trained in PyTorch.
One thing I am noticing is that the out-of-the-box optimizations don't quite work for inference on Ampere CPUs. For example, say you have a 1920x1080 image and your model is trained on 640x640 images. You can either:
A) Scale your input down and letterbox it, losing detection resolution
B) Enable dynamic input tensors and letterbox the larger input tensor to a 1:1 aspect ratio
C) Slice the larger image into 640x640 windows and run inference on them either in a loop or as a batch
Focusing on option C…
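For the slicing itself, something like this is enough (a sketch using OpenCV's cv::Mat; it assumes the frame is at least 640 px on each side, and edge tiles are clamped to the border so they overlap their neighbors instead of being padded):

```cpp
#include <algorithm>
#include <vector>
#include <opencv2/core.hpp>

// Slice a frame into tile x tile windows. Tiles that would run past the
// right/bottom edge are clamped back inside the frame, so they overlap the
// previous tile rather than requiring padding.
std::vector<cv::Mat> slice_into_tiles(const cv::Mat& frame, int tile = 640) {
    std::vector<cv::Mat> tiles;
    for (int y = 0; y < frame.rows; y += tile) {
        for (int x = 0; x < frame.cols; x += tile) {
            const int x0 = std::min(x, frame.cols - tile);
            const int y0 = std::min(y, frame.rows - tile);
            tiles.push_back(frame(cv::Rect(x0, y0, tile, tile)).clone());
        }
    }
    return tiles;
}
```

Detections from each tile then need their coordinates shifted by that tile's origin before merging them back into full-frame coordinates.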
Ampere CPUs have many cores.
I am noticing that ONNX Runtime's SetIntraOpNumThreads API doesn't scale indefinitely.
Batching seems to rely heavily on SIMD rather than on threading.
If I batch a group of 640x640 tiles making up a larger image, I get about a 20% performance boost over feeding them in one at a time.
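For reference, the batched path just packs the tiles into a single [N, 3, 640, 640] tensor. A sketch, assuming the model was exported with a dynamic batch dimension and the tiles are already preprocessed into CHW float buffers:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cstring>
#include <vector>

// Pack N preprocessed 640x640 tiles (CHW float, already normalized) into one
// [N, 3, 640, 640] input tensor. The caller owns the batch buffer, which must
// outlive the returned Ort::Value.
Ort::Value make_batched_input(const std::vector<std::vector<float>>& tiles_chw,
                              std::vector<float>& batch) {
    const size_t per_tile = 3 * 640 * 640;
    const int64_t shape[4] = {static_cast<int64_t>(tiles_chw.size()), 3, 640, 640};

    batch.resize(tiles_chw.size() * per_tile);
    for (size_t i = 0; i < tiles_chw.size(); ++i)
        std::memcpy(batch.data() + i * per_tile, tiles_chw[i].data(),
                    per_tile * sizeof(float));

    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    return Ort::Value::CreateTensor<float>(mem, batch.data(), batch.size(), shape, 4);
}
```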
If I instead kick off 12 threads outside of SetIntraOpNumThreads, actually calling Run on separate threads, each with its own input and output tensors, I get a 1200% performance boost and better resource utilization.
So far I am up to 90 fps at size 640 with an fp32 YOLO model exported to ONNX, around 10 MB. I haven't fully optimized it yet or tried quantization or other models.
The Arm Compute Library seems to crash if you try to thread inference manually, despite ONNX Runtime saying this is supported. The default CPU math kernels, however, seem nearly as optimized as ACL.
Is this anyone else's experience, that for the best performance you more or less have to do custom threading around the APIs?
The short of it is: with OpenMP, add #pragma omp parallel for in front of a loop that calls ONNX Runtime's Run, with OMP_NUM_THREADS > 1 (it won't crash if it's set to 1). Each thread must have its own input and output tensors.
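The pattern looks roughly like this. It's a sketch: the input/output names are placeholders for whatever the exported model actually uses (query them from the session), the tiles are assumed to be preprocessed CHW float buffers, and only the Ort::Session is shared across threads:

```cpp
#include <onnxruntime_cxx_api.h>
#include <omp.h>
#include <vector>

// One shared session, one Run() call per tile, one private input/output
// tensor pair per loop iteration. Run() itself is documented as thread-safe.
void infer_tiles(Ort::Session& session, std::vector<std::vector<float>>& tiles_chw) {
    const int64_t shape[4] = {1, 3, 640, 640};
    const char* input_names[]  = {"images"};   // placeholder, check your model
    const char* output_names[] = {"output0"};  // placeholder, check your model
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

    // Needs OMP_NUM_THREADS > 1 to actually fan out across cores.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(tiles_chw.size()); ++i) {
        Ort::Value input = Ort::Value::CreateTensor<float>(
            mem, tiles_chw[i].data(), tiles_chw[i].size(), shape, 4);
        auto outputs = session.Run(Ort::RunOptions{nullptr},
                                   input_names, &input, 1,
                                   output_names, 1);
        // ... decode detections from outputs[0], offset by this tile's origin ...
    }
}
```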
I have not written a test case, as there is a lot of boilerplate involved, and the problem went away just by avoiding the Arm Compute Library. ONNX Runtime says ACL is experimental anyway. I am not sure whether the Docker build from Ampere uses it, but that Docker image is pretty old. This is against ONNX Runtime built from git HEAD.
I will post the entire code on GitHub in several days; I am cleaning it up first.
Running multiple instances of inference in parallel against the same ONNX model (YOLOv12 in this case) seems to cause a crash inside ACL's library. But it's not impacting me, as I just disabled ACL.
I'm more curious about CPU inference on the Ampere platform and any special customizations to get the best performance. Processing multiple frames of inference in parallel seems to be the biggest one by an order of magnitude, and very specifically NOT via bigger batches but by actually threading multiple calls to Run, one batch at a time. As a general statement that's probably not doable outside of custom code, but it results in inference speed boosts of close to 1000%.
It doesn't seem like using ACL is properly documented, but to do so, #include <onnxruntime/acl_provider_factory.h> and append the ACL execution provider to the session options when you are setting up the session.
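Something along these lines (a sketch; the second argument to the append call has varied across ONNX Runtime versions, an arena flag in older builds and a fast-math flag in newer ones, so check the acl_provider_factory.h you actually build against):

```cpp
#include <onnxruntime_cxx_api.h>
#include <onnxruntime/acl_provider_factory.h>

// Register the ACL execution provider on the session options before the
// session is created. See the header for the exact meaning of the flag
// argument in your ONNX Runtime version.
Ort::SessionOptions make_acl_session_options() {
    Ort::SessionOptions opts;
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(opts, 1));
    return opts;
}
```

Then create the session with those options as usual, e.g. Ort::Session session(env, "model.onnx", make_acl_session_options());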
Can you provide more detailed instructions? I haven't used OnnxRuntime before. Do they have an example that would demonstrate the behavior you're seeing?
> for the best performance …
I often find that running multiple processes in parallel gets the best throughput on Ampere processors, if the workload is something that can be split across processes. For instance, video processing has much better throughput that way than with threading on every processor I've tested, and I'd expect image processing would as well, as long as you're not running out of memory, since multi-processing has more memory overhead.
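To illustrate the shape of it (nothing Ampere-specific here, and run_inference_on_shard is just a placeholder for whatever per-process pipeline you would run): fork one worker per shard and let each load its own model and work through its own slice of the input.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Placeholder for the per-process pipeline: load the model, then loop over
// this worker's share of the frames.
void run_inference_on_shard(int shard, int num_shards) {
    std::printf("worker %d of %d started\n", shard, num_shards);
}

int main() {
    const int num_workers = 12;  // e.g. one per group of cores
    for (int i = 0; i < num_workers; ++i) {
        if (fork() == 0) {            // child: handle this shard and exit
            run_inference_on_shard(i, num_workers);
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}      // parent: wait for all workers
    return 0;
}
```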