Virtual Meetup on Tuesday 28 October

Hello everyone,
On Tuesday 28 October, at 9:00 am Pacific Time, noon Eastern Time, 17:00 CET

Join the meeting

Meeting ID: 284 915 122 084
Passcode: E7jp7b9h

This virtual meetup presents two technical briefings designed for engineers and architects focused on high-performance AI and SIMD productivity.

Shiva Kintali, Director of Engineering at Ampere

  • Title: A Brief History of AI and its Optimization on Multi-core CPUs
  • Focus: concise tracing of AI algorithmic generations (statistical ML → CNNs/RNNs → Transformers/Diffusion) and the orders-of-magnitude growth in compute. Practical, CPU-first optimization techniques will be demonstrated: model quantization, KV cache and advanced caching patterns, and hardware-aware task scheduling to extract throughput from high-core-count, cloud-native CPUs. Outcome: actionable deployment patterns for large-model workloads on general-purpose hardware.

Konstantinos Margaritis, Founder & CTO of VectorCamp

  • Title: Accelerate SIMD programming inside your IDE
  • Focus: SIMD.info as a distilled ISA knowledge base, SIMD.ai finetuned models that outperform larger models on ISA porting, and a VS Code extension that integrates x86 and Arm SIMD workflows. Outcome: measurable reduction in development time when porting SIMD code and a reproducible workflow inside the IDE.

4 to 5 pm GMT, so morning PST.


Yes, I forgot to add the time to this post. I will edit it in shortly!


Here is a link to Konstantinos’ learning paths on Arm’s site:

Other links:

If you have questions, please add them and we will try to get you some answers.

Link to Shiva’s GitHub

Forgot to join the meeting :joy:

@konstantinos I have received a couple of emails with questions, and I figured that I would post them here to see if you or someone else in the community had an answer:

  • How is this different from some libraries that are already out there, like sse2neon (GitHub - DLTcollab/sse2neon: a translator from Intel SSE intrinsics to Arm/AArch64 NEON implementation)?

  • With regards to the instruction distribution

    • Are those static counts? I assume they are
    • Dynamic counts would be interesting
    • Example: dup instructions are used to initialize vectors before a loop, not inside the loop. Intrinsics within the loop are executed maybe 1 million times. I don’t care as much about dup instruction performance compared to the ones in the loop.
  • Other optimizations: dependencies can really kill performance. Does the tool take care of dependencies?

  • How does LLVM calculate latency and throughput? I use the Ampere compiler guides to get the info :blush:


Just want to say thank you for putting this together; I found the topics pretty insightful. It would also be nice if it could run a little longer for networking and chatting. I think these kinds of virtual events are great for that sort of thing.


Hi @Aaron, thank you for organizing this event! It was a great way to meet and talk while each of us presented our work. I hope we get more of these in the future!

Now, regarding the questions.

SSE2NEON is a C library to translate code from SSE to (surprise!) NEON. It’s a good library (personally I prefer SIMD Everywhere), but it serves a different purpose.

Our tool SIMD.ai does not translate the code 1-1; it helps translate the algorithm instead.

It is based on the data that we have organized on our knowledge-base site, SIMD.info. We have manually catalogued more than 10k C intrinsics across Intel SSE4.2/AVX/AVX2/AVX-512, Arm Neon, and Power VSX, and we’re working to include other SIMD engines, like Arm SVE2, IBM Z, RISC-V RVV 1.0, Loongson LSX/LASX, MIPS SIMD Architecture (MSA), and also the matrix extensions (AMX, SME2, MMA).

This creates an extremely distilled dataset which is fed to our LLM, so it can respond with greater accuracy than a trillion-parameter model like ChatGPT, which has to scan all of GitHub and thousands of manuals to get the same quality of information. It’s still not perfect: for simple to moderately complex code it can reach 70-95% accuracy, while it completely fails for really complicated algorithms, but that’s expected. We don’t want to replace the developer, but to assist them in completing tedious tasks faster. You will not become a SIMD expert by using our tool, but if you are a SIMD expert and know how to use it, it can save you a lot of time. At least it works for us, especially when used with our VS Code extension, Code.SIMD.ai, which integrates both platforms into VS Code.
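
To make the 1-1 vs. algorithm distinction concrete, here is a hypothetical illustration (the function names are mine, not SIMD.ai output): summing the four lanes of a float vector. A literal port mechanically reproduces the SSE hadd idiom with pairwise adds, while an algorithm-level port uses the native AArch64 across-lanes reduction.

```c
#include <arm_neon.h>

/* 1-1 style port: mechanically reproduces the SSE hadd/hadd pattern */
static inline float hsum_literal(float32x4_t v) {
    float32x2_t lo = vget_low_f32(v);    /* lanes 0,1 */
    float32x2_t hi = vget_high_f32(v);   /* lanes 2,3 */
    float32x2_t s  = vpadd_f32(lo, hi);  /* pairwise add, like _mm_hadd_ps */
    s = vpadd_f32(s, s);                 /* final pairwise add */
    return vget_lane_f32(s, 0);
}

/* Algorithm-level port: AArch64 has a single across-lanes reduction */
static inline float hsum_idiomatic(float32x4_t v) {
    return vaddvq_f32(v);                /* add all four lanes at once */
}
```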

Regarding the distribution statistics:

  • Yes, they are static. We will create new ones next week; we would like to run them on a weekly basis and also show the evolution over time.
  • You are correct: most of the time, DUP instructions are used to initialize vectors before the loop. However, I have plenty of algorithms that perform a DUP with a dynamic element (it’s not the same intrinsic, but it’s the same asm instruction). In any case, that was just an example. vqrshrun_n_s16 is definitely used inside a performance-critical loop in many places, mostly video codecs (see the sketch after this list). The same holds for other architectures: there are instructions which are found quite frequently in pairs/triplets, and that information might be useful to a CPU designer/architect.
  • I’m not sure what you mean by dependencies here. We only care about the SIMD code that is given to the tool; the tool will not optimize how you use a C library, Boost, or the STL, for example.
  • There is a tool called llvm-mca which provides all this information and more; it’s really useful. The sketch below shows one way to invoke it.
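
A minimal sketch of the pattern from the second bullet, with hypothetical names (scale_narrow, kernel.c, and the -mcpu value are illustrative): the DUP executes once before the loop, while the saturating narrow executes every iteration, so the loop body dominates. The leading comment shows one way to feed the compiled output to llvm-mca for latency/throughput estimates.

```c
/* One way to inspect latency/throughput with llvm-mca, e.g.:
 *   clang -O3 --target=aarch64-linux-gnu -S -o - kernel.c |
 *     llvm-mca -mcpu=neoverse-n1 -timeline
 */
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void scale_narrow(const int16_t *src, uint8_t *dst, size_t n, int16_t s) {
    int16x8_t vs = vdupq_n_s16(s);            /* DUP: executes once */
    for (size_t i = 0; i + 8 <= n; i += 8) {  /* body: executes ~n/8 times */
        int16x8_t v = vmulq_s16(vld1q_s16(src + i), vs);
        /* saturating rounding shift-right-narrow, common in video codecs */
        vst1_u8(dst + i, vqrshrun_n_s16(v, 4));
    }
}
```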

I hope I answered your questions adequately.


Hi @konstantinos,

Thanks for the detailed answers. I sent those questions to Aaron to post since I didn’t have an account yet.

Sounds like your tool does not just translate instructions; it takes the context (the algorithm) into account. That’s great. Too often the “just translate” approach ends up with too many unneeded instructions.

Regarding static vs. dynamic statistics: the DUP instruction was just an example; I have others. I understand that dynamic counts would require running the code, with the right parameters, etc. That’s a huge amount of work, and probably not feasible for all projects.

With regards to dependencies, I should have been clearer: those are dependencies between vector register variables. For example, when building a sum in a loop, always using the same vector sum variable will stall because each add has to wait for the previous add to finish. Using multiple variables will avoid those stalls and utilize both pipelines.
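
A minimal sketch of that idea, assuming NEON float adds (the function name is hypothetical): the two accumulators form independent dependency chains, so back-to-back adds don’t have to wait on each other and can issue down separate pipelines.

```c
#include <arm_neon.h>
#include <stddef.h>

float sum_two_accumulators(const float *x, size_t n) {
    /* Two independent accumulators: the two add chains can overlap
     * in the pipelines instead of stalling on the previous add. */
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        acc0 = vaddq_f32(acc0, vld1q_f32(x + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(x + i + 4));
    }
    float s = vaddvq_f32(vaddq_f32(acc0, acc1));  /* combine and reduce */
    for (; i < n; i++) s += x[i];                 /* scalar tail */
    return s;
}
```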

Looking at pairs of instructions is very helpful, especially for instructions that cannot execute on both pipelines; that’s an instant bottleneck.
