While vLLM is primarily GPU-driven, given the rise of highly optimized small LLMs and quantized models, do you think there’s a future for vLLM (or a similar project) to offer highly efficient CPU inference backends, especially on powerful Arm64 server CPUs like Ampere One M? What would be the challenges?
The number of AI projects — libraries, frameworks, MLOps tools — is growing rapidly. Can you share more information about them?
I am all for it! One of vLLM’s slogans (“Any model, any hardware, any cloud”) certainly fits. Right now I believe the Arm64 CPU implementation is not compelling - it uses FP32 for working data and very little SIMD acceleration - and some of its dependencies lack Ampere performance improvements - but I hope that changes very soon!
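To put the FP32 point in perspective, here is a quick back-of-the-envelope sketch of how much memory the weights alone occupy at different precisions (the 7B parameter count is an assumed size for illustration, not any specific model). On a memory-bandwidth-bound CPU, halving the bytes per parameter roughly halves the data the cores must stream, which is why FP32 working data is such a handicap:

```python
# Back-of-the-envelope weight footprint for a hypothetical 7B-parameter model.
# (7B is an assumed size for illustration, not a specific model.)
PARAMS = 7_000_000_000

def weight_gb(bytes_per_param: float) -> float:
    """Gigabytes needed just to hold the weights at a given precision."""
    return PARAMS * bytes_per_param / 1e9

for precision, nbytes in [("FP32", 4.0), ("FP16/BF16", 2.0),
                          ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{precision:>10}: {weight_gb(nbytes):5.1f} GB")
# FP32 needs 28 GB just for weights; INT4 quantization cuts that to 3.5 GB.
```

Activations, KV cache, and framework overhead come on top of this, but the weight traffic alone shows how much headroom SIMD-accelerated quantized kernels leave on the table.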
I’m already seeing a problem with Ampere’s llama.cpp improvements not landing upstream, which makes it hard to keep up with upstream’s ongoing improvements and extended model support.
vLLM would face even larger problems, since so far it is focused on CUDA and CUDA alone.