I noticed that the Ampere-optimized llama.cpp recommends using two new model quantization methods, Q4_K_4 and Q8R16. Has anyone tried both quantization types? If so, could you share how they performed?
Hi Binh, I tried pulling these two model quantizations in Ollama. Q4_K_4 inference seems faster than Q8R16, maybe because Q4_K_4 uses significantly less memory.
Thanks, Lily. I just tried out Q4_K_4 in my environment, and compared to Q4_K_M it achieves higher TPS for both prompt processing (PP) and token generation (TG). The new quantization method looks really promising.
When you run a Q4_K_4-quantized model, the file type is reported as “Q4_K tiled by 4 rows”:
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K tiled by 4 rows
And the file type is “Q8 tiled by 16 rows” for a Q8R16-quantized model.
Has anyone tried the current upstream and compared its performance?
For your reference, I compared the latest Ampere-optimized llama.cpp v3.2.0 with the upstream llama.cpp on an AltraMax M128-30 using Llama 3.1 8B in Q8. The Ampere-optimized build achieved roughly 2.1x total token throughput compared with upstream. By the way, which model and quantization are you using?
I’ve been reading a bit into this stuff, with the help of some “Llama 3 deep dive” articles (linked below), and it’s interesting! As I read it, Q8R16 quantization means that the weights in the transformer layers (the attention heads and the feed-forward network) are quantized to 8 bits, while the rest stays at 16 bits.
A large language model is, essentially, made up of four main parts:
- A prompt is tokenized and embedded: each token maps to an entry in the dictionary of possible tokens (128K entries for Llama 3 8B), which is then mapped into a large-dimensional vector space (4096 dimensions for Llama 3 8B). The embeddings encapsulate the “meaning” of a token - so the token “-ing”, for example, can translate to a vector meaning an action related to a preceding verb. Position information is also mixed in (in Llama’s case via rotary position embeddings applied during the attention step).
- The vector is then multiplied by weight matrices (the Q, K, and V projections) that map it into smaller vector spaces, which are used to work out how much the other tokens in the context window affect the value (read: meaning) of the token we’re processing. For example, let’s say that the token before “-ing” is “fish” - that clearly impacts the meaning of the “-ing”: we’re hunting for fish, or poking around for information, or we’re an adjective about to describe an object like a rod or a hook. It all depends on the other tokens in the context window. This is the attention step.
- After the attention step has mixed in the other tokens that affect its meaning, the embedding vector of our current token is passed through the feed-forward network (FFN), generating a new, updated vector that carries the pertinent information from the context window in its value. For example, the token “hello” from “Stop, you had me at hello” (Jerry Maguire) would be updated with the context of all of the dialogue before it: we know that it’s the end of a sentence, and that the meaning of “hello” in this sense includes sentiments of reconciliation, forgiveness, and so on. In one 4096-dimensional vector, we now have the entire relevant context of what came before “hello” included in its value.
- There is not just one round of this - Llama 3 8B has 32 transformer layers, each with 32 attention heads, all doing a similar job, integrating attention and meaning along different directions.
- After passing through all 32 transformer layers, we finally end up with an output that condenses all of this context and information into a single 4096-dimensional vector, which we now use to predict the next token. We project from that 4096-dimensional vector to a probability distribution over our dictionary, representing the probability that each token in the dictionary comes next. Depending on the sampling settings (temperature, top-k, and so on), we pick from the handful of most likely tokens and feed the prompt with the new token back through the machine (caching all the information that we can in the KV cache). A toy sketch of this loop follows below.
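Here’s that toy sketch in Python/NumPy. To be clear, everything in it (the tiny dimensions, the single attention head, the random weights, the missing normalization and rotary embeddings) is my own simplification for illustration - it shows the shape of the loop, not how llama.cpp or Llama 3 actually implements it:

```python
import numpy as np

# Toy sizes only - the real Llama 3 8B uses a ~128K-token vocabulary,
# 4096-dimensional embeddings and 32 layers.
VOCAB, D_MODEL, N_LAYERS = 100, 16, 2
rng = np.random.default_rng(0)

# 1) Embedding table: each token id maps to a D_MODEL-dimensional vector.
embedding = rng.normal(size=(VOCAB, D_MODEL))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv):
    # 2) Attention: each token scores every earlier token and mixes in the
    # ones that affect its meaning (single head, causal mask).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores[np.triu(np.ones_like(scores), k=1).astype(bool)] = -np.inf  # no peeking ahead
    return softmax(scores) @ v

def ffn(x, w1, w2):
    # 3) Feed-forward network, applied to each token position independently.
    return np.maximum(x @ w1, 0.0) @ w2

# Random weights for each layer (the real model also has norms, multiple
# heads and a gated FFN with a much larger hidden size).
layers = [{name: rng.normal(size=(D_MODEL, D_MODEL)) * 0.1
           for name in ("wq", "wk", "wv", "w1", "w2")}
          for _ in range(N_LAYERS)]
w_out = rng.normal(size=(D_MODEL, VOCAB)) * 0.1   # output ("unembedding") projection

def next_token(token_ids, temperature=0.8, top_k=4):
    x = embedding[token_ids]                      # (seq_len, D_MODEL) residual stream
    for layer in layers:
        x = x + attention(x, layer["wq"], layer["wk"], layer["wv"])
        x = x + ffn(x, layer["w1"], layer["w2"])
    logits = x[-1] @ w_out                        # only the last position predicts
    probs = softmax(logits / temperature)         # temperature reshapes the distribution
    top = np.argsort(probs)[-top_k:]              # keep only the k most likely tokens
    return int(rng.choice(top, p=probs[top] / probs[top].sum()))

# 4) Generate a few tokens by feeding each prediction back in.
prompt = [3, 17, 42]
for _ in range(5):
    prompt.append(next_token(prompt))
print(prompt)
```

The real model adds RMSNorm, rotary position embeddings, grouped-query attention and a much larger gated FFN at every layer, but the overall loop - embed, attend, feed forward, repeat per layer, project to logits, sample - is the same.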
If I understand correctly (and there’s a good chance I don’t), the weight matrices in the 32 transformer layers (the Q, K, and V projection layers, and the FFN matrices), plus the output projection after the final layer, are quantized to 8 bits, while the residual streams (that is, the working data of the model as tokens are being processed), the embedding layer, and all of the layer-normalization weights that rescale vectors along the way are kept at 16 bits. I’m not sure how that affects the model in memory, though, or how it impacts the model’s accuracy. If anyone has insights, I’d love to know more!
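On the memory question, here’s a rough back-of-envelope sketch under the split I described above (per-layer attention/FFN weights plus the output projection at 8-bit; token embeddings and norm weights at 16-bit). The layer shapes come from the published Llama 3 8B config, and I don’t actually know that Q8R16 uses this exact split, so treat the numbers as ballpark only:

```python
# Rough weight-memory estimate for Llama 3 8B under an assumed 8-bit/16-bit split.
D, LAYERS, VOCAB, FFN_H = 4096, 32, 128_256, 14_336  # d_model, layers, vocab size, FFN hidden size
HEAD_DIM, N_KV_HEADS = 128, 8                        # grouped-query attention: 32 Q heads, 8 KV heads

# Per-layer transformer weights: Q/K/V/O projections plus the gated FFN (3 matrices).
attn_per_layer = 2 * D * D + 2 * (D * N_KV_HEADS * HEAD_DIM)  # wq, wo are DxD; wk, wv are smaller (GQA)
ffn_per_layer = 3 * D * FFN_H                                  # gate, up, down projections
transformer = LAYERS * (attn_per_layer + ffn_per_layer)

embed = VOCAB * D            # token embedding table
output = VOCAB * D           # output ("unembedding") projection
norms = LAYERS * 2 * D + D   # RMSNorm weight vectors - tiny

GIB = 1024 ** 3
int8_bytes = (transformer + output) * 1      # ~1 byte per weight at 8-bit (ignoring scale factors)
f16_bytes = (embed + norms) * 2              # 2 bytes per weight at 16-bit
all_f16 = (transformer + output + embed + norms) * 2

print(f"8-bit part : {(transformer + output) / 1e9:.2f} B params -> ~{int8_bytes / GIB:.1f} GiB")
print(f"16-bit part: {(embed + norms) / 1e9:.2f} B params -> ~{f16_bytes / GIB:.1f} GiB")
print(f"total      : ~{(int8_bytes + f16_bytes) / GIB:.1f} GiB (vs ~{all_f16 / GIB:.1f} GiB at FP16 everywhere)")
```

Under those assumptions the quantized weights come out to roughly 8 GiB, a bit over half of the ~15 GiB the same weights would take at FP16 across the board - but again, that depends on my reading of the 8-bit/16-bit split being right.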