SOTA 3-bit quants #5196
Conversation
Commit notes from the branch:
* iq3_xxs quantize/dequantize: RMSE seems a bit high-ish, at about half-way between q2_K and q3_K, so this needs more checking.
* PPL on wiki.test.raw: LLaMA-v1-7B 6.4218, LLaMA-v2-7B 6.3560, Mistral-7B 6.0717. This is better than Q3_K_XS, with a 5% reduction in quantized model size.
* CUDA dot product: PP-512 5891 t/s, TG-128 143.9 t/s.
* Metal performance is decent, ARM_NEON is pathetic.
* Build fails on ROCm.
PPL at 2.31 bpw got 5.822 for Mistral-7B, but Mistral-7B IQ3_XXS has 6.0578.
Dot product still fails. Is this real?
I have always been very careful to state the context length of a PPL result. When I published the 2-bit quants, I did state PPLs for a context of 4096 because this allowed direct comparison with PPL values from a recent paper claiming SOTA performance. Sorry that this is causing confusion.
@Artefact2 Does the change I pushed fix it? I don't have an AMD card to test. I have used ...
The build is fine on Windows with ROCm 5.7.1 now.
This time the dot product accuracy did find an actual bug in the AVX2 implementation.
Is the number of columns in the ffn_down tensor still required to be a multiple of 256? Can we get support for multiples of 128 in addition to 256? If not, what would the issues be?
Yup, I tried it. It's still 256. We need quants that can work on multiples of 128 instead of falling back to legacy quants.
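As an aside, here is a minimal sketch of the constraint being discussed, assuming the usual 256-wide super-block layout; the function name and the fallback type are hypothetical illustrations, not the actual llama.cpp quantization logic:

```cpp
// Hypothetical sketch of the constraint discussed above: super-block based quants
// such as IQ3_XXS need the row size (number of columns) to be a multiple of 256,
// otherwise a different quantization type has to be used for that tensor.
// Names and the fallback type are illustrative, not the actual llama.cpp code.
#include <cstdint>

enum class quant_type { IQ3_XXS, Q4_0 /* legacy type used here as a stand-in fallback */ };

static quant_type choose_tensor_type(int64_t n_cols, quant_type requested) {
    constexpr int64_t kSuperBlock = 256;   // super-block size used by k- and IQ-quants
    if (requested == quant_type::IQ3_XXS && n_cols % kSuperBlock != 0) {
        return quant_type::Q4_0;           // row size not divisible by 256 -> fall back
    }
    return requested;
}
```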
Hello. Is it possible to use the techniques from the Additive Quantization for Language Models (AQLM) paper? It seems to have excellent results.
More excellent than these results? From the quoted paper, table 2 (3-bit quantization, which is the subject of this PR), I see WikiText perplexities of 5.46, 4.83, and 3.36 for the LLaMA-v2 models. The authors of such papers always use a context of 4096 (even if not mentioned explicitly in the text, one can deduce it from the PPL they report for the unquantized model). What is the difference between what they have done and what is being done here? Basically: ...

So, in short, I selected the approach in this PR for a) inference performance reasons (D4-lattice instead of E8, which is theoretically better), and b) practicality.
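To illustrate the inference-performance point, here is a minimal sketch of the lookup-style dequantization that a small lattice codebook makes possible, assuming an 8-bit index per group of 4 quants as described in this PR; the table name and its contents are placeholders, not the actual ggml grid:

```cpp
// Illustrative sketch (not the actual ggml code) of why a small lattice codebook is
// cheap at inference time: every 8-bit index selects a pre-computed group of 4 quant
// magnitudes from a 256-entry table, so dequantization is one lookup per 4 weights.
// The table contents are placeholders, not the real IQ3_XXS grid values.
#include <cstdint>

static const uint32_t kGrid256[256] = { /* 4 packed magnitudes per entry (placeholder) */ };

static inline void decode_group_of_4(uint8_t idx, uint8_t out[4]) {
    const uint32_t packed = kGrid256[idx];   // one table lookup covers 4 weights
    out[0] = (uint8_t)(packed >>  0);
    out[1] = (uint8_t)(packed >>  8);
    out[2] = (uint8_t)(packed >> 16);
    out[3] = (uint8_t)(packed >> 24);
}
```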
You can add IQ3_XXS to test-backend-ops like this:
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 55ce14e0..3eec0554 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1481,6 +1481,7 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
GGML_TYPE_Q4_K, GGML_TYPE_Q5_K,
GGML_TYPE_Q6_K,
GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS,
+ GGML_TYPE_IQ3_XXS,
};
// unary ops
The ... Might want to take a look, and if there is no way to make them match, we should probably disable the ...

It's strange, because I'm pretty sure that earlier, when I wrote to enable the tests, they passed on my Mac, but now they are failing and there haven't been any changes since then.
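For context, here is a minimal sketch of the kind of normalized error such backend tests compare against a threshold, assuming the usual "backend output vs. CPU reference" setup; this is illustrative, not the actual test-backend-ops code:

```cpp
// Illustrative only: the kind of normalized mean squared error a backend test can
// compute when comparing a backend's MUL_MAT result against the CPU reference.
// This is a sketch, not the actual code in test-backend-ops.cpp.
#include <cstddef>
#include <vector>

static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    double err = 0.0; // sum of squared differences
    double nrm = 0.0; // sum of squared reference values
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err += d * d;
        nrm += (double) ref[i] * (double) ref[i];
    }
    return err / nrm; // the test passes while this stays below a small threshold
}
```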
I cannot find any of the new SOTA 2 or 3 bit quants (e.g., IQ3_XXS or Q3_K_XS) available yet when I checked:

➜ llama.cpp git:(master) quantize
usage: quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] model-f32.gguf [model-quant.gguf] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
Allowed quantization types:
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
12 or Q3_K : alias for Q3_K_M
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 13.00G @ 7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing

Any idea how to quantize using IQ3_XXS?
ikawrakow listed the commands in the previous pull request for the SOTA 2-bit quants.

I am guessing you replace ...

Edit: I tried the method and it showed a perplexity of about 3200 for the model.

Edit #2: It turns out that you need to use a .imatrix extension rather than .dat now.
* iq3_xxs: quantize/dequantize
  RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more.
* iq3_xxs: CUDA dequantize works
* iq2_xxs: tuning quantization
* iq3_xxs: starting to look better
  PPL on wiki.test.raw: LLaMA-v1-7B 6.4218, LLaMA-v2-7B 6.3560, Mistral-7B 6.0717. This is better than Q3_K_XS, with a 5% reduction in quantized model size.
* iq3_xxs: CUDA dot product
  We have PP-512: 5891 t/s, TG-128: 143.9 t/s.
* iq3_xxs: scalar and AVX2 dot products
* iq3_xxs: ARM_NEON and Metal
  Metal performance is decent, ARM_NEON is pathetic.
* iq3_xxs: slightly better grid points
* Faster iq3_xxs and iq2_xs dot products on CUDA
* iq3_xxs: add some quant mix
* iq3_xxs: fix failing quantization test
  Dot product still fails. Is this real?
* iq3_xxs: hopefully fix ROCm
* iq3_xxs: failing tests
  This time the dot product accuracy did find an actual bug in the AVX2 implementation.
* Add IQ3_XXS to test-backend-ops

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
TL;DR

This PR adds "true" 3-bit quants (3.0625 bpw due to block structure) as IQ3_XXS. Both model size and perplexity are lower compared to Q3_K_XS.
Details

The table shows a comparison between IQ3_XXS added by this PR and Q3_K_XS for several models. Sizes are in GiB; perplexity is for a context of 512 tokens and uses an importance matrix from wiki.train.raw. Even though nobody uses LLaMA-v1 these days, I have added the results in view of the fact that early quantization work in this repository did happen using LLaMA-v1. When k-quants were first released in PR #1684, the smallest quantized model at the time was Q2_K, with a size of 2.67 GiB and a perplexity of 6.773.

To avoid confusion with the PPL tables in the 2-bit quant PRs, here is a table of PPL values for the new IQ3_XXS quantization type for a context of 4096 tokens:

How
The approach follows in the footsteps of IQ2_XXS and IQ2_XS (#4773, #4856, #4897):

* A lattice is used to encode groups of quants. In IQ2_XXS and IQ2_XS it is the E8-lattice for groups of 8 quants; here it is the D4-lattice (https://en.wikipedia.org/wiki/16-cell_honeycomb) for groups of 4 quants.
* Blocks of 32 quants use 8 x 8 + 4 x 7 = 92 bits to encode quant magnitudes and signs. This leaves 4 spare bits for an unsigned block scale, arriving at exactly 3 bpw. The super-block fp16 scale needs another 16 bits per super-block of 256, so we end up using 3.0625 bpw.
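To make the bit counting above concrete, here is a small sketch of the implied storage layout. It is an illustration derived from the numbers in this description; the struct and field names are not necessarily the exact ggml definitions.

```cpp
// Sketch of the storage budget described above (illustrative). Per super-block of 256:
//   - 1 fp16 super-block scale                               = 16 bits
//   - 8 blocks of 32 weights, each using
//       8 groups-of-4 x 8-bit D4 grid index                  = 64 bits
//       4 x 7 bits of sign information                       = 28 bits  (92 total)
//       1 x 4-bit unsigned block scale                       =  4 bits  (96 = 3 bpw)
// Total: 16 + 8*96 = 784 bits per 256 weights = 3.0625 bpw.
#include <cstdint>

constexpr int QK_K = 256;                  // weights per super-block

struct block_iq3_xxs_sketch {
    uint16_t d;                            // fp16 super-block scale (stored as raw bits)
    uint8_t  qs[3 * QK_K / 8];             // 96 bytes: grid indices, signs, block scales
};

static_assert(sizeof(block_iq3_xxs_sketch) == 2 + 3 * QK_K / 8,
              "98 bytes per 256 weights -> 98*8/256 = 3.0625 bits per weight");
```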