Bug: gemma 2 27B GGML_ASSERT n_dims <= ne0 #8246
Comments
I'm experiencing the exact same issue, though I'd classify it as critical, since having llama.cpp constantly crash is quite detrimental. For me the issue only started after the logit soft-capping fix was merged; before that there was no crashing, but the generation quality was obviously much lower without that fix. In case it matters, I'm on Windows 11.
Having a similar issue with llama.cpp b3280 (llama-cli.exe, CUDA 12, 64-bit).
On Debian 12, on the master branch (commit f8c4c07):

GGML_ASSERT: ggml/src/ggml.c:13968: n_dims <= ne0
ptrace: Operation not permitted.
No stack.
The program is not being run.
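For context on what that message means: GGML_ASSERT is ggml's runtime assertion macro, which prints the file, line, and failed condition and then aborts the process, so every trigger of this check takes llama.cpp down. A minimal sketch of that behaviour (illustrative only, assuming the general shape of the macro, not the exact ggml.h definition):

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch of an abort-on-failure assertion macro, assuming the
 * general shape of GGML_ASSERT; not the exact ggml.h definition. */
#define GGML_ASSERT(x)                                          \
    do {                                                        \
        if (!(x)) {                                             \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n",         \
                    __FILE__, __LINE__, #x);                    \
            abort();                                            \
        }                                                       \
    } while (0)

int main(void) {
    /* Hypothetical values, for illustration only. */
    const int n_dims = 256;
    const int ne0    = 128;
    GGML_ASSERT(n_dims <= ne0); /* fails: prints the message above and aborts */
    return 0;
}
```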
Find the last commit that works |
#8156 (Add support for Gemma2ForCausalLM) crashes as well. To reproduce, I used an 8k context and a 74k prompt; larger ctx values improve stability somewhat.

Build: make clean && make GGML_CUDA=1 GGML_LTO=1 -j
Run: ./llama-cli -m /work/models/misc/gemma-2-27b-it-Q6_K_L.gguf -t 6 --color --interactive --conversation --multiline-input --mirostat 2 --ctx-size 8192 --n-gpu-layers 12 --keep -1

Model from https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/tree/main
#8156 seems to work for me, but the output quality is bad. Another note: I have an RTX 3060 12 GB and a GTX 1660 Super 6 GB. Maybe it has something to do with multiple GPUs, but I'm not sure (I'm guessing based on two identical error lines). I'm on Windows 10. @derjoshder has a similar setup to mine.
I only have one GPU, an RTX 3080 10 GB, so I don't think it's multi-GPU related. It's also worth mentioning that I'm mostly running the model on the CPU, since I don't have a lot of VRAM.
Apparently this issue has made it downstream: I'm getting it in KoboldCpp (which uses llama.cpp internally) on my AMD 7800 XT (using hipBLAS). Also a single GPU, though of course I'm not offloading all layers. I checked the ggml.c file it references, and this is the function the crash points to:
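(The pasted block boils down to the NeoX-style RoPE rotation loop; the following is a paraphrased sketch of it, assuming the usual pairing of element i0 with element i0 + n_dims/2, and is not the verbatim ggml.c source.)

```c
#include <math.h>

/* Paraphrased sketch of the NeoX-style RoPE rotation loop; not the verbatim
 * ggml.c source. Element i0 is paired with element i0 + n_dims/2, so the row
 * pointed to by src must hold at least n_dims floats, which is what the
 * GGML_ASSERT(n_dims <= ne0) check guards against. */
static void rope_rotate_row(const float * src, float * dst,
                            int n_dims, int pos, float freq_base) {
    for (int i0 = 0; i0 < n_dims/2; i0++) {
        const float theta     = pos * powf(freq_base, -2.0f*i0/n_dims);
        const float cos_theta = cosf(theta);
        const float sin_theta = sinf(theta);

        const float x0 = src[i0];
        const float x1 = src[i0 + n_dims/2];  /* the "const float x1" read */

        dst[i0]            = x0*cos_theta - x1*sin_theta;
        dst[i0 + n_dims/2] = x0*sin_theta + x1*cos_theta;
    }
}
```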
Specifically it crashes on the "const float x1" line. The crash always points to the same line of the same ggml.c file every time, and it happens in a pretty specific way for me too. I'm running Gemma 2 27B Q4_K_L (I originally tried Q4_K_M and then switched, with exactly the same result) at 16384 context. I might get a rare crash as early as 10K into the context, but once I fill the full 16K it crashes on roughly every second generation. I'm generally connecting to the API with SillyTavern, and I don't know whether it fails to discard old context or what, but I can start a new chat and it may still crash almost immediately. I have to fully close everything and reload, after which it stops crashing until I reach a higher context again, so I do think this has something to do with context in some form.

I double-checked my model file, even completely redownloading the whole thing, and the md5sum comes out the same. (As a side note, it would be nice if sites like Hugging Face listed checksums so one could verify downloads in a smarter way than downloading multiple times and comparing sums.) Not sure if this directly helps, but it does confirm the issue can happen on different setups, including AMD. I'm on Manjaro 24.0.3.

EDIT: On someone's suggestion I disabled KoboldCpp's context shift feature to see what would happen. That seems to have stopped the crashes, though it also means reprocessing the full prompt almost every time, which is very slow on my hardware. Interestingly, with context shift disabled, it now limits context to less than the maximum: generally around 14.2K (out of my 16K), though when I hit a global key or similar it can get as high as 15.8K, just under 16. Could this be something like estimating the token count wrong and overflowing? I guess the rest of you aren't using context shifting since you're on plain llama.cpp. I wonder what happens if you adjust your maximum outputs so they stay below what you've set llama.cpp to load, e.g. something like 6K for the 8K users.
Check if #8348 fixes the issue |
Looks good |
What happened?
I got the error using different quants from different authors. After asking the LLM a few questions, llama.cpp crashed with these two lines:
Name and Version
Tested using b3266 and b3276, same result.
What operating system are you seeing the problem on?
Windows
Relevant log output