
ggml : do not use ARM features not included in the build #10457

Merged 1 commit into master on Nov 23, 2024

Conversation

slaren (Collaborator) commented Nov 22, 2024

Fixes #10435

slaren merged commit 55ed008 into master on Nov 23, 2024 (55 checks passed).
slaren deleted the sl/fix-arm-features branch on Nov 23, 2024.
gustrd (Contributor) commented Dec 3, 2024

Excuse me @slaren, is it possible to do the same for the Android ARM build? I observed the same regression with q4_0_4_4 on the Snapdragon 8 Gen 1.

slaren (Collaborator, Author) commented Dec 4, 2024

I am not aware of any issues with Android; this should work the same way on all ARM platforms.
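
For anyone curious how this works across platforms: below is a minimal sketch of the pattern the fix relies on, assuming a Linux/Android aarch64 target. The function name is hypothetical and this is not the actual ggml code; it only illustrates gating a feature on both compile-time support and runtime detection (`getauxval` is how the kernel exposes CPU capabilities on both Linux and Android, which is why the same approach covers both).

```c
// Minimal sketch, NOT the actual ggml implementation: a feature is used only
// when it was both compiled into the binary and detected on the running CPU.
#include <stdbool.h>
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ASIMDDP (aarch64 Linux/Android)

static bool cpu_can_use_dotprod(void) {
#if defined(__ARM_FEATURE_DOTPROD)
    // The dotprod kernels were compiled in; use them only if the CPU we are
    // actually running on supports the instructions.
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
#else
    // The binary was built without dotprod support, so the kernels do not
    // exist: never report the feature, even if the CPU has it.
    return false;
#endif
}
```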

gustrd (Contributor) commented Dec 4, 2024

I recently built the latest version of llama.cpp on Android and noticed a significant slowdown when using the q4_0_4_4 quantization format. It seems this slowdown occurs because the new version automatically converts q4_0_4_4 to the q4_0_4_8 format.

However, when I use the new IQ4_NL format, performance remains fast. From what I can tell, the automatic conversion in this case produces a layout similar to q4_0_4_4.

slaren (Collaborator, Author) commented Dec 4, 2024

This is not correct: only Q4_0 (and IQ4_NL) is converted to other types.
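
To illustrate the conversion being discussed: a hedged sketch of the load-time repacking decision, with hypothetical names (the enum and function are not the actual ggml code; only the mapping from CPU features to interleaved layouts follows the discussion above).

```c
// Hedged sketch of the load-time repacking decision, NOT the ggml code.
#include <stdbool.h>

enum q4_0_repack {
    REPACK_NONE,  // keep plain Q4_0
    REPACK_4X4,   // q4_0_4_4-style interleaving (plain NEON)
    REPACK_4X8,   // q4_0_4_8-style interleaving (i8mm)
    REPACK_8X8,   // q4_0_8_8-style interleaving (SVE)
};

// Only Q4_0 (and IQ4_NL) tensors are repacked; every other quant type is
// loaded as-is. The widest layout the detected CPU features support wins.
static enum q4_0_repack choose_repack(bool has_neon, bool has_i8mm, bool has_sve) {
    if (has_sve)  return REPACK_8X8;
    if (has_i8mm) return REPACK_4X8;
    if (has_neon) return REPACK_4X4;
    return REPACK_NONE;
}
```

The practical upshot is that a file stored as plain Q4_0 stays portable: the interleaved layout is chosen at load time from what the running CPU supports, instead of being baked into the file.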

gustrd (Contributor) commented Dec 4, 2024

You're absolutely right! When I run with q4_0, it gets repacked into the correct format, and the token generation speed is consistent.

I'm not sure why q4_0_4_4 has become slower, but it doesn't seem like a major issue since the format is no longer necessary. Perhaps adding a deprecation warning for q4_0_4_4 could help clarify this for users.

That said, I did notice some performance loss during prompt processing compared to the old version using q4_0_4_4. I plan to benchmark this further and open a separate issue with more details. Thanks again for your hard work and patience; it's greatly appreciated!
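
For the follow-up benchmark, llama.cpp's bundled `llama-bench` tool reports prompt processing and token generation separately, so something like `./llama-bench -m model-q4_0.gguf -p 512 -n 128` (the model path and token counts here are placeholders) should make the comparison against the old q4_0_4_4 numbers straightforward.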
