Nomic Vulkan backend #4456
Conversation
…nse (SOM), version 1.0.
should no longer have new external deps other than libvulkan

```
ubuntu@ip-172-31-1-24:~/repo/gpt4all/gpt4all-backend/build$ ldd ./libllamamodel-mainline-avxonly.so
        linux-vdso.so.1 (0x00007ffcb53bb000)
        libvulkan.so.1 => /lib/x86_64-linux-gnu/libvulkan.so.1 (0x00007f239dab5000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f239d800000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f239d719000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f239da95000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f239d400000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f239dd1d000)
```
… and stop using so many static objects so we can tear down and bring up vulkan on new devices in the same runtime.
There are some warnings in debug builds that are likely to be false positives.
… for all kernels.
@ggerganov My concern is that a call to …
While I understand that this is not very high priority, the fix should be very simple (just move the initialization calls to the backend code), so I don't really see any reason not to fix it before merging.
Right now only Falcon and Llama are whitelisted. This backend is strict about the model format because it is designed to run a contiguous graph on the GPU (like Metal), instead of only offloading the ops that are implemented (like Occam's Vulkan PR).
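As a rough illustration of the whitelist described above (a minimal sketch; the function name and architecture strings are placeholders, not the PR's actual code):

```cpp
// Hypothetical sketch, not the actual backend code: only offload models whose
// architecture is known to map cleanly onto the contiguous GPU graph.
#include <string>

static bool kompute_model_whitelisted(const std::string & arch) {
    // architectures currently validated with this backend
    return arch == "llama" || arch == "falcon";
}
```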
At this point my backend also runs the full graph on GPU contiguously (if all layers are offloaded).
The previous attempt actually broke GPU inference with the 'main' example, which was previously working. deviceName is a vk::ArrayWrapper1D; be careful when converting it to a std::string so we don't get null bytes at the end.
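For context, a minimal sketch of one safe way to do that conversion, assuming standard Vulkan-Hpp (the helper name is illustrative, not the PR's code):

```cpp
#include <vulkan/vulkan.hpp>
#include <string>

// Sketch: convert the fixed-size, null-padded deviceName array to a std::string
// without keeping the trailing '\0' bytes.
static std::string device_name(const vk::PhysicalDeviceProperties & props) {
    // Constructing from the raw pointer stops at the first null terminator,
    // unlike copying all VK_MAX_PHYSICAL_DEVICE_NAME_SIZE elements verbatim.
    return std::string(props.deviceName.data());
}
```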
I get this error using the kompute backend on an Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] (rev 02): `ggml_vk_graph_compute: error: unsupported op 'MUL_MAT'`
What model are you using? Maybe the fallback code got broken by the recent changes (edit: yes, it did) - it sounds like you are using an unsupported quantization type. Also, Intel GPUs are not currently known to work, and I don't know how well integrated GPUs work in general at the moment.
I am using `mistral-7b-instruct-v0.2.Q5_K_M.gguf`.
I can reproduce on Alder Lake (Iris Xe).
I tried with a different quantization, `mistral-7b-instruct-v0.2.Q4_K_M.gguf`, but get the same error: `ggml_vk_graph_compute: error: unsupported op 'MUL_MAT'`
This backend currently only supports Q4_0, Q4_1, and F16 quantizations. The latest master of llama.cpp will at least fall back to CPU in this case instead of failing.
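For illustration, routing on quantization type amounts to a check like the following (a sketch with a hypothetical function name, mirroring the types listed above; not the backend's real API):

```cpp
#include "ggml.h"

// Hypothetical sketch: decide whether a tensor's type can run on the GPU
// kernels, or should fall back to the CPU implementation instead.
static bool kompute_supports_type(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_F16:
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
            return true;   // handled by the GPU kernels
        default:
            return false;  // unsupported quantization: fall back to the CPU
    }
}
```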
It would be nice if it could fall back to the other Vulkan backend.
Signed-off-by: Jared Van Bortel <[email protected]>
Co-authored-by: niansa <[email protected]>
Co-authored-by: Adam Treat <[email protected]>
Co-authored-by: Aaron Miller <[email protected]>
Co-authored-by: ToKiNoBug <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: slaren <[email protected]>
I am really happy the Kompute implementation finally made it into mainline llama.cpp!
Nomic appears to claim support for Qualcomm GPUs by "Nomic Vulkan": perhaps that is different from "Nomic's Kompute based Vulkan backend"; that is not exactly clear. But what is clear is that this backend can't run on Qualcomm GPUs at all, because it wants uniformAndStorageBuffer (8- and 16-bit) access, which their Vulkan driver does not report as supported (on any of their GPUs). FYI, it also can't run on ARM GPUs, because they have a maximum subgroup size of 16 (even in the top-of-the-line Immortalis chips), and this backend wants 32. Any chance these dependencies could be worked around, @cebtenzzre?
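For context, the device requirements mentioned in this comment can be probed with standard Vulkan-Hpp queries; a minimal sketch, assuming the Vulkan 1.1/1.2 feature structs are available (this is not the backend's actual detection code):

```cpp
#include <vulkan/vulkan.hpp>

// Sketch: check the features this comment says the backend needs - 8- and
// 16-bit uniform/storage buffer access and a subgroup size of at least 32.
static bool meets_backend_requirements(vk::PhysicalDevice dev) {
    auto feats = dev.getFeatures2<vk::PhysicalDeviceFeatures2,
                                  vk::PhysicalDeviceVulkan11Features,
                                  vk::PhysicalDeviceVulkan12Features>();
    const auto & f11 = feats.get<vk::PhysicalDeviceVulkan11Features>();
    const auto & f12 = feats.get<vk::PhysicalDeviceVulkan12Features>();

    auto props = dev.getProperties2<vk::PhysicalDeviceProperties2,
                                    vk::PhysicalDeviceSubgroupProperties>();
    const auto & sg = props.get<vk::PhysicalDeviceSubgroupProperties>();

    return f11.uniformAndStorageBuffer16BitAccess
        && f12.uniformAndStorageBuffer8BitAccess
        && sg.subgroupSize >= 32;
}
```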
This is Nomic's Kompute-based Vulkan backend from the GPT4All project, now available under the MIT license. It can be enabled by building with CMake and passing `-DLLAMA_KOMPUTE=ON` (building with make is currently not supported).
Structure
Limitations
- ~~There is currently no partial offload support, so it is either `-ngl 1` or `-ngl 0`, like Metal. We plan to implement this eventually, by implementing a split point in the compute graph. We do not plan to implement per-op offload (what most backends do, including 0cc4am's Vulkan backend).~~ (Partial offloading is now implemented; a conceptual sketch of the split-point idea follows below.)
- ~~GPU-accelerated matmul for prompt processing is currently disabled due to a known issue with incorrect output. Token generation runs 100% on the GPU.~~ (GPU prompt processing has been fixed and re-enabled.)
- ~~This PR still needs to be updated for the per-layer KV cache.~~ Done a while ago.

Contributions are welcome! This backend currently works well enough for Nomic's purposes, so at the moment we are not focused on adding new features, only maintenance - our team is quite small right now. The goal of this PR is to reduce the maintenance burden for us, as llama.cpp frequently introduces changes that affect all backends.
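For illustration only, here is a conceptual sketch of the split-point approach mentioned in the first limitation above. All of the types and helper functions are invented for this sketch; this is not the PR's implementation:

```cpp
// Conceptual sketch: partial offload via a single split point in the compute
// graph. A contiguous prefix of the nodes runs on the GPU backend and the
// remainder on the CPU backend, rather than dispatching each op wherever it
// happens to be supported (per-op offload).
#include <cstdio>
#include <string>
#include <vector>

struct node { std::string op; };   // stand-in for a ggml graph node

static void run_on_gpu(const node & n) { std::printf("GPU: %s\n", n.op.c_str()); }
static void run_on_cpu(const node & n) { std::printf("CPU: %s\n", n.op.c_str()); }

static void compute_with_split(const std::vector<node> & graph, size_t split_point) {
    for (size_t i = 0; i < graph.size(); ++i) {
        if (i < split_point) {
            run_on_gpu(graph[i]);  // contiguous prefix runs on the GPU, like Metal
        } else {
            run_on_cpu(graph[i]);  // everything after the split stays on the CPU
        }
    }
}

int main() {
    std::vector<node> graph = {{"MUL_MAT"}, {"ADD"}, {"SOFT_MAX"}, {"GET_ROWS"}};
    compute_with_split(graph, /*split_point=*/2);
}
```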