-
That's the line I comment out to get it working: I comment it out and rebuild with RPC support and CUDA.
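A rebuild along these lines should produce both binaries with RPC and CUDA enabled. This is only a sketch, assuming a recent llama.cpp tree where the relevant CMake options are named GGML_RPC and GGML_CUDA; adjust paths and flags to your setup:

```sh
# Sketch: rebuild llama.cpp with the RPC backend and CUDA enabled.
# Assumes the GGML_RPC / GGML_CUDA CMake options of a recent tree.
cmake -B build-rpc-cuda -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build-rpc-cuda --config Release -j
# Yields build-rpc-cuda/bin/rpc-server and build-rpc-cuda/bin/llama-server
```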
-
I'm trying to run a model over RPC with --flash-attn --cache-type-k q8_0 --cache-type-v q8_0. On the RPC machine I run:
build-rpc-cuda/bin/rpc-server --host 0.0.0.0 --port 50052
and on the server I run:
build-rpc-cuda/bin/llama-server --port 8989 -m ../llama_cpp/Qwen2.5-32B.Q8_0.gguf -p "Hello, you are coder assistant" -ngl 48 --n-predict -1 --ctx-size 8192 --threads 4 --no-mmap --temp 0.3 --rpc 192.168.3.2:50052 --tensor-split 4,24,20 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
and I get:
```
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/media/yesh/wd/AI/llama.cpp/ggml/src/ggml-rpc.cpp:467: GGML_ASSERT(tensor->ne[0] % 512 == 0 && "unsupported quantized tensor") failed
[New LWP 4401]
[New LWP 4408]
[New LWP 4409]
[New LWP 4410]
[New LWP 4411]
[New LWP 4412]
[New LWP 4413]
[New LWP 4414]
[New LWP 4415]
[New LWP 4416]
[New LWP 4417]
[New LWP 4418]
[New LWP 4419]
[New LWP 4420]
[New LWP 4421]
[New LWP 4422]
[New LWP 4423]
[New LWP 4424]
[New LWP 4425]
[New LWP 4426]
[New LWP 4427]
[New LWP 4428]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f362b8f2b57 in __GI___wait4 (pid=4435, stat_loc=0x7ffd6592c004, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f362b8f2b57 in __GI___wait4 (pid=4435, stat_loc=0x7ffd6592c004, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f362ba3de28 in ggml_abort () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#2 0x00007f362bcb801a in ggml_backend_rpc_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#3 0x00007f362ba81eb8 in ggml_gallocr_alloc_graph () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#4 0x00007f362ba875eb in ggml_backend_sched_alloc_graph () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/ggml/src/libggml.so
#5 0x00007f364153afc8 in llama_decode_internal(llama_context&, llama_batch) () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/src/libllama.so
#6 0x00007f364153cf67 in llama_decode () from /media/yesh/wd/AI/llama.cpp/build-rpc-cuda/src/libllama.so
#7 0x0000561a93bf5127 in common_init_from_params(common_params&) ()
#8 0x0000561a93b8cc7d in server_context::load_model(common_params const&) ()
#9 0x0000561a93b3e70e in main ()
[Inferior 1 (process 4400) detached]
```
Can you help me run the model properly with --cache-type-k q8_0 --cache-type-v q8_0 over RPC?
Without --cache-type-k q8_0 --cache-type-v q8_0 there's no point: the remote machine only has 4 GB of VRAM, and with a large unquantized K/V cache it runs faster purely locally anyway.
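For a rough sense of why the quantized cache matters here, this back-of-the-envelope sketch estimates the K/V cache size at 8192 context. The figures are assumptions on my part (roughly 64 layers, 8 KV heads, head dim 128, i.e. 1024 KV channels per layer, for Qwen2.5-32B), not measured values:

```sh
# Hypothetical KV-cache size estimate (assumed: 64 layers, 8 KV heads,
# head dim 128 => 1024 KV channels per layer, context 8192, K and V).
ctx=8192; layers=64; kv_channels=1024
elems=$((2 * layers * ctx * kv_channels))            # total K + V elements
echo "f16 : $((elems * 2 / 1024 / 1024)) MiB"        # 2 bytes per element
echo "q8_0: $((elems * 17 / 16 / 1024 / 1024)) MiB"  # ~1.06 bytes per element (34-byte blocks of 32)
```

Under those assumptions the full-precision cache is on the order of 2 GiB versus roughly 1.1 GiB at q8_0, which is why the quantized cache matters so much on a 4 GB card.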