Run '.model' in ToolMate AI prompt and select 'llamacppserver' as the LLM interface. This option is designed for advanced users who want more control over the LLM backend; it is particularly useful for customisations such as GPU acceleration.
- You need access to a server that has llama.cpp running on it. If you don't, you may want to build and run a llama.cpp server on your own device, so that you can customise everything to suit your needs.
- Specify the llama.cpp server IP address and port in ToolMate AI settings.
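To confirm that the server is reachable before you point ToolMate AI at it, you can query the llama.cpp server's health endpoint, e.g. (a quick sketch; 192.168.1.10 and 8080 are placeholders for your own server IP and port):
# ask the llama.cpp server for its health status; it should report "ok" once the model is loaded
curl http://192.168.1.10:8080/health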
On macOS, Metal is enabled by default, which makes the computation run on the GPU, so compiling from source is simple.
To compile llama.cpp from source:
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
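After the build finishes, a quick sanity check such as the following (a sketch; paths assume the repository was cloned into your home directory) confirms that the server binary was produced:
# confirm the server binary exists and is executable
ls -lh ~/llama.cpp/llama-server
# print version and build information
~/llama.cpp/llama-server --version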
To configure ToolMate AI:
- Run 'toolmate' in your environment.
- Enter '.model' in ToolMate AI prompt.
- Follow the instructions to enter the command line, server IP, port and timeout settings.
To start up the server, e.g.
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(sysctl -n hw.physicalcpu) --ctx-size 0 --chat-template chatml --parallel 2 --model ~/models/wizardlm2.gguf
Make sure you have your LLM file in *.gguf format downloaded before starting up the server.
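If you have not downloaded a model yet, one option is to fetch a GGUF file from Hugging Face with huggingface-cli; the repository and file names below are placeholders only, so substitute the model you actually want to use:
# install the Hugging Face CLI if it is not already available
pip install -U "huggingface_hub[cli]"
# create the models folder and download a GGUF file into it (repository and filename are placeholders)
mkdir -p ~/models
huggingface-cli download <repo_id> <model_file>.gguf --local-dir ~/models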
Brief description of the options:
--threads $(sysctl -n hw.physicalcpu): set the number of threads to the number of physical CPU cores
--ctx-size: size of the prompt context (default: 0, 0 = loaded from model)
--parallel 2: set the number of slots for processing requests to 2
For more options:
cd llama.cpp
./llama-server -h
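Once the server is up, you can send a quick test request to its OpenAI-compatible chat endpoint to verify that it responds (a minimal sketch; adjust the host, port and prompt to your setup):
# send a simple chat request to the running llama.cpp server
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'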
The following setup enables GPU acceleration on an AMD integrated GPU (iGPU) via ROCm; inference is roughly 1.5x faster than running on the CPU alone. Read https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/igpu_only/igpu_only.md for details.
Tested device: Beelink GTR6 (Ryzen 9 6900HX CPU + integrated Radeon 680M GPU + 64GB RAM)
Followed https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/README.md for ROCm installation.
Environment variables:
export ROCM_HOME=/opt/rocm
export LD_LIBRARY_PATH=/opt/rocm/include:/opt/rocm/lib:$LD_LIBRARY_PATH
export PATH=$HOME/.local/bin:/opt/rocm/bin:/opt/rocm/llvm/bin:$PATH
export HSA_OVERRIDE_GFX_VERSION=10.3.0
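These exports only last for the current shell session; to persist them and to check that ROCm actually detects the iGPU, something like the following may help (a sketch, assuming a bash shell and a standard /opt/rocm installation):
# append the ROCm environment variables to ~/.bashrc so they survive new shells
cat >> ~/.bashrc << 'EOF'
export ROCM_HOME=/opt/rocm
export LD_LIBRARY_PATH=/opt/rocm/include:/opt/rocm/lib:$LD_LIBRARY_PATH
export PATH=$HOME/.local/bin:/opt/rocm/bin:/opt/rocm/llvm/bin:$PATH
export HSA_OVERRIDE_GFX_VERSION=10.3.0
EOF
source ~/.bashrc

# verify that ROCm can see the GPU
rocminfo | grep -i gfx
rocm-smi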
Compile llama.cpp from source:
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_HIPBLAS=1 GGML_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')
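If the build fails or seems to pick up objects from an earlier CPU-only build, clearing the previous build output and recompiling may help (assuming the Makefile's standard clean target):
# remove objects from any previous build, then rebuild with the HIP flags
cd ~/llama.cpp
make clean
make GGML_HIPBLAS=1 GGML_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')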
To start up the server, e.g.
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 999 --model ~/models/wizardlm2.gguf
Please note that we used --gpu-layers in the command above. You may want to change its value (999 in this example) to suit your case.
--gpu-layers: number of layers to store in VRAM
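If the whole model does not fit into the memory available to the GPU, you can offload only part of it by lowering --gpu-layers; for example (the value 20 below is arbitrary, chosen just for illustration):
# offload the first 20 layers to the GPU and keep the remaining layers on the CPU
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 20 --model ~/models/wizardlm2.gguf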
Make sure you have your LLM file in *.gguf format downloaded before starting up the server.
Tested on Ubuntu with dual AMD RX 7900 XTX GPUs. Full setup notes are documented at https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/README.md
Compile llama.cpp from source:
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1100 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')
To start up the server, e.g.
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 999 --model ~/models/wizardlm2.gguf
Make sure you have your LLM file in *.gguf format downloaded before starting up the server.
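With two GPUs installed, you may also want to control which devices llama.cpp uses; ROCm honours the HIP_VISIBLE_DEVICES variable for this (a sketch, assuming the cards are devices 0 and 1):
# restrict the server to the first GPU only
HIP_VISIBLE_DEVICES=0 ~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --gpu-layers 999 --model ~/models/wizardlm2.gguf

# expose both GPUs (equivalent to leaving the variable unset)
HIP_VISIBLE_DEVICES=0,1 ~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --gpu-layers 999 --model ~/models/wizardlm2.gguf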
To use NVIDIA GPU acceleration with CUDA, compile llama.cpp from source:
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_CUDA=1 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')
To start up the server, e.g.
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 999 --model ~/models/wizardlm2.gguf
Make sure you have your LLM file in *.gguf format downloaded before starting up the server.
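Before starting the server on an NVIDIA machine, it can be worth confirming that the driver sees your GPU; you can also pin the server to a specific device with CUDA_VISIBLE_DEVICES (a sketch; device index 0 is an example):
# list NVIDIA GPUs, driver version and current VRAM usage
nvidia-smi

# run the server on the first GPU only
CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --gpu-layers 999 --model ~/models/wizardlm2.gguf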
To configure ToolMate AI:
- Enter '.model' in ToolMate AI prompt.
- Select 'Llama.cpp Server [advanced]'.
- Follow the dialogs to enter your Llama.cpp server IP addresses and ports.
For more information about building llama.cpp, read https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build