-
I have a laptop Intel Core i5 with 4 physical cores, and running 13B q4_0 gives me approximately 2.3 tokens per second. I've heard that q4_1 is more precise but about 50% slower, though that doesn't explain 2-10 seconds per word. What processor features does your compiled binary report? Could it be that you compiled it without AVX or something? Mine was compiled with w64devkit; I had problems with Cygwin's g++ producing a broken binary that outputs complete garbage.
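One way to check is the `system_info` line that llama.cpp logs on startup, which lists the SIMD features the binary was built with. A sketch (the binary name and model path below are placeholders; adjust them to your setup):

```shell
# llama.cpp prints a line like the following when it starts:
#   system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | ...
# If AVX / AVX2 show 0 here, the binary was compiled without them.
./main -m models/13B/ggml-model-q4_0.bin -p "test" -n 1 2>&1 | grep system_info
```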
-
If you see a lot of disk I/O while generation is running, try setting --mlock to force the OS to keep the model in memory rather than paging it out. You might be out of luck, however: I don't know if the native Windows version supports mlock. Running under WSL might be an option. Setting --threads to half the number of cores you have might also help performance.

For dealing with repetition, try setting these options: --ctx_size 2048 --repeat_last_n 2048 --keep -1. 2048 tokens is the maximum context size these models are designed to support, so this uses the full size and checks for repetitions over the entire context. Setting --keep to -1 keeps the entire initial prompt when space is freed up after the context fills.
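Putting those suggestions together, an invocation might look like the sketch below. The binary name, model path, and thread count are placeholders for your setup, and --mlock may be unsupported on native Windows builds:

```shell
# --mlock:          pin the model in RAM so the OS can't page it out
# --threads 8:      roughly half the core count is a common starting point
# --ctx_size 2048:  use the full context these models are designed for
# --repeat_last_n 2048: scan the entire context for repetition
# --keep -1:        retain the whole initial prompt when the context fills up
./main -m models/13B/ggml-model-q4_0.bin --mlock --threads 8 \
  --ctx_size 2048 --repeat_last_n 2048 --keep -1 -p "Your prompt here"
```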
-
Thanks for the help.
Solution.
I noticed in the arguments that it was only using 4 threads out of 20. I increased it with something like -t 20, and it seems to be faster.
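On Linux or WSL, one way to avoid hard-coding the count is to pass the logical core count reported by `nproc` (a sketch; the binary and model path are placeholders):

```shell
# nproc (coreutils) prints the number of logical processors available,
# so -t automatically matches the machine it runs on.
./main -m models/13B/ggml-model-q4_0.bin -t "$(nproc)" -p "Your prompt here"
```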