-
I have a laptop Intel Core i5 with 4 physical cores, and running 13B q4_0 gives me approximately 2.3 tokens per second. I've heard that q4_1 is more precise but about 50% slower, though that doesn't explain 2-10 seconds per word. What processor features does your compiled binary report? Could it be that you compiled it without AVX or something? Mine was compiled with w64devkit; I had problems with Cygwin's g++ producing a broken binary that outputs complete garbage.
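One way to check is the `system_info` line that llama.cpp logs on startup, which lists the SIMD features the binary was built with. A sketch (the binary name and model path below are placeholders; adjust them to your setup):

```shell
# llama.cpp prints a line like the following when it starts:
#   system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | ...
# If AVX / AVX2 show 0 here, the binary was compiled without them.
./main -m models/13B/ggml-model-q4_0.bin -p "test" -n 1 2>&1 | grep system_info
```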
-
If you see a lot of disk I/O while generation is running, try setting --mlock to force the OS to keep the model in memory rather than paging it out. You might be out of luck, however: I don't know if the native Windows version supports mlock. Running under WSL might be an option. Setting --threads to half the number of cores you have might also help performance.

For dealing with repetition, try setting these options: --ctx_size 2048 --repeat_last_n 2048 --keep -1. 2048 tokens is the maximum context size these models are designed to support, so this uses the full size and checks for repetitions over the entire context. Setting --keep to -1 keeps the entire initial prompt when space is freed up after the context fills.
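Putting those suggestions together, an invocation might look like the sketch below. The binary name, model path, and thread count are placeholders for your setup, and --mlock may be unsupported on native Windows builds:

```shell
# --mlock:          pin the model in RAM so the OS can't page it out
# --threads 8:      roughly half the core count is a common starting point
# --ctx_size 2048:  use the full context these models are designed for
# --repeat_last_n 2048: scan the entire context for repetition
# --keep -1:        retain the whole initial prompt when the context fills up
./main -m models/13B/ggml-model-q4_0.bin --mlock --threads 8 \
  --ctx_size 2048 --repeat_last_n 2048 --keep -1 -p "Your prompt here"
```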
-
Thanks for the help.
Solution.
I noticed in the arguments that it was only using 4 threads out of 20. I increased it with something like -t 20, and it seems to be faster.
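On Linux or WSL, one way to avoid hard-coding the count is to pass the logical core count reported by `nproc` (a sketch; the binary and model path are placeholders):

```shell
# nproc (coreutils) prints the number of logical processors available,
# so -t automatically matches the machine it runs on.
./main -m models/13B/ggml-model-q4_0.bin -t "$(nproc)" -p "Your prompt here"
```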