I guess we can extend ggml so that the work chunk distribution method can be chosen, either at compile time or via a context parameter. We can factor the range selection out of the ggml forward implementations to make them more concise and easier to extend in the future.
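To make the idea concrete, here is a rough sketch of what a factored-out range selection could look like. All names here (`chunk_method`, `chunk_range`, `chunk_range_select`) are hypothetical and not part of the ggml API; the two methods shown (one contiguous block per thread, or rows interleaved across threads) are just illustrative assumptions.

```c
#include <stddef.h>

// Hypothetical: how rows are distributed across threads (not a ggml type).
enum chunk_method {
    CHUNK_CONTIGUOUS,  // thread ith gets one block of ceil(nr / nth) rows
    CHUNK_INTERLEAVED  // thread ith gets rows ith, ith + nth, ith + 2*nth, ...
};

// Hypothetical: rows to process are begin, begin + step, ... while < end.
struct chunk_range {
    size_t begin;
    size_t end;
    size_t step;
};

// Select the rows of an nr-row operation for thread ith out of nth threads.
static struct chunk_range chunk_range_select(enum chunk_method method,
                                             size_t nr, size_t ith, size_t nth) {
    struct chunk_range r;
    if (method == CHUNK_INTERLEAVED) {
        r.begin = ith < nr ? ith : nr;  // clamp so surplus threads get no rows
        r.end   = nr;
        r.step  = nth;
    } else {
        size_t per = (nr + nth - 1) / nth;  // rows per thread, rounded up
        r.begin = per * ith;
        r.end   = r.begin + per;
        if (r.begin > nr) r.begin = nr;     // clamp the last (possibly short) block
        if (r.end   > nr) r.end   = nr;
        r.step  = 1;
    }
    return r;
}
```

A forward implementation would then only loop `for (size_t i = r.begin; i < r.end; i += r.step)`, and the method could come from a compile-time define or a context field.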
Another thing to be investigated is the usage of sched_yield() and potentially making it user configurable:
Making this configurable would also be nice for the cuBLAS backend. When the whole model fits on the GPU, increasing the number of threads doesn't improve eval speed (tokens/sec).
But it does increase the CPU load on the system due to the busy loop. Even with n_thread = 1, I suspect that a lot of CPU cycles are wasted in the busy loop.
So a yield flag would be a great addition to give the user control.
A busy loop with a fallback to a yield might also be a good 'automatic' solution that could be used as the default.
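As a minimal sketch of that 'automatic' variant, assuming a POSIX system and C11 atomics: spin for a bounded number of iterations, then fall back to `sched_yield()`. The function name `wait_for_flag` and the `spin_limit` knob are hypothetical and not taken from ggml; this is not how ggml currently synchronizes its threads.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical helper: wait until *flag becomes true.
// Spins up to spin_limit times, then yields the CPU on every further check.
// spin_limit == 0 means "always yield"; a very large value approximates a pure busy loop.
// Returns how many times the thread yielded (useful for tuning).
static int wait_for_flag(atomic_bool *flag, int spin_limit) {
    int spins  = 0;
    int yields = 0;
    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        if (spins < spin_limit) {
            spins++;          // cheap busy-wait while the result is expected soon
        } else {
            sched_yield();    // give other runnable threads a chance
            yields++;
        }
    }
    return yields;
}
```

Exposing `spin_limit` to the user would cover both extremes: low latency on a dedicated machine, and low CPU load when the GPU does most of the work.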
See ggerganov/llama.cpp#1507
And comment: ggerganov/llama.cpp#1507 (comment)
Another thing to be investigated is the usage of sched_yield() and potentially making it user configurable: ggerganov/whisper.cpp@09a6325