I am trying to process a large prompt. I tuned the KV-cache quantization and offloaded as many layers as possible to the GPU; it starts processing and everything looks fine... and then, after a few hours, it fails with an OOM error.

`nvidia-smi` shows llama.cpp's GPU memory usage steadily creeping up. Is this expected behavior, and if so, how much reserve should I keep? It seems like about 10% is needed.

I searched for similar discussions, but in those the topic was an allocation failure before any prompt was submitted. This one happens in the middle of processing.
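For concreteness, here is a rough sketch of the kind of setup described, not the exact command used: the model path, layer count, and prompt file are placeholders, while `-ngl`, `-fa`, and `-ctk`/`-ctv` are llama.cpp's common options on recent builds (check `llama-cli --help` for yours).

```sh
# Sketch: offload layers to the GPU and quantize the KV cache.
# Quantizing the V cache requires flash attention (-fa).
./llama-cli -m model.gguf -ngl 40 -fa -ctk q8_0 -ctv q8_0 -f long_prompt.txt

# Watch llama.cpp's GPU memory usage over time (one sample every 5 seconds):
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv -l 5
```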
Replies: 3 comments 1 reply

- There should be no allocations after the first few evaluations. Please include a log that shows the OOM error.
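Capturing such a log is straightforward from a shell; a minimal sketch, reusing the placeholder flags from the question above (the same redirection works for `llama-server`):

```sh
# Capture both stdout and stderr to a file while still seeing them live;
# the failing allocation should appear near the end of the log.
./llama-cli -m model.gguf -ngl 40 -f long_prompt.txt 2>&1 | tee llama.log

# Pull out allocation/OOM-related lines afterwards:
grep -iE "out of memory|alloc" llama.log
```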
- Thanks for the answer. But if the static buffers are larger than the free memory, why doesn't it fail outright? It's such a waste of time. And I actually plan to use the slot KV cache elsewhere with CPU-only inference, so that's fine.
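For the CPU-side KV cache mentioned here, a minimal sketch, assuming a recent llama.cpp build (flag names may differ; check `--help` on your binaries): `--no-kv-offload` keeps the cache in host RAM even with weights on the GPU, and `llama-server`'s `--slot-save-path` persists per-slot KV caches for reuse.

```sh
# Keep the KV cache in host RAM while still offloading weights to the GPU
# (-nkvo / --no-kv-offload); drop -ngl entirely for fully CPU-only inference.
./llama-cli -m model.gguf -ngl 40 --no-kv-offload -f long_prompt.txt

# llama-server can save/restore per-slot KV caches under the given path:
./llama-server -m model.gguf --slot-save-path ./kv_slots/
```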