Demo video: Llama.Assistant.RAG.Demo.-.1080p.mp4
🔧 Changes
- Utilize llama.cpp's KV cache mechanism for faster inference. See (1).
- Summarize the chat history when it is about to exceed the context length (sketched below).
- Recursively check for and fill in missing settings from the DEFAULT CONFIG (sketched below).
- Add validators (type, min, and max value) for input fields in the settings dialog (sketched below).
(1) llama.cpp's KV cache checks the prefix of your chat history so it can reuse previously computed key/value vectors. For example:
- Generated sequence so far = "ABCDEF"
- If the chat history is modified to "ABCDXT", the matching prefix "ABCD" reuses the cache, only the key and value vectors for "XT" are newly computed, and then the new response is generated.
-> So we should make the most of this mechanism by keeping the history prefix as fixed as possible.
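
To make the prefix-reuse rule concrete, here is a toy illustration (tokens shown as single characters; the function name is made up for this example and is not from llama.cpp or this PR):

```python
# Illustration only: the number of tokens whose K/V vectors can be kept equals
# the length of the longest common prefix between the cached sequence and the
# new prompt. Names here are illustrative, not the actual implementation.

def common_prefix_len(cached: str, prompt: str) -> int:
    """Length of the longest shared prefix between two token sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

cached_sequence = "ABCDEF"   # tokens already evaluated (K/V vectors cached)
new_prompt      = "ABCDXT"   # modified chat history

reused = common_prefix_len(cached_sequence, new_prompt)
recomputed = len(new_prompt) - reused
print(f"reuse K/V for {reused} tokens ({new_prompt[:reused]!r}), "
      f"recompute {recomputed} tokens ({new_prompt[reused:]!r})")
# -> reuse K/V for 4 tokens ('ABCD'), recompute 2 tokens ('XT')
```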
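
A minimal sketch of the summarization idea, assuming a token budget and helper callables (`count_tokens`, `summarize`) that are illustrative placeholders, not the actual names in this PR:

```python
# Hypothetical sketch of "summarize when the history nears the context limit".
# CONTEXT_LEN, SUMMARY_TRIGGER, count_tokens and summarize are placeholders.

CONTEXT_LEN = 4096          # model context window (tokens), assumed value
SUMMARY_TRIGGER = 0.8       # summarize when history uses >80% of the window

def maybe_summarize(history: list[dict], count_tokens, summarize) -> list[dict]:
    """Replace older messages with a summary once the history nears the limit."""
    if len(history) <= 5:
        return history  # too short to fold anything into a summary
    used = sum(count_tokens(m["content"]) for m in history)
    if used < SUMMARY_TRIGGER * CONTEXT_LEN:
        return history  # still enough room, keep the prefix untouched

    # Keep the system prompt and the most recent turns; fold the middle of the
    # conversation into a single summary message.
    head, recent = history[:1], history[-4:]
    summary = summarize(history[1:-4])
    return head + [{"role": "system", "content": f"Summary so far: {summary}"}] + recent
```

Note that summarizing rewrites the history prefix, so the next request presumably rebuilds most of the KV cache; since it should only trigger occasionally, the prefix stays fixed between summarizations.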
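
A sketch of the recursive default-config merge; the DEFAULT_CONFIG layout shown here is made up for illustration and is not the project's actual schema:

```python
import copy

# Illustrative defaults, not the real settings of this project.
DEFAULT_CONFIG = {
    "model": {"path": "", "n_ctx": 4096},
    "ui": {"theme": "dark", "font_size": 12},
}

def fill_missing(config: dict, defaults: dict) -> dict:
    """Recursively add every key that exists in `defaults` but is missing in `config`."""
    for key, default_value in defaults.items():
        if key not in config:
            config[key] = copy.deepcopy(default_value)
        elif isinstance(default_value, dict) and isinstance(config[key], dict):
            fill_missing(config[key], default_value)
    return config

user_config = {"ui": {"theme": "light"}}
print(fill_missing(user_config, DEFAULT_CONFIG))
# {'ui': {'theme': 'light', 'font_size': 12}, 'model': {'path': '', 'n_ctx': 4096}}
```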
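
And a toolkit-agnostic sketch of the type/min/max validation; the field specs are invented for the example (a Qt-based dialog would more likely use the framework's own validators):

```python
# Hypothetical field specs for illustration only.
FIELD_SPECS = {
    "font_size": {"type": int, "min": 8, "max": 32},
    "temperature": {"type": float, "min": 0.0, "max": 2.0},
}

def validate(name: str, raw: str):
    """Parse `raw` with the field's type and check it against the min/max bounds."""
    spec = FIELD_SPECS[name]
    try:
        value = spec["type"](raw)
    except ValueError:
        raise ValueError(f"{name}: expected {spec['type'].__name__}, got {raw!r}")
    if not (spec["min"] <= value <= spec["max"]):
        raise ValueError(f"{name}: must be between {spec['min']} and {spec['max']}")
    return value

print(validate("font_size", "14"))      # 14
print(validate("temperature", "0.7"))   # 0.7
```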