
(Optimization of LLM inference) Does Intel OpenVINO support offloading LLM models, allowing some layers to remain on the SSD while loading the main layers into RAM during inference computation? #2533

Open
hsulin0806 opened this issue Nov 19, 2024 · 2 comments

Comments

@hsulin0806

Functional discussion for this project: notebooks/llm-chatbot

Intel's official documentation confirms support for Ollama: https://www.intel.com.tw/content/www/tw/zh/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html

Ollama's GitHub documentation (https://github.com/ollama/ollama/blob/main/docs/faq.md) describes the following load states:

100% GPU: The model is fully loaded into the GPU.
100% CPU: The model is fully loaded into system memory.
48%/52% CPU/GPU: The model is split between the GPU and system memory.
Ollama is powered by llama.cpp, which supports the --gpu-layers parameter to distribute model layers between VRAM and RAM, reducing GPU memory pressure.
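
For context, a minimal sketch of this layer split driven from Python via the llama-cpp-python binding (the binding, model path, and layer count are my own illustration, not something stated in this issue):

```python
# Minimal sketch using the llama-cpp-python binding (assumed installed);
# the model path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=20,  # offload 20 layers to VRAM, keep the rest in system RAM
    n_ctx=2048,
)

out = llm("Q: What is heterogeneous inference? A:", max_tokens=64)
print(out["choices"][0]["text"])
```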

However, when the CPU handles inference, the model is loaded entirely into RAM. Would it be possible for OpenVINO to introduce a parameter or feature that offloads some model layers to SSD storage, using it as temporary backing storage? This would reduce RAM usage and offer a more efficient way to handle resource-limited scenarios.
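
To illustrate the kind of behavior being requested (a conceptual sketch only, not an existing OpenVINO feature): weights stay in a file on the SSD and are memory-mapped, so the OS pages in only the data actually touched during a forward pass. The file name, shape, and dtype below are made up:

```python
# Toy illustration of SSD-backed weights via memory mapping; not an OpenVINO API.
# The file name, shape, and dtype are invented for the example.
import numpy as np

layer_shape = (4096, 4096)
# Weights exported beforehand as a raw float16 blob on the SSD.
weights_on_ssd = np.memmap("layer_00.weights.bin", dtype=np.float16,
                           mode="r", shape=layer_shape)

x = np.random.rand(1, 4096).astype(np.float16)

# The OS pages in this layer's weights on demand; layers that are never
# accessed stay on the SSD and never occupy RAM.
y = x @ weights_on_ssd
print(y.shape)
```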

@brmarkus

Besides CPU, GPU, NPU, (VPU, FPGA), AUTO and MULTI, have you tried to experiment with HETERO (see "https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html")?
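
For reference, HETERO is selected by prefixing the device list at compile time; a minimal sketch with the OpenVINO Python API (the IR file name is a placeholder):

```python
# Minimal sketch of HETERO execution with the OpenVINO Python API;
# "model.xml" is a placeholder IR file.
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")

# Operations are assigned to GPU first; anything it cannot handle
# falls back to CPU, the next device in the list.
compiled = core.compile_model(model, "HETERO:GPU,CPU")
```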

@hsulin0806 (Author)

> Besides CPU, GPU, NPU, (VPU, FPGA), AUTO and MULTI, have you tried to experiment with HETERO (see "https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html")?

Hi, thank you for your response.

Does HETERO mode allow model data held in RAM to be cached on an SSD, to reduce RAM usage? If this functionality is not available, are there any development plans to enable such SSD caching?
