如何避免 VLM OOM的问题？ #2887

chenzhengda · 2024-12-12T08:17:20Z

在使用lmdeploy 推理 VLM (比如Qwen2VL)的时候经常会遇到OOM问题，主要是因为图片数量和分辨率导致的，有什么办法可以避免这个问题？

lvhan028 · 2024-12-13T04:02:17Z

@grimoire，这种情况在 PR #2810 有所缓解么？

irexyc · 2024-12-13T05:43:20Z

https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/multi_modal/qwen2_vl.md 这里面有写如何控制图片分辨率。

另外 Qwen2VL 这个模型之前我跑的时候发现，pytorch backend运行时 LLM部分显存增长会比较多。

chenzhengda · 2024-12-13T07:57:01Z

https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/multi_modal/qwen2_vl.md 这里面有写如何控制图片分辨率。

另外 Qwen2VL 这个模型之前我跑的时候发现，pytorch backend运行时 LLM部分显存增长会比较多。

想请教一下除了手动控制分辨率还有其他办法解决这个问题吗？因为很难保证用户的行为并且一旦OOM很可能导致服务崩溃

irexyc · 2024-12-13T08:00:57Z

@chenzhengda

vision 部分控制了分辨率，还有哪些用户行为呢？

chenzhengda · 2024-12-13T08:04:44Z

@chenzhengda

vision 部分控制了分辨率，还有哪些用户行为呢？

比如用户一次性传了很多张图，或者同时有多租户的并行请求(我不是很确定在lmdeploy中不同请求做vision encoder是不是并行的，有没有可能导致OOM)

irexyc · 2024-12-13T08:10:23Z

有个参数能控制同时处理的图片数量 https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/messages.py#L378

默认的话就是一张图一张图的进行处理的，控制了分辨率，就控制vision部分的显存。

LLM 部分你设置下session_len，然后用最大的session_len 和 batch 提前跑跑看吧，这个模型我之前看运行时显存增长比刚加载完变化挺大的。

chenzhengda · 2024-12-13T08:12:30Z

有个参数能控制同时处理的图片数量 https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/messages.py#L378

默认的话就是一张图一张图的进行处理的，控制了分辨率，就控制vision部分的显存。

LLM 部分你设置下session_len，然后用最大的session_len 和 batch 提前跑跑看吧，这个模型我之前看运行时显存增长比刚加载完变化挺大的。

感谢🙏

vladrad · 2024-12-13T20:27:45Z

Hello! I believe I have the same issue as well. I get an OOM on my 4090 RTX after 1-3 requests.

I tried:
-Set different quants 0,4, 8 and they all OOM for me.
-I tried to set different context sizes
-I tried Turbomind vs PyTorch.

Is it reasonable for me to assume that a 4090 should be ok with running one image at a time with this command without OOM given my images are 1000x1000:

docker run --runtime nvidia --gpus 0     -v ~/.cache/huggingface:/root/.cache/huggingface     --env "HUGGING_FACE_HUB_TOKEN="     -p 23333:23333 --ipc=host openmmlab/lmdeploy:latest     lmdeploy serve api_server openbmb/MiniCPM-V-2_6 --model-name MiniCPM-V-2_6-Vison

chenzhengda · 2024-12-20T08:14:52Z

@irexyc 发现另外一个问题，好像目前vision encoder部分没有默认使用flash-attn，因此推理需要的显存会比较大，您能帮忙确认一下吗？

lvhan028 assigned irexyc Dec 13, 2024

lvhan028 added the awaiting response label Dec 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

如何避免 VLM OOM的问题？ #2887

如何避免 VLM OOM的问题？ #2887

chenzhengda commented Dec 12, 2024

lvhan028 commented Dec 13, 2024

irexyc commented Dec 13, 2024

chenzhengda commented Dec 13, 2024

irexyc commented Dec 13, 2024

chenzhengda commented Dec 13, 2024

irexyc commented Dec 13, 2024

chenzhengda commented Dec 13, 2024

vladrad commented Dec 13, 2024

chenzhengda commented Dec 20, 2024

如何避免 VLM OOM的问题？ #2887

如何避免 VLM OOM的问题？ #2887

Comments

chenzhengda commented Dec 12, 2024

lvhan028 commented Dec 13, 2024

irexyc commented Dec 13, 2024

chenzhengda commented Dec 13, 2024

irexyc commented Dec 13, 2024

chenzhengda commented Dec 13, 2024

irexyc commented Dec 13, 2024

chenzhengda commented Dec 13, 2024

vladrad commented Dec 13, 2024

chenzhengda commented Dec 20, 2024