[Performance]: vLLM 0.6.5 with GLM4-9B-Chat and dynamically loaded LoRA: inference performance drops significantly for long inputs #11317
Comments
Let me first try to reproduce your results
Please try increasing max_seq_len_to_capture
OK, I will give it a try, thanks~
If this can address your issue, please let me know, thanks!
Learned something new.
Good to know~
This might be a bug. I will investigate further and try to resolve it.
Are you using the default scheduler or the chunked_prefill scheduler? The default max_seq_len_to_capture is 8192, and cuda graph also takes up additional GPU memory, so I don't think there is any need to increase it. New versions should use chunked_prefill by default, but it's better to check. Please try the configuration below: enable_chunked_prefill=True, max_num_seqs=32, max_num_batched_tokens=2048
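For concreteness, here is a minimal sketch of those engine arguments using the offline LLM API (the OpenAI-compatible server exposes the same settings as --enable-chunked-prefill, --max-num-seqs and --max-num-batched-tokens); the model path is copied from the report and the prompt is a placeholder:

# Sketch: pass the suggested chunked-prefill settings as engine arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/Work/....../glm-4-9b-chat/",  # path from the report
    trust_remote_code=True,
    enable_lora=True,
    max_lora_rank=64,
    enable_chunked_prefill=True,    # split long prompts into prefill chunks
    max_num_seqs=32,                # cap concurrent sequences per step
    max_num_batched_tokens=2048,    # cap tokens scheduled per step
)

out = llm.generate(["Hello"], SamplingParams(temperature=0, max_tokens=32))
print(out[0].outputs[0].text)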
We implement Multi-LoRA based on Triton, and Triton has significant runtime overhead, so without cuda graph it severely impacts decode performance; the suggestion above is an example of this. Only V1 uses chunked_prefill by default; the current release version still uses the default scheduler.
So I hope @zh19980310 can test the chunked prefill scheduler.
(I seem to have seen a PR along the lines of "Turn on chunked prefill by default".) OK, so that is just for V1.
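To get a rough sense of how much the Triton launch overhead costs without CUDA graphs, one could compare decode throughput with enforce_eager on and off. A hedged sketch (the model and adapter paths are from the report; the script itself is illustrative and not part of the thread):

# Run this twice, e.g. `python bench_lora.py eager` and `python bench_lora.py graph`,
# so only one engine occupies the GPU at a time.
import sys, time
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

enforce_eager = len(sys.argv) > 1 and sys.argv[1] == "eager"  # "eager" disables CUDA graphs

llm = LLM(
    model="/Work/....../glm-4-9b-chat/",  # model path from the report
    trust_remote_code=True,
    enable_lora=True,
    max_lora_rank=64,
    enforce_eager=enforce_eager,
)
lora = LoRARequest("summary", 1, "/Work/....../sft_1218/")  # adapter from the report
params = SamplingParams(temperature=0, max_tokens=256)

start = time.perf_counter()
out = llm.generate(["Please introduce yourself."], params, lora_request=lora)
elapsed = time.perf_counter() - start
tokens = len(out[0].outputs[0].token_ids)
print(f"enforce_eager={enforce_eager}: {tokens / elapsed:.1f} tok/s")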
I tested with the parameters you provided:
enable_chunked_prefill=True, max_num_seqs=32, max_num_batched_tokens=2048
This configuration should use cuda graph. Can't chunked_prefill + lora + cuda graph be used together?
Proposal to improve performance
No response
Report of performance regression
A800, single GPU, one request at a time
vLLM 0.6.5 without LoRA
(1) Launch:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --trust-remote-code
(2) Request:
# client is an OpenAI-compatible client pointed at the vLLM server;
# the base_url/api_key below are assumptions, not from the original report
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model='/Work/....../glm-4-9b-chat/',
    messages=messages,  # test messages with inputs of varying length
    n=1,
    temperature=0,
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
    max_tokens=2048,
    stream=True)
vLLM 0.6.5 with dynamically loaded LoRA
[The LoRA model was trained with the llama_factory framework]
(1) Launch:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64
(2) Request:
# same client as above, only the model name changes to the LoRA adapter
response = client.chat.completions.create(
    model='summary',  # adapter name registered via --lora-modules
    messages=messages,
    n=1,
    temperature=0,
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
    max_tokens=2048,
    stream=True)
I measured inference speed with inputs of different lengths in messages, under each of the setups above:
With LoRA loaded, inference speed drops a lot compared to running without LoRA when the input text is long, but only slightly when the input text is short.
What is causing this, and how should I fix it? Thanks~
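For reference, a rough sketch of how this comparison could be scripted against the same server (hypothetical and not part of the original report; the base_url and the placeholder inputs are assumptions):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed server address

def tokens_per_second(model: str, text: str) -> float:
    # Stream one request and count content chunks (roughly one per generated
    # token) per second of wall-clock time.
    start = time.perf_counter()
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text}],
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=2048,
        stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1
    return chunks / (time.perf_counter() - start)

short_text = "Please summarize: hello world."             # placeholder short input
long_text = "Please summarize: " + "lorem ipsum " * 4000   # placeholder long input

for model in ['/Work/....../glm-4-9b-chat/', 'summary']:   # base model vs LoRA adapter
    for name, text in [("short", short_text), ("long", long_text)]:
        print(model, name, f"{tokens_per_second(model, text):.1f} tok/s")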
Misc discussion on performance
No response
Your current environment (if you think it is necessary)