
[Performance]: vLLM 0.6.5 with GLM4-9B-Chat and dynamically loaded LoRA: inference performance drops significantly for long inputs #11317

Open
zh19980310 opened this issue Dec 19, 2024 · 13 comments
Labels: performance (Performance-related issues)


@zh19980310

Proposal to improve performance

No response

Report of performance regression

A800, single GPU, handling one request at a time

  1. vLLM 0.6.5 without LoRA
    (1) Launch:
    CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --trust-remote-code
    (2) Request:
    # Assumed client setup (not in the original report): an OpenAI-compatible
    # client pointed at the vLLM server started in step (1).
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model='/Work/....../glm-4-9b-chat/',
        messages=messages,  # messages: the chat input under test
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=2048,
        stream=True)

  2. vLLM 0.6.5 with a dynamically loaded LoRA
    [The LoRA model was trained with the llama_factory framework]
    (1) Launch:
    CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64
    (2) Request:
    # Same assumed client setup as in case 1; only the model name changes
    # to the LoRA module registered at launch.
    response = client.chat.completions.create(
        model='summary',
        messages=messages,
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=2048,
        stream=True)

Inference speed was measured for each setup with input texts of different lengths in messages:
[Screenshot: inference-speed measurements for the cases above]
With the LoRA loaded, inference speed drops substantially compared with the no-LoRA case for long inputs, while the drop is small for short inputs.
What is causing this, and how should I fix it? Thanks!

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
zh19980310 added the performance (Performance-related issues) label on Dec 19, 2024
@jeejeelee
Collaborator

Let me first try to reproduce your results

@jeejeelee
Collaborator

Please try increasing --max-seq-len-to-capture. The root cause is that when the sequence length exceeds max-seq-len-to-capture, execution falls back to eager mode, which hurts LoRA performance.
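For example, a sketch of the LoRA launch command from the issue with the flag raised (the 16384 value is illustrative; pick one that covers your longest inputs):

CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64 --max-seq-len-to-capture 16384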

@zh19980310
Author

OK, I'll try it, thanks!

@jeejeelee
Collaborator

OK, I'll try it, thanks!

If this can address your issue, please let me know, thanks!

@zh19980310
Author

With the parameter you suggested, the inference-speed problem is resolved. Test results:
[Screenshot: inference-speed measurements after increasing --max-seq-len-to-capture]
Thank you very much!

@noooop
Contributor

noooop commented Dec 19, 2024

Learned something new.

@Candy555

Good to know!

@jeejeelee
Collaborator

This might be a bug. I will investigate further and try to resolve it.

@noooop
Contributor

noooop commented Dec 19, 2024

Are you using the default scheduler or the chunked_prefill scheduler?

The default max-seq-len-to-capture is 8192.

CUDA graphs also take up additional GPU memory, so I don't think there is any need to increase it.

New versions should use chunked_prefill by default, but it's better to check.

Please try the configuration below:

enable_chunked_prefill = True
max_num_seqs=32
max_num_batched_tokens=2048 <- 2048 tokens is generally enough to saturate the GPU
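
As server flags, this looks roughly like the following (a sketch reusing the elided paths from the issue):

CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64 --enable-chunked-prefill --max-num-seqs 32 --max-num-batched-tokens 2048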

@jeejeelee
Collaborator

Are you using the default scheduler or the chunked_prefill scheduler?

The default max-seq-len-to-capture is 8192.

CUDA graphs also take up additional GPU memory, so I don't think there is any need to increase it.

New versions should use chunked_prefill by default, but it's better to check.

Please try the configuration below:

enable_chunked_prefill = True
max_num_seqs=32
max_num_batched_tokens=2048 <- 2048 tokens is generally enough to saturate the GPU

We implement Multi-LoRA based on Triton, and Triton has significant runtime overhead, so without CUDA graph it severely impacts decode performance; the above is an example.

Only V1 uses chunked_prefill by default; the current release version still uses the default scheduler.

@noooop
Contributor

noooop commented Dec 19, 2024

We implement Multi-LoRA based on Triton, and Triton has significant runtime overhead, so without CUDA graph it severely impacts decode performance; the above is an example.

So I hope @zh19980310 will test the chunked_prefill scheduler.
It might give good performance without increasing max-seq-len-to-capture.

Only V1 uses chunked_prefill by default; the current release version still uses the default scheduler.

(I seem to have seen a PR along the lines of "Turn on chunked prefill by default"; OK, that was just for V1.)

@zh19980310
Author

Are you using the default scheduler or the chunked_prefill scheduler?

The default max-seq-len-to-capture is 8192.

CUDA graphs also take up additional GPU memory, so I don't think there is any need to increase it.

New versions should use chunked_prefill by default, but it's better to check.

Please try the configuration below:

enable_chunked_prefill = True
max_num_seqs=32
max_num_batched_tokens=2048 <- 2048 tokens is generally enough to saturate the GPU

I tested with the parameters you suggested:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64 --enable_chunked_prefill --max_num_seqs 32 --max_num_batched_tokens 2048
The inference speed is no different from the test results in the issue description.

@noooop
Contributor

noooop commented Dec 20, 2024

enable_chunked_prefill = True max_num_seqs=32 max_num_batched_tokens=2048

The inference speed is no different from the test results in the issue description.

@jeejeelee

This configuration should use CUDA graph.

Can't chunked_prefill + LoRA + CUDA graph be used together?
