
[Performance]: vLLM 0.6.5 with GLM4-9B-Chat and dynamically loaded LoRA: inference performance drops significantly for long inputs #11317

Open
zh19980310 opened this issue Dec 19, 2024 · 13 comments
Labels: performance (Performance-related issues)


@zh19980310

Proposal to improve performance

No response

Report of performance regression

A800, single GPU, handling one request at a time

  1. vLLM 0.6.5 without LoRA
    (1) Launch:
    CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --trust-remote-code
    (2) Request:
    # Assumed client setup (not in the original report): an OpenAI-compatible
    # client pointed at the vLLM server started in step (1).
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model='/Work/....../glm-4-9b-chat/',
        messages=messages,  # messages: the chat input under test
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=2048,
        stream=True)

  2. vLLM 0.6.5 with a dynamically loaded LoRA
    [The LoRA model was trained with the llama_factory framework]
    (1) Launch:
    CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64
    (2) Request:
    # Same assumed client setup as in case 1; only the model name changes
    # to the LoRA module registered at launch.
    response = client.chat.completions.create(
        model='summary',
        messages=messages,
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=2048,
        stream=True)

Inference speed was measured for each setup with input texts of different lengths in messages:
[Screenshot: inference-speed measurements for the cases above]
With the LoRA loaded, inference speed drops substantially compared with the no-LoRA case for long inputs, while the drop is small for short inputs.
What is causing this, and how should I fix it? Thanks!

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
zh19980310 added the performance (Performance-related issues) label on Dec 19, 2024
@jeejeelee
Collaborator

Let me first try to reproduce your results

@jeejeelee
Collaborator

Please try increasing --max-seq-len-to-capture. The root cause is that when the sequence length exceeds max-seq-len-to-capture, execution falls back to eager mode, which hurts LoRA performance.
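For example, a sketch of the LoRA launch command from the issue with the flag raised (the 16384 value is illustrative; pick one that covers your longest inputs):

CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64 --max-seq-len-to-capture 16384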

@zh19980310
Author

OK, I'll try it, thanks!

@jeejeelee
Collaborator

OK, I'll try it, thanks!

If this can address your issue, please let me know, thanks!

@zh19980310
Author

With the parameter you suggested, the inference-speed problem is resolved. Test results:
[Screenshot: inference-speed measurements after increasing --max-seq-len-to-capture]
Thank you very much!

@noooop
Contributor

noooop commented Dec 19, 2024

Learned something new.

@Candy555

Good to know!

@jeejeelee
Collaborator

This might be a bug. I will investigate further and try to resolve it.

@noooop
Contributor

noooop commented Dec 19, 2024

Are you using the default scheduler or the chunked_prefill scheduler?

The default max-seq-len-to-capture is 8192.

CUDA graphs also take up additional GPU memory, so I don't think there is any need to increase it.

New versions should use chunked_prefill by default, but it's better to check.

Please try the configuration below:

enable_chunked_prefill = True
max_num_seqs=32
max_num_batched_tokens=2048 <- 2048 tokens is generally enough to saturate the GPU
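
As server flags, this looks roughly like the following (a sketch reusing the elided paths from the issue):

CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64 --enable-chunked-prefill --max-num-seqs 32 --max-num-batched-tokens 2048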

@jeejeelee
Collaborator

Are you using the default scheduler or the chunked_prefill scheduler?

The default max-seq-len-to-capture is 8192.

CUDA graphs also take up additional GPU memory, so I don't think there is any need to increase it.

New versions should use chunked_prefill by default, but it's better to check.

Please try the configuration below:

enable_chunked_prefill = True
max_num_seqs=32
max_num_batched_tokens=2048 <- 2048 tokens is generally enough to saturate the GPU

We implement Multi-LoRA based on Triton, and Triton has significant runtime overhead, so without CUDA graph it severely impacts decode performance; the above is an example.

Only V1 uses chunked_prefill by default; the current release version still uses the default scheduler.

@noooop
Contributor

noooop commented Dec 19, 2024

We implement Multi-LoRA based on Triton, and Triton has significant runtime overhead, so without CUDA graph it severely impacts decode performance; the above is an example.

So I hope @zh19980310 will test the chunked_prefill scheduler.
It might give good performance without increasing max-seq-len-to-capture.

Only V1 uses chunked_prefill by default; the current release version still uses the default scheduler.

(I seem to have seen a PR along the lines of "Turn on chunked prefill by default"; OK, that was just for V1.)

@zh19980310
Author

Are you using the default scheduler or the chunked_prefill scheduler?

The default max-seq-len-to-capture is 8192.

CUDA graphs also take up additional GPU memory, so I don't think there is any need to increase it.

New versions should use chunked_prefill by default, but it's better to check.

Please try the configuration below:

enable_chunked_prefill = True
max_num_seqs=32
max_num_batched_tokens=2048 <- 2048 tokens is generally enough to saturate the GPU

I tested with the parameters you suggested:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64 --enable_chunked_prefill --max_num_seqs 32 --max_num_batched_tokens 2048
The inference speed is no different from the test results in the issue description.

@noooop
Contributor

noooop commented Dec 20, 2024

enable_chunked_prefill = True max_num_seqs=32 max_num_batched_tokens=2048

The inference speed is no different from the test results in the issue description.

@jeejeelee

This configuration should use CUDA graph.

Can't chunked_prefill + LoRA + CUDA graph be used together?
