[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled #10388
Conversation
Signed-off-by: imkero <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: imkero <[email protected]>
@DarkLight1337 I think that after #8346 was merged, vLLM is able to support chunked prefill for multi-modal models? See vllm/tests/models/decoder_only/audio_language/test_ultravox.py Lines 22 to 27 in 1d75472
and also Lines 288 to 328 in 1d75472
I am using a forked vLLM which adds chunked prefill and prefix caching for Qwen2-VL only, and I found a fault in MRotaryEmbedding that prevents Qwen2-VL's chunked prefill from working properly (this PR tries to fix it). It seems some other changes are still needed in Qwen2-VL's implementation to support chunked prefill, so this PR is a very early fix. Maybe I can help add chunked prefill to Qwen2-VL as well?
Oh, I confused it with speculative decoding, sorry. Chunked prefill is supported but not implemented for Qwen2-VL yet.
Feel free to open another PR for full support. Meanwhile we can fix M-RoPE using this PR.
The code looks reasonable, but let's see whether the tests can pass. Thanks for the detailed explanation!
Could you please retry this CI run? https://buildkite.com/vllm/fastcheck/builds/8107
Nice work!
…led (vllm-project#10388) Signed-off-by: imkero <[email protected]>
…led (vllm-project#10388) Signed-off-by: imkero <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]>
…led (vllm-project#10388) Signed-off-by: imkero <[email protected]> Signed-off-by: rickyx <[email protected]>
…led (vllm-project#10388) Signed-off-by: imkero <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]>
Fix MRotaryEmbedding's get_input_positions when chunked prefill is enabled. Currently it only slices the generated llm_positions at the left-hand side (forgetting the right-hand side). This PR adds the right-hand slice bound so that chunked prefill works correctly.
vllm/vllm/model_executor/layers/rotary_embedding.py
Lines 923 to 928 in 1d75472
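For illustration, here is a minimal sketch of the slicing change described above. It assumes the positions tensor has shape (3, prompt_len) and that the model runner passes the chunk boundaries as context_len and seq_len; the helper name below is hypothetical, not the actual vLLM function:

```python
import torch

def slice_mrope_positions(llm_positions: torch.Tensor,
                          context_len: int,
                          seq_len: int) -> torch.Tensor:
    """Cut the precomputed (3, prompt_len) M-RoPE positions down to the
    tokens scheduled in the current step.

    Sketch of the behavior before this PR: only the left bound was applied,
        llm_positions[:, context_len:]
    which returns all remaining positions even when chunked prefill only
    schedules tokens up to seq_len.

    Sketch of the fix: apply both bounds,
        llm_positions[:, context_len:seq_len]
    so the number of positions matches the number of scheduled tokens.
    """
    return llm_positions[:, context_len:seq_len]
```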
Explanation
To make it clearer, here is an example with the following configuration (a walk-through of the resulting chunk boundaries follows this list):
- assume a len=40 prompt
- enable_chunked_prefill=True, and max_num_batched_tokens=32
- add some logging in model_runner.py::ModelInputForGPUBuilder::build near
vllm/vllm/worker/model_runner.py
Lines 952 to 957 in 1d75472
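To make the failure mode concrete, below is a hypothetical walk-through of that configuration. The (context_len, seq_len) pairs are what a 40-token prompt split under max_num_batched_tokens=32 would be expected to produce; they are illustrative, not taken from the actual log output:

```python
import torch

prompt_len = 40
# Expected chunk boundaries for chunked prefill with max_num_batched_tokens=32:
#   chunk 1: tokens [0, 32)   -> context_len=0,  seq_len=32
#   chunk 2: tokens [32, 40)  -> context_len=32, seq_len=40
chunks = [(0, 32), (32, 40)]

# Dummy (3, prompt_len) M-RoPE positions for a text-only prompt.
llm_positions = torch.arange(prompt_len).unsqueeze(0).expand(3, -1)

for context_len, seq_len in chunks:
    old = llm_positions[:, context_len:]          # left-hand slice only (buggy)
    new = llm_positions[:, context_len:seq_len]   # both bounds (this PR)
    scheduled = seq_len - context_len
    print(f"chunk [{context_len}:{seq_len}] scheduled={scheduled} "
          f"old={old.shape[-1]} new={new.shape[-1]}")

# chunk 1: the old slicing returns 40 positions for only 32 scheduled tokens,
#          which mismatches the batched input and triggers the error below.
# chunk 2: the old slicing happens to return the right count only because it
#          is the final chunk.
```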
Result:
Related error log:
The error occurs near:
vllm/vllm/model_executor/layers/rotary_embedding.py
Lines 807 to 825 in 1d75472
About the test I added
- Qwen2-VL's M-RoPE only takes effect when there are multi-modal inputs, so an image is included in the test inputs.
- However, Qwen2-VL currently won't work properly when chunked prefill is enabled and multi-modal inputs are present (it assumes the input is never chunked):
vllm/vllm/model_executor/models/qwen2_vl.py
Lines 1229 to 1238 in 1d75472
- The test works around this in a hacky way: it provides a zero-length image to keep the model happy.
- With these requirements satisfied, the test can proceed (a rough sketch of the comparison it performs is shown below).
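For reference, a rough sketch of the kind of comparison such a test can perform; this is not the actual test code: the model name, prompt, and sampling parameters are placeholders, and the real test additionally feeds the image / zero-length-image inputs described above:

```python
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen2-VL-2B-Instruct"     # placeholder model name
PROMPT = "Describe the weather today."  # placeholder text prompt
params = SamplingParams(temperature=0.0, max_tokens=32)

# Baseline: regular (non-chunked) prefill.
baseline = LLM(model=MODEL).generate([PROMPT], params)

# Chunked prefill: the prompt is prefilled in several small steps, which
# exercises the context_len/seq_len slicing fixed by this PR.
chunked = LLM(model=MODEL,
              enable_chunked_prefill=True,
              max_num_batched_tokens=16).generate([PROMPT], params)

# With correct M-RoPE positions, both runs should produce identical text
# under greedy sampling.
assert baseline[0].outputs[0].text == chunked[0].outputs[0].text
```

In a real test the two engines would be created and torn down one at a time, so the model is not held on the GPU twice.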