[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled #10388

Merged · 2 commits merged into vllm-project:main on Nov 16, 2024

Conversation

@imkero (Contributor) commented Nov 16, 2024

Fix MRotaryEmbedding's get_input_positions when chunked prefill is enabled.

Currently it only slices the generated llm_positions on the left-hand side (the right-hand side is forgotten). This PR adds the right-hand slice bound so that chunked prefill works correctly. The current code:

llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
llm_positions = llm_positions[:, context_len:]
mrope_position_delta = (llm_positions.max() + 1 -
                        len(input_tokens)).item()
return llm_positions.tolist(), mrope_position_delta
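
For comparison, here is a minimal sketch of the corrected slicing. This is a sketch rather than the exact final code of this PR: the name of the right-hand bound (seq_len) and the reordering of the delta computation before the slice are assumptions; the delta has to come from the full positions so that trimming the right-hand side does not change it.

# Sketch of the fix (assumed variable name: seq_len = context_len + query_len).
llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
# Compute the delta from the full, unsliced positions so it is independent
# of how the prompt is chunked.
mrope_position_delta = (llm_positions.max() + 1 -
                        len(input_tokens)).item()
# Keep only the positions belonging to the current chunk: [context_len, seq_len).
llm_positions = llm_positions[:, context_len:seq_len]
return llm_positions.tolist(), mrope_position_delta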

Explanation

To make it more clear, here is an example with following configuration:

  • assume a prompt of length 40

  • enable_chunked_prefill=True, and max_num_batched_tokens=32

  • add some logging in model_runner.py::ModelInputForGPUBuilder::build near the following return (a hypothetical sketch of such a debug print follows this list):

    return self.model_input_cls(
        input_tokens=input_tokens_tensor,
        input_positions=input_positions_tensor,
        attn_metadata=attn_metadata,
        seq_lens=seq_lens,
        query_lens=query_lens,
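
A hypothetical sketch of such a debug print (not part of this PR), placed just before the return shown above inside ModelInputForGPUBuilder.build. Only variables visible in the excerpt are used, so context_lens and input_tokens_lens from the table below are omitted here.

# Hypothetical debug print; input_positions_tensor is assumed to hold the
# M-RoPE positions (shape [3, N]) when M-RoPE is in use.
print("query_lens:", query_lens)
print("seq_lens:", seq_lens)
print("mrope_input_positions:", input_positions_tensor.shape)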

Result:

step: 1st prefill chunk
  before this fix: context_lens: [0], query_lens: [32], seq_lens: [32], input_tokens_lens: [40], mrope_input_positions: torch.Size([3, 40])
  after this fix:  context_lens: [0], query_lens: [32], seq_lens: [32], input_tokens_lens: [40], mrope_input_positions: torch.Size([3, 32])

step: 2nd prefill chunk
  before this fix: broken in previous step
  after this fix:  context_lens: [32], query_lens: [8], seq_lens: [40], input_tokens_lens: [40], mrope_input_positions: torch.Size([3, 8])

step: 1st decode
  before this fix: broken in previous step
  after this fix:  context_lens: [40], query_lens: [1], seq_lens: [41], input_tokens_lens: [40], mrope_input_positions: torch.Size([3, 1])

Related error log:

RuntimeError: shape '[40, -1, 128]' is invalid for input of size 49152

The error occurs near the following code: num_tokens is taken from the oversized positions tensor (40 entries) while query only holds the 32 tokens of the current chunk, so the subsequent query.view(num_tokens, -1, self.head_size) fails:

num_tokens = positions.shape[-1]
cos_sin = self.cos_sin_cache[positions]
cos, sin = cos_sin.chunk(2, dim=-1)
if positions.ndim == 2:
    assert self.mrope_section

    cos = torch.cat([
        m[i]
        for i, m in enumerate(cos.split(self.mrope_section, dim=-1))
    ],
                    dim=-1)
    sin = torch.cat([
        m[i]
        for i, m in enumerate(sin.split(self.mrope_section, dim=-1))
    ],
                    dim=-1)

query_shape = query.shape
query = query.view(num_tokens, -1, self.head_size)
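
To make the shape mismatch concrete, here is a standalone reproduction using the numbers from the example above. The 12 heads × 128 head_size split is an illustrative assumption chosen so that 32 × 12 × 128 = 49152 matches the error log; it is not taken from the PR.

import torch

head_size = 128
num_heads = 12                      # illustrative assumption
num_chunk_tokens = 32               # tokens actually scheduled in the 1st chunk

# query holds only the 32 tokens of the first prefill chunk ...
query = torch.randn(num_chunk_tokens, num_heads * head_size)
# ... but the unsliced M-RoPE positions still cover all 40 prompt tokens.
positions = torch.zeros(3, 40, dtype=torch.long)

num_tokens = positions.shape[-1]    # 40
try:
    query.view(num_tokens, -1, head_size)
except RuntimeError as e:
    print(e)  # shape '[40, -1, 128]' is invalid for input of size 49152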

About the test I added

  1. Qwen2-VL's M-RoPE takes effect only when there are multi-modal inputs,
    so an image is included in the test inputs.

  2. However, Qwen2-VL currently does not work properly when chunked prefill is enabled and multi-modal inputs are present (it assumes the input is never chunked):

    def _merge_multimodal_embeddings(
        self,
        input_ids: torch.Tensor,
        inputs_embeds: torch.Tensor,
        multimodal_embeddings: torch.Tensor,
        placeholder_token_id: int,
    ) -> torch.Tensor:
        mask = (input_ids == placeholder_token_id)
        inputs_embeds[mask, :] = multimodal_embeddings
        return inputs_embeds

    Here the test uses a hacky workaround: it provides a zero-length image so that this merge step stays happy.

  3. With that in place, the test meets the requirements needed to proceed (a rough setup sketch follows this list):

    • chunked prefill enabled
    • M-RoPE exercised
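
For reference, here is a rough, hypothetical sketch of the kind of test setup described above. The model name, token limits, prompt template, and image are illustrative placeholders rather than the exact values used in this PR's test, and the zero-length-image workaround is elided.

from PIL import Image

from vllm import LLM, SamplingParams

# Hypothetical setup; values are illustrative, not taken from the PR's test.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=32,  # small limit so the prompt actually gets chunked
    max_num_seqs=2,
)

image = Image.new("RGB", (224, 224))  # placeholder image
outputs = llm.generate(
    {
        # Placeholder prompt; Qwen2-VL expects its vision tokens in the prompt.
        "prompt": "<|vision_start|><|image_pad|><|vision_end|>What is in this image?",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=8),
)
print(outputs[0].outputs[0].text)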


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

Signed-off-by: imkero <[email protected]>
@imkero imkero changed the title from "[Bugfix] M-RoPE position calculation when chunked prefill is enabled" to "[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled" on Nov 16, 2024
@DarkLight1337 (Member) commented Nov 16, 2024

@ywang96 I thought chunked prefill isn't supported for VLMs yet? Or is this just not tested properly?

@imkero (Contributor, Author) commented Nov 16, 2024

> @ywang96 I thought chunked prefill isn't supported for VLMs yet? Or is this just not tested properly?

@DarkLight1337 I think after #8346 was merged, vLLM seems to be able to support chunked prefill for multi-modal models? See

CHUNKED_PREFILL_KWARGS = {
    "enable_chunked_prefill": True,
    "max_num_seqs": 2,
    # Use a very small limit to exercise chunked prefill.
    "max_num_batched_tokens": 16
}

and also

@classmethod
def from_seq_group(
    cls, seq_group: "SequenceGroupMetadata", positions: range
) -> Tuple[Optional[MultiModalDataDict], Dict[str,
                                              "MultiModalPlaceholderMap"]]:
    """
    Returns the multi-modal items that intersect with the portion of a
    prompt (``seq_group``) represented by ``positions``, as well as a
    ``MultiModalPlaceholderMap`` that relates the multi-modal embedding
    vectors to their corresponding placeholders.

    Consider the following scenarios:

    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |.................................|

        images      = [A, B]
        src_ranges  = [(0, 4), (4, 8)]
        dest_ranges = [(0, 4), (5, 9)]

    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |  .....                          |

        images      = [A, B]
        src_ranges  = [(2, 4), (4, 6)]
        dest_ranges = [(0, 2), (3, 5)]

    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |     .........                   |

        images      = [B]
        src_ranges  = [(0, 4)]
        dest_ranges = [(0, 4)]

    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |          .......................|

        images      = []
        src_ranges  = []
        dest_ranges = []
    """

I am using a forked vLLM which adds chunked prefill and prefix caching for Qwen2-VL only, and I found a fault in MRotaryEmbedding that prevents Qwen2-VL's chunked prefill from working properly (this PR tries to fix it).

It seems some other changes are still needed in Qwen2-VL's implementation to fully support chunked prefill, so this PR is only a very early fix.

Maybe I can help add chunked prefill support to Qwen2-VL as well?

@DarkLight1337 (Member)

> (quote of imkero's reply above)

Oh, I confused it with speculative decoding, sorry. Chunked prefill is supported but not implemented for Qwen2-VL yet.

@DarkLight1337 (Member)

Feel free to open another PR for full support. Meanwhile we can fix M-RoPE using this PR.

@DarkLight1337 DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Nov 16, 2024
@DarkLight1337 (Member) left a comment

The code looks reasonable, but let's see whether the tests can pass. Thanks for the detailed explanation!

@imkero (Contributor, Author) commented Nov 16, 2024

> The code looks reasonable, but let's see whether the tests can pass. Thanks for the detailed explanation!

Could you please retry this CI run? https://buildkite.com/vllm/fastcheck/builds/8107
It failed because of a network problem.

requests.exceptions.ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)")

@DarkLight1337 (Member)

Nice work!

@DarkLight1337 DarkLight1337 merged commit 361c29e into vllm-project:main Nov 16, 2024
64 checks passed
coolkp pushed a commit to coolkp/vllm that referenced this pull request Nov 20, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
prashantgupta24 pushed a commit to opendatahub-io/vllm that referenced this pull request Dec 3, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024