[V1] Optimize block table transfer from CPU to GPU #11401

WoosukKwon · 2024-12-22T01:09:13Z

No description provided.

Signed-off-by: Woosuk Kwon <[email protected]>

github-actions · 2024-12-22T01:09:25Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀


youkaichao · 2024-12-23T05:24:03Z

csrc/prepare_inputs/copy_subranges.cu

+  int* d_matrix_tgt = matrix_tgt.data_ptr<int>();
+
+  // One thread block per row.
+  int blocks = n;


It seems this can easily oversubscribe the GPU's SMs, since one thread block is launched for every row.
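One common way to bound the launch size is a grid-stride loop over rows: launch a fixed number of blocks and let each block handle rows `block_id`, `block_id + grid`, and so on. The sketch below only simulates that row-to-block assignment in Python. The function name and `max_blocks` are hypothetical (the cap could be, for instance, a small multiple of the SM count); neither is taken from this PR.

```python
def rows_per_block(n_rows: int, max_blocks: int) -> list[list[int]]:
    """Simulate a grid-stride loop: block b handles rows b, b + grid, ..."""
    if max_blocks < 1:
        raise ValueError("max_blocks must be at least 1")
    if n_rows <= 0:
        return []
    grid = min(n_rows, max_blocks)  # never launch more blocks than rows
    return [list(range(b, n_rows, grid)) for b in range(grid)]


# Hypothetical sizes: 10 rows to update, at most 4 blocks in the grid.
assignment = rows_per_block(n_rows=10, max_blocks=4)
for block_id, rows in enumerate(assignment):
    print(f"block {block_id}: rows {rows}")
```

With this scheme the grid size no longer grows with the batch size, at the cost of each block looping over several rows.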

youkaichao · 2024-12-23T05:25:21Z

csrc/prepare_inputs/copy_subranges.cu

+  int length = matrix_diff[row_id * 2 + 1];
+  int end = start + length;
+  int thread_idx = threadIdx.x;
+  for (int i = start + thread_idx; i < end; i += blockDim.x) {


Most threads in the block would be idle. For decoding, for example, only one entry of a row in the block table changes, or none at all.
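To make the utilization concern concrete, the sketch below emulates the per-row loop from the snippet above in plain Python and counts how many threads of a block execute at least one iteration. The function name and the block size of 1024 are assumptions for illustration; only the loop shape comes from the quoted kernel code.

```python
def active_threads(start: int, length: int, block_dim: int) -> int:
    """Count threads that copy at least one element for a single row."""
    end = start + length
    active = 0
    for thread_idx in range(block_dim):
        # Mirrors: for (int i = start + thread_idx; i < end; i += blockDim.x)
        # A thread does work only if its first index is already below `end`.
        if start + thread_idx < end:
            active += 1
    return active


# Hypothetical row lengths: no change, a single decode append, longer ranges.
for length in (0, 1, 16, 2048):
    used = active_threads(start=5, length=length, block_dim=1024)
    print(f"length={length}: {used}/1024 threads active")
```

For a single appended entry, only one thread of the block has anything to copy.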

youkaichao · 2024-12-23T05:49:20Z

vllm/v1/worker/gpu_block_table.py

+            self.block_table_diff_np[row_idx, 0] = start
+            # Move-and-append is not allowed.
+            assert self.block_table_diff_np[row_idx, 1] == 0
+            self.block_table_diff_np[row_idx, 1] = num_blocks


For the non-UVA case, we still need to keep track of the max-block-table-length, so that apply_diff only needs to copy max-block-table-length columns.

WoosukKwon · 2024-12-23 (edited)

Good point. The problem is that the memcpy API requires the data to be in a contiguous memory region: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79

So when the block table tensor has the shape [batch_size, max_model_len] and we slice over the second dimension, we have to call the memcpy API batch_size times instead of once.
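The contiguity constraint can be checked on the host side with NumPy. In the sketch below, the array shape, the `max_used_cols` value, and the variable names are illustrative assumptions, not code from this PR. A slice over leading rows stays contiguous and could be sent with one copy, while a slice over columns does not.

```python
import numpy as np

batch_size, max_model_len = 4, 8
block_table = np.arange(batch_size * max_model_len, dtype=np.int32).reshape(
    batch_size, max_model_len)

max_used_cols = 3  # hypothetical: longest block table row currently in use

row_slice = block_table[:2]                 # first two rows, all columns
col_slice = block_table[:, :max_used_cols]  # all rows, first three columns

print(row_slice.flags["C_CONTIGUOUS"])  # rows are laid out back to back
print(col_slice.flags["C_CONTIGUOUS"])  # strided: gaps between the rows

# Each row of the column slice is contiguous on its own, so a plain
# contiguous copy would need one call per row.
num_copies = sum(1 for row in col_slice if row.flags["C_CONTIGUOUS"])
print(num_copies)
```

Here the number of contiguous pieces in the column slice equals `batch_size`, which matches the per-row memcpy count described above.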


wip

1aaced5

Signed-off-by: Woosuk Kwon <[email protected]>

mergify bot added the ci/build label Dec 22, 2024

WoosukKwon added 3 commits December 21, 2024 17:11

yapf

8a4180c

Signed-off-by: Woosuk Kwon <[email protected]>

Minor

03b1e6f

Signed-off-by: Woosuk Kwon <[email protected]>

Minor

0a669ee

Signed-off-by: Woosuk Kwon <[email protected]>

youkaichao reviewed Dec 23, 2024


WoosukKwon added 3 commits December 22, 2024 22:16

Use default

ee965c9

Signed-off-by: Woosuk Kwon <[email protected]>

Merge branch 'main' into v1-blocktable-opt

0420fb2

comments

3fdbd8e

Signed-off-by: Woosuk Kwon <[email protected]>
