Prefix Cache Aware Scheduling #1

Draft · wants to merge 13 commits into main
Conversation

rickyyx (Owner) commented Oct 30, 2024


FIX vllm-project#7883 (in V0)

Problem and Motivation

With the current implementation in main, there are at least two places where scheduling is suboptimal:

  1. When deciding whether a sequence can be allocated for prefill, the block manager does not take already-computed blocks into account.
  2. When deciding how many new tokens to charge against the scheduling budget, the scheduler does not take already-computed tokens into account.

This results in under-utilization of the KV cache and suboptimal scheduling decisions for a batch; the sketch below illustrates the budgeting half of the problem.

For more details, see vllm-project#7883.
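
To make the second point concrete, here is a minimal, self-contained sketch contrasting cache-oblivious and cache-aware token budgeting. The function names and numbers are illustrative, not vLLM's actual scheduler code:

```python
# Hypothetical sketch of the budgeting difference; these helpers are
# illustrative and are not vLLM's actual scheduler code.

def num_new_tokens_cache_oblivious(prompt_len: int) -> int:
    # Current behavior: every prompt token is charged against the
    # scheduling budget, even if its KV entries are already cached.
    return prompt_len

def num_new_tokens_cache_aware(prompt_len: int, num_cached_tokens: int) -> int:
    # This PR's behavior: only tokens that still need to be computed
    # are charged against the budget.
    return prompt_len - num_cached_tokens

# A 1,000-token prompt with a 700-token cached prefix:
print(num_new_tokens_cache_oblivious(1000))   # 1000 tokens charged
print(num_new_tokens_cache_aware(1000, 700))  # 300 tokens charged
```

With a fixed max_num_batched_tokens budget, the cache-aware accounting leaves room to batch more sequences per scheduling step.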

High Level Approach

This PR addresses the issue by:

  1. Make sequence allocation prefix-cache aware: when calculating how many blocks a prefill sequence needs, the block manager determines, given the to-be-prefilled tokens, the longest prefix that is already computed, and excludes the corresponding blocks, achieving higher KV cache utilization.
  2. Make scheduling prefix-cache aware: when deciding how many tokens to schedule, already-cached tokens are now taken into account, so only uncached tokens are charged against the scheduling budget. (Both points are sketched right after this list.)
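
A minimal sketch of both points, using assumed names and a fixed block size rather than vLLM's actual block-manager API:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def num_required_blocks(prompt_len: int, num_cached_blocks: int) -> int:
    # Point 1: exclude blocks whose contents are already computed.
    total_blocks = -(-prompt_len // BLOCK_SIZE)  # ceil(prompt_len / BLOCK_SIZE)
    return total_blocks - num_cached_blocks

def can_allocate(prompt_len: int, num_cached_blocks: int, num_free_blocks: int) -> bool:
    # A prefill fits if the *uncached* blocks fit in the free space.
    return num_required_blocks(prompt_len, num_cached_blocks) <= num_free_blocks

def num_tokens_to_schedule(prompt_len: int, num_cached_tokens: int) -> int:
    # Point 2: charge only uncached tokens against the budget.
    return prompt_len - num_cached_tokens

# A 512-token prompt, 20 of its 32 blocks already cached, 15 blocks free:
print(can_allocate(512, 20, 15))                     # True: only 12 new blocks needed
print(num_tokens_to_schedule(512, 20 * BLOCK_SIZE))  # 192
```

Without the cached-block exclusion, the same request would demand 32 free blocks and fail to allocate.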

At a high level, the major changes are:

  1. The block hash is no longer a computed property of a block but of the sequence. This way, block hashes are available before a sequence is allocated; when hashing is coupled with the block allocator, the hashes are hard to obtain before allocation happens. (See the hashing sketch after this list.)
  2. The scheduler computes how many of the new tokens are already cached by querying the block manager for a sequence's number of cached blocks, and counts only the uncached tokens toward scheduling decisions.
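
A minimal sketch of change 1 under assumed details (Python's built-in hash, a fixed block size): block hashes are derived purely from the sequence's token ids, chaining each block's hash with its predecessor's, so they can be computed before any block is allocated:

```python
from typing import List, Optional

BLOCK_SIZE = 16  # illustrative; the real block size is configurable

def sequence_block_hashes(token_ids: List[int]) -> List[int]:
    """Content-based hashes for each *full* block of a sequence.

    Chaining prev_hash into each block's hash makes the hash identify the
    entire prefix up to and including that block, so two sequences share a
    cached block only when they share the whole preceding prefix.
    """
    hashes: List[int] = []
    prev_hash: Optional[int] = None
    num_full_blocks = len(token_ids) // BLOCK_SIZE
    for i in range(num_full_blocks):
        block = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        prev_hash = hash((prev_hash, block))
        hashes.append(prev_hash)
    return hashes

# Two prompts sharing a 32-token prefix yield the same first two hashes,
# which is what lets the block manager detect the cached prefix up front.
a = sequence_block_hashes(list(range(48)))
b = sequence_block_hashes(list(range(32)) + [999] * 16)
print(a[:2] == b[:2])  # True
```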

Benchmark

With the PR

  • Almost always 90%+ KV cache utilization regardless of the prefix caching rate (with a correspondingly higher risk of preemption).
  • Throughput increases by as much as 20%-30% when prefix sharing is high (70%) on prefill-dominant workloads. See this document for more benchmarking details and analysis.
  • ITL and TTFT increase by as much as XXX.

It's worth noting that:

  1. The benefit introduced by this PR is less significant when decoding takes up a greater portion of the workload.
  2. One could potentially set max_num_batched_tokens high enough to account for prefix-cached tokens, if the prefix cache hit rate is known in advance (see the arithmetic sketch below).
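
For the second note, a back-of-the-envelope calculation (hypothetical helper, illustrative numbers): if the expected hit rate is h, inflating max_num_batched_tokens by 1 / (1 - h) makes the budget of uncached tokens match the intended compute budget:

```python
def inflated_token_budget(target_computed_tokens: int, expected_hit_rate: float) -> int:
    # With hit rate h, only a (1 - h) fraction of scheduled prompt tokens
    # actually needs to be computed, so inflate the budget accordingly.
    assert 0.0 <= expected_hit_rate < 1.0
    return int(target_computed_tokens / (1.0 - expected_hit_rate))

# Targeting 2,048 computed tokens per step at a 70% expected hit rate:
print(inflated_token_budget(2048, 0.70))  # 6826
```

This workaround only helps when the hit rate is stable and known in advance, which is exactly what the cache-aware scheduling in this PR avoids having to assume.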

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to the Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀
