Prefix Cache Aware Scheduling #1

Draft · wants to merge 13 commits into main
Conversation

rickyyx (Owner) commented Oct 30, 2024


FIX vllm-project#7883 (in V0)

Problem and Motivation

With the current implementation in main, there are at least two places where scheduling is suboptimal:

  1. When deciding whether a sequence can be allocated for prefill, the block manager does not take already-computed blocks into account.
  2. When deciding how many new tokens to charge against the scheduling budget, the scheduler does not take already-computed tokens into account.

This results in under-utilization of the KV cache and suboptimal scheduling decisions for a batch; the sketch below illustrates the budgeting half of the problem.

For more details, see vllm-project#7883.
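
To make the second point concrete, here is a minimal, self-contained sketch contrasting cache-oblivious and cache-aware token budgeting. The function names and numbers are illustrative, not vLLM's actual scheduler code:

```python
# Hypothetical sketch of the budgeting difference; these helpers are
# illustrative and are not vLLM's actual scheduler code.

def num_new_tokens_cache_oblivious(prompt_len: int) -> int:
    # Current behavior: every prompt token is charged against the
    # scheduling budget, even if its KV entries are already cached.
    return prompt_len

def num_new_tokens_cache_aware(prompt_len: int, num_cached_tokens: int) -> int:
    # This PR's behavior: only tokens that still need to be computed
    # are charged against the budget.
    return prompt_len - num_cached_tokens

# A 1,000-token prompt with a 700-token cached prefix:
print(num_new_tokens_cache_oblivious(1000))   # 1000 tokens charged
print(num_new_tokens_cache_aware(1000, 700))  # 300 tokens charged
```

With a fixed max_num_batched_tokens budget, the cache-aware accounting leaves room to batch more sequences per scheduling step.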

High Level Approach

This PR addresses the issue by:

  1. Make sequence allocation prefix-cache aware: when calculating how many blocks a prefill sequence needs, the block manager determines, given the to-be-prefilled tokens, the longest prefix that is already computed, and excludes the corresponding blocks, achieving higher KV cache utilization.
  2. Make scheduling prefix-cache aware: when deciding how many tokens to schedule, already-cached tokens are now taken into account, so only uncached tokens are charged against the scheduling budget. (Both points are sketched right after this list.)
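
A minimal sketch of both points, using assumed names and a fixed block size rather than vLLM's actual block-manager API:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def num_required_blocks(prompt_len: int, num_cached_blocks: int) -> int:
    # Point 1: exclude blocks whose contents are already computed.
    total_blocks = -(-prompt_len // BLOCK_SIZE)  # ceil(prompt_len / BLOCK_SIZE)
    return total_blocks - num_cached_blocks

def can_allocate(prompt_len: int, num_cached_blocks: int, num_free_blocks: int) -> bool:
    # A prefill fits if the *uncached* blocks fit in the free space.
    return num_required_blocks(prompt_len, num_cached_blocks) <= num_free_blocks

def num_tokens_to_schedule(prompt_len: int, num_cached_tokens: int) -> int:
    # Point 2: charge only uncached tokens against the budget.
    return prompt_len - num_cached_tokens

# A 512-token prompt, 20 of its 32 blocks already cached, 15 blocks free:
print(can_allocate(512, 20, 15))                     # True: only 12 new blocks needed
print(num_tokens_to_schedule(512, 20 * BLOCK_SIZE))  # 192
```

Without the cached-block exclusion, the same request would demand 32 free blocks and fail to allocate.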

At a high level, the major changes are:

  1. The block hash is no longer a computed property of a block but of the sequence. This way, block hashes are available before a sequence is allocated; when hashing is coupled with the block allocator, the hashes are hard to obtain before allocation happens. (See the hashing sketch after this list.)
  2. The scheduler computes how many of the new tokens are already cached by querying the block manager for a sequence's number of cached blocks, and counts only the uncached tokens toward scheduling decisions.
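
A minimal sketch of change 1 under assumed details (Python's built-in hash, a fixed block size): block hashes are derived purely from the sequence's token ids, chaining each block's hash with its predecessor's, so they can be computed before any block is allocated:

```python
from typing import List, Optional

BLOCK_SIZE = 16  # illustrative; the real block size is configurable

def sequence_block_hashes(token_ids: List[int]) -> List[int]:
    """Content-based hashes for each *full* block of a sequence.

    Chaining prev_hash into each block's hash makes the hash identify the
    entire prefix up to and including that block, so two sequences share a
    cached block only when they share the whole preceding prefix.
    """
    hashes: List[int] = []
    prev_hash: Optional[int] = None
    num_full_blocks = len(token_ids) // BLOCK_SIZE
    for i in range(num_full_blocks):
        block = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        prev_hash = hash((prev_hash, block))
        hashes.append(prev_hash)
    return hashes

# Two prompts sharing a 32-token prefix yield the same first two hashes,
# which is what lets the block manager detect the cached prefix up front.
a = sequence_block_hashes(list(range(48)))
b = sequence_block_hashes(list(range(32)) + [999] * 16)
print(a[:2] == b[:2])  # True
```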

Benchmark

With the PR

  • Almost always 90%+ KV cache utilization regardless of the prefix caching rate (with a correspondingly higher risk of preemption).
  • Throughput increases by as much as 20%-30% when prefix sharing is high (70%) on prefill-dominant workloads. See this document for more benchmarking details and analysis.
  • ITL and TTFT increase by as much as XXX.

It's worth noting that:

  1. The benefit introduced by this PR is less significant when decoding takes up a greater portion of the workload.
  2. One could potentially set max_num_batched_tokens high enough to account for prefix-cached tokens, if the prefix cache hit rate is known in advance (see the arithmetic sketch below).
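
For the second note, a back-of-the-envelope calculation (hypothetical helper, illustrative numbers): if the expected hit rate is h, inflating max_num_batched_tokens by 1 / (1 - h) makes the budget of uncached tokens match the intended compute budget:

```python
def inflated_token_budget(target_computed_tokens: int, expected_hit_rate: float) -> int:
    # With hit rate h, only a (1 - h) fraction of scheduled prompt tokens
    # actually needs to be computed, so inflate the budget accordingly.
    assert 0.0 <= expected_hit_rate < 1.0
    return int(target_computed_tokens / (1.0 - expected_hit_rate))

# Targeting 2,048 computed tokens per step at a 70% expected hit rate:
print(inflated_token_budget(2048, 0.70))  # 6826
```

This workaround only helps when the hit rate is stable and known in advance, which is exactly what the cache-aware scheduling in this PR avoids having to assume.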

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to the Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀
