
[Performance]: Prefix-caching aware scheduling #7883

Closed
comaniac opened this issue Aug 26, 2024 · 8 comments · May be fixed by rickyyx/vllm#1
Labels: help wanted, performance

Comments

@comaniac
Collaborator

comaniac commented Aug 26, 2024

Proposal to improve performance

The current execution flow with prefix caching is as follows:

  1. Scheduler takes the next prefill sequence:
    a. Calculate how many blocks it needs.
    b. Check whether the block manager has a sufficient number of free blocks.
    c. If so, determine the number of tokens to be prefilled in this batch (equal to the prompt length without chunked prefill, or at most the chunk size otherwise).
    d. Update the batch token budget by subtracting the tokens to be prefilled.
    e. Allocate all blocks (regardless of how many tokens will be prefilled in this batch).
    f. Match the allocated block IDs against the prefix cache, and list the cached ones in computed_block_nums.
  2. Prepare input:
    a. Get the number of tokens to prefill for this sequence in this batch.
    b. Set up input token IDs and positions.
    c. If computed_block_nums is not None, remove the cached tokens from the input tokens and adjust the input positions, query length, and context length accordingly.
  3. Execute the model.
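
To make the flow concrete, here is a minimal sketch with hypothetical names (schedule_prefill, num_prompt_blocks, get_cached_block_ids are illustrative stand-ins, not the actual vLLM scheduler code); note that the cached blocks only become known at Step 1.f, after allocation:

```python
# Illustrative sketch only; all names are hypothetical stand-ins.
def schedule_prefill(seq, block_manager, token_budget):
    # 1.a/1.b: the block check ignores how many blocks are already cached.
    if block_manager.num_free_blocks() < seq.num_prompt_blocks():
        return None  # cannot schedule yet

    # 1.c/1.d: the token budget is charged for the full (or chunked) prompt,
    # even if a prefix of it is already cached.
    num_new_tokens = min(seq.num_prompt_tokens(), token_budget.remaining())
    token_budget.subtract(num_new_tokens)

    # 1.e/1.f: only after allocation do we learn which blocks were cached.
    block_ids = block_manager.allocate(seq)
    computed_block_nums = block_manager.get_cached_block_ids(seq)
    return block_ids, computed_block_nums
```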

The inefficiencies are then:

  1. In Step 1.b, we currently don't consider prefix caching. Taking a sequence with a 16-block prompt as an example, the block manager currently must have 16 free blocks for it to be scheduled. However, if 12 of the 16 blocks are already cached, we actually only need 4 free blocks to schedule this sequence.
  2. In Step 1.d, we currently don't consider prefix caching either. Suppose the batched token budget is set to 2048 and we schedule 2 sequences with 1024 tokens each. If the first 512 prefix tokens of each sequence are already cached, the actual batch size is 1024 tokens instead of 2048.

The above inefficiencies come from the fact that we only learn which blocks are cached at Step 1.f. Thus, we propose the following changes:

  1. Improve can_allocate in the block manager at Step 1.b to consider prefix caching. For example, although a sequence needs 16 blocks for its prompt, can_allocate could still return True even if the block manager has only 4 free blocks left, because 12 of the 16 blocks are already cached.
  2. Improve Step 1.c in the scheduler to consider prefix caching. Specifically, this step should guarantee that the tokens counted as new prefill tokens are not already cached. If the entire prompt of a sequence is cached, we should compute only the last token. A sketch of both changes follows this list.
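
As a hedged sketch of what these two changes could look like (every name below, such as prompt_block_hashes and count_cached_blocks, is hypothetical):

```python
def can_allocate(seq, block_manager) -> bool:
    # Change 1: only the uncached blocks need free space in the block manager.
    block_hashes = seq.prompt_block_hashes()
    num_cached = block_manager.count_cached_blocks(block_hashes)
    num_required = len(block_hashes) - num_cached
    return block_manager.num_free_blocks() >= num_required

def num_new_tokens_to_prefill(seq, block_manager, block_size, chunk_size) -> int:
    # Change 2: don't count already-cached prefix tokens against the batch budget.
    block_hashes = seq.prompt_block_hashes()
    num_cached_tokens = block_manager.count_cached_blocks(block_hashes) * block_size
    # If the whole prompt is cached, still compute the last token so the model
    # can produce its first output token.
    num_cached_tokens = min(num_cached_tokens, seq.num_prompt_tokens() - 1)
    remaining = seq.num_prompt_tokens() - num_cached_tokens
    return min(remaining, chunk_size)  # chunked-prefill cap
```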

cc @rkooo567 @sighingnow @Juelianqvq @zhuohan123

@comaniac added the help wanted and performance labels on Aug 26, 2024
@Juelianqvq
Contributor

Overall this makes sense to me. Since the two strategies, prefix caching (PC) and chunked prefill (CP), are combined, a more powerful customized scheduler is a natural next step. Will take some time to dig into it.

@rkooo567
Collaborator

1.b is not specific to CP, is that correct? I think the proposal makes a lot of sense.

@comaniac
Collaborator Author

Yes, 1.b is a general problem.

@rickyyx
Contributor

rickyyx commented Oct 4, 2024

I’ve been looking into potential ways to implement this feature and wanted to share some notes and proposals for discussion. Please feel free to correct any misunderstandings.

Background

The current prefix cache reuse operates at the allocator/block level, specifically:

  1. When an immutable block (i.e., a full block) is created or promoted in the allocator, its hash is computed based on two factors:
    1. The token IDs, and
    2. The block's index in the sequence.
  2. Blocks with the same hash are reused during allocation/creation in a way that is transparent to the caller (block manager/block table).
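
As a rough illustration of the scheme described above (a simplified sketch, not the actual PrefixCachingBlock code):

```python
# Simplified sketch: a full block's hash depends on its token IDs and its
# position in the sequence, so identical blocks at the same prefix position
# hash to the same value and can be reused.
def content_hash(block_index: int, token_ids: tuple[int, ...]) -> int:
    return hash((block_index, token_ids))
```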

Issues

While the existing system effectively handles cache reuse transparently for upper components, several issues arise with new features:

  1. Additional Metadata for Cache Reuse:
    To handle recent features, more metadata beyond the block index and token IDs should be considered for cache reuse:

    • LoRA Request ID: To properly support LoRA, the LoRA request ID should also be included in the block hash calculation. This allows differentiation of key-value (KV) caches generated by different LoRA adapters, even when sequences have the same token IDs.
    • Encoder/Decoder Sequences: For encoder-decoder models, we likely need to differentiate between encoder and decoder sequences, even if they contain the same token IDs. It’s possible the current implementation assumes that decoder sequences will always differ from encoder sequences, but this might need further investigation.
  2. Lack of Cache Awareness for Scheduling:
    Currently, scheduling cannot determine whether block caches are being reused for a sequence. This lack of visibility makes it impossible for the scheduler to account for cache reuse (which seems to be the root cause of the issue described earlier).
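
To illustrate issue (1), the hash could be extended with the extra metadata; the fields below are hypothetical and only show the idea:

```python
# Hypothetical extension of the block hash with extra metadata: a LoRA
# identifier and an encoder/decoder flag keep KV caches from different
# adapters or sequence roles from colliding even for identical token IDs.
def content_hash_with_metadata(
    block_index: int,
    token_ids: tuple[int, ...],
    lora_id: int | None = None,
    is_encoder: bool = False,
) -> int:
    return hash((block_index, token_ids, lora_id, is_encoder))
```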

Proposal

To make memory management more prefix-aware, I propose the following principle:

The block hash should be treated as metadata associated with the sequence (or block table), rather than being tied to the allocator or the block itself.

While the block could still store this information, it should be treated as an external property (e.g., passed through constructors/setters) rather than being computed within the block. In fact, this seems similar to how the v1 block manager worked.

Potential Implementation:

  1. Compute the block hash once and store it as part of the Sequence.
  2. Provide an API to set the block hash on a block, rather than computing it within PrefixCachingBlock.
  3. When scheduling a sequence, use the block’s stored hash to look up any existing blocks, allowing for efficient scheduling based on cache reuse potential.
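
A rough sketch of these three steps, with all names hypothetical rather than an actual implementation:

```python
# (1) Compute block hashes once and store them on the Sequence.
class Sequence:
    def __init__(self, token_ids: list[int], block_size: int):
        self.token_ids = token_ids
        # Hash of the whole prefix up to each full block, so the block's
        # position in the sequence is captured implicitly.
        self.block_hashes = [
            hash(tuple(token_ids[: (i + 1) * block_size]))
            for i in range(len(token_ids) // block_size)
        ]

# (2) Set the hash on the block via an API instead of computing it inside
#     PrefixCachingBlock, e.g. block.set_content_hash(seq.block_hashes[i]).

# (3) At scheduling time, use the stored hashes to estimate cache reuse
#     before any blocks are allocated.
def num_cached_blocks(seq: Sequence, allocator) -> int:
    return sum(1 for h in seq.block_hashes if allocator.is_cached(h))
```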

@comaniac
Collaborator Author

comaniac commented Oct 4, 2024

Thanks for the write-up; it's exactly what I'm thinking. Regarding (1), storing the hashes in the sequence similar to block manager v1, I don't know why we didn't do that in v2. @youkaichao @rkooo567 @cadedaniel can any of you provide some design rationale? Thanks

@cadedaniel
Collaborator

some thoughts that I hope are helpful!

  • storing the hash info in Sequence is a valid design choice. I was unable to create a design that satisfied the requirements of CoW x prefix caching x lookahead x swap x sliding window without completely blowing up the sequence class. But it could be doable if you drop some requirements (it seems you can remove CoW and swapping?). The whole reason block manager v2 has such design sophistication is that it became very difficult to work in v1 with the many different requirements, leading to bugs and impossible-to-extend code.
  • for LoRA, why not just add the LoRA ID to the hash function? I think this will compose well with the existing prefix caching system, but I might be missing some edge cases.
  • for EncDec, I think the cache characteristics are different enough to warrant a completely different block manager implementation.
  • if I were doing this proposal and didn't want to rewrite the block management system, I would add an API to the block manager that queries how many new blocks would be required for a given sequence to be scheduled. Then you can use this to improve scheduling with prefix-caching hints (see the sketch below). I believe this shouldn't be that hard, but I might be missing some edge case.
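
For example, a query API along these lines (all names hypothetical) could give the scheduler the hint it needs without touching the rest of the block management system:

```python
# Sketch of the suggested query API: the scheduler asks the block manager how
# many blocks would actually have to be newly allocated for a sequence, which
# already accounts for cached prefix blocks.
def get_num_required_blocks(block_manager, seq) -> int:
    total = seq.num_prompt_blocks()
    cached = block_manager.count_cached_blocks(seq.block_hashes)
    return total - cached

# The scheduler can then use this as a prefix-caching hint:
#     if get_num_required_blocks(block_manager, seq) <= block_manager.num_free_blocks():
#         ... schedule seq ...
```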

@comaniac
Collaborator Author

comaniac commented Oct 6, 2024

Thanks for the useful information! We will take a deeper look at CoW and swapping.

For the LoRA ID, it's doable but needs some refactoring, as the LoRA ID currently cannot be obtained directly by the block manager.

For the extended API that returns the number of new blocks, yes, that's one option. The concern here is that we would need to use the prompt token IDs to query this API, and it would need to compute the hashes, which becomes an additional hash computation. Maybe we could let the new API return the hashes and pass them to another new API such as allocate_by_hash().
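
For instance, the pairing could look roughly like this (both APIs are hypothetical):

```python
# The query returns the hashes it computed, and the caller passes them back
# when allocating, so the hashes are only computed once.
hashes, num_new_blocks = block_manager.get_required_blocks_and_hashes(seq)
if num_new_blocks <= block_manager.num_free_blocks():
    block_manager.allocate_by_hash(seq, hashes)  # reuses the precomputed hashes
```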

@rickyyx
Contributor

rickyyx commented Dec 20, 2024

Should we close this given the release is out?

I think there could be some further optimizations in v0:

  • we are still computing block hashes twice (once when adding the request to the tracker, once when allocating the block); this could be tracked as a separate issue.
  • we are still not prefix-cache aware when allocating blocks, so there is some under-utilization of the KV cache (this seems fine since it only happens when the KV cache runs out, and the under-utilization actually sometimes prevents preemption from happening).
