[Performance]: Prefix-caching aware scheduling #7883
Overall it makes sense to me: since the two strategies (prefix caching + chunked prefill) are combined, a more powerful customized scheduler naturally follows. I will take some time to dig into it.
1.b is not specific to chunked prefill, is this correct? I think the proposal makes a lot of sense.
Yes, 1.b is a general problem.
I’ve been looking into potential ways to implement this feature and wanted to share some notes and proposals for discussion. Please feel free to correct any misunderstandings.

**Background**

The current prefix cache reuse operates at the allocator/block level, specifically:
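For context, allocator-level prefix caching of this kind is usually keyed on a hash chain over full blocks of token IDs, so that equal hashes imply equal prefixes. The sketch below is illustrative only (hypothetical `HashedBlockPool`/`chain_hash` names, block size assumed to be 16 tokens), not vLLM's actual classes:

```python
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 16  # assumed tokens per KV-cache block


def chain_hash(prev_hash: Optional[int], block_tokens: Tuple[int, ...]) -> int:
    """Hash a block together with its predecessor's hash, so equal hashes
    imply the entire prefix up to this block is equal."""
    return hash((prev_hash, block_tokens))


class HashedBlockPool:
    """Toy allocator that deduplicates full blocks by their chained hash."""

    def __init__(self) -> None:
        self._cache: Dict[int, int] = {}  # block hash -> physical block id
        self._next_id = 0

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (block ids for the prompt, number of cache hits)."""
        blocks: List[int] = []
        hits = 0
        prev: Optional[int] = None
        for i in range(0, len(token_ids), BLOCK_SIZE):
            chunk = tuple(token_ids[i:i + BLOCK_SIZE])
            if len(chunk) < BLOCK_SIZE:
                # Partial tail block: contents may still grow, so never cached.
                blocks.append(self._fresh())
                continue
            prev = chain_hash(prev, chunk)
            if prev in self._cache:
                blocks.append(self._cache[prev])
                hits += 1
            else:
                block = self._fresh()
                self._cache[prev] = block
                blocks.append(block)
        return blocks, hits

    def _fresh(self) -> int:
        self._next_id += 1
        return self._next_id - 1
```

A second request sharing a prefix then reuses the same physical blocks and only needs to compute the remainder.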
**Issues**

While the existing system effectively handles cache reuse transparently for upper components, several issues arise with new features:
**Proposal**

To make memory management more prefix-aware, I propose the following principle:
While the block could still store this information, it should be treated as an external property (e.g., passed through constructors/setters) rather than being computed within the block. In fact, this seems similar to how the v1 block manager worked.

**Potential implementation:**
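As a minimal sketch of that principle (hypothetical names; the hash would be computed by whoever tracks the sequence, as in block manager v1, and merely assigned to the block):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """A KV-cache block that receives its content hash from outside.

    The block never inspects token IDs itself; the sequence-side owner
    computes the hash and passes it in via the constructor or setter.
    """
    block_id: int
    content_hash: Optional[int] = None  # assigned externally once the block is full

    def set_content_hash(self, content_hash: int) -> None:
        # Immutable once set: a full block's contents never change.
        assert self.content_hash is None, "content hash may only be set once"
        self.content_hash = content_hash
```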
Thanks for the write-up; it's exactly what I'm thinking. Regarding (1), which stores it in the sequence similar to block manager v1: I don't know why we didn't do that in v2. @youkaichao @rkooo567 @cadedaniel, can any of you provide some design rationale? Thanks
Some thoughts that I hope are helpful!
Thanks for the useful information! We will take a deeper look at CoW and swapping. For the LoRA ID, it's doable but needs some refactoring, as the LoRA ID currently cannot be obtained directly by the block manager. For the extended API that returns the number of new blocks: yes, that's one option. The concern here is that we would need to query this API with prompt IDs, and it would need to compute the hashes, which becomes an additional hash computation. Maybe we could let the new API return the hashes and pass them to another new API such as allocate_by_hash().
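That two-call protocol could look roughly like the following sketch, under assumed names (`get_num_new_blocks()` returning the hashes it computed, and `allocate_by_hash()` reusing them so the prompt is hashed only once):

```python
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 16  # assumed tokens per KV-cache block


class PrefixAwareBlockManager:
    """Sketch of a two-call protocol (hypothetical API, not vLLM's):
    get_num_new_blocks() computes the chained hashes once and returns them;
    allocate_by_hash() consumes those hashes instead of rehashing the prompt.
    Partial tail blocks are omitted from allocation for brevity."""

    def __init__(self, num_free_blocks: int) -> None:
        self.num_free_blocks = num_free_blocks
        self._cached: Dict[int, int] = {}  # block hash -> block id
        self._next_id = 0

    def get_num_new_blocks(self, token_ids: List[int]) -> Tuple[int, List[int]]:
        """Return (number of blocks that must be newly allocated, block hashes)."""
        hashes: List[int] = []
        prev: Optional[int] = None
        for i in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
            prev = hash((prev, tuple(token_ids[i:i + BLOCK_SIZE])))
            hashes.append(prev)
        new = sum(1 for h in hashes if h not in self._cached)
        if len(token_ids) % BLOCK_SIZE:
            new += 1  # a partial tail block always needs a fresh allocation
        return new, hashes

    def allocate_by_hash(self, hashes: List[int]) -> List[int]:
        """Allocate using precomputed hashes; cached blocks are reused."""
        block_ids: List[int] = []
        for h in hashes:
            if h not in self._cached:
                self._cached[h] = self._next_id
                self._next_id += 1
                self.num_free_blocks -= 1
            block_ids.append(self._cached[h])
        return block_ids
```

The scheduler would call get_num_new_blocks() for admission control, keep the returned hashes on the sequence, and hand them to allocate_by_hash() once the sequence is scheduled.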
Should we close this, given the release is out? I think there could be some further optimizations in v0:
Proposal to improve performance
The current execution flow with prefix caching is as follows:

1. Scheduler:
   a. Calculate how many blocks the sequence needs.
   b. Check whether we have a sufficient number of blocks in the block manager.
   c. If so, determine the number of tokens to be prefilled in this batch (equal to the prompt length without chunked prefill, or at most the chunk size otherwise).
   d. Update the batch token budget by subtracting the tokens to be prefilled.
   e. Allocate all blocks (regardless of how many tokens are prefilled in this batch).
   f. Match the allocated block IDs against the prefix cache, and list the hits in `computed_block_nums`.
2. Prepare input:
   a. Get the number of tokens to prefill for this sequence in this batch.
   b. Set up input token IDs and positions.
   c. If `computed_block_nums` is not None, remove the cached tokens from the input tokens, and adjust the input positions, query length, and context length accordingly.

The inefficiencies are then:

- Step 1.b may reject a sequence for lack of free blocks even when most of its blocks are already cached.
- Step 1.d charges the token budget for tokens that Step 2.c later drops as cached, so the batch can be under-utilized.
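Step 2.c can be sketched as follows (hypothetical helper; assumes a block size of 16 tokens and that `computed_block_nums` covers a prefix of full blocks):

```python
from typing import List, Tuple

BLOCK_SIZE = 16  # assumed tokens per KV-cache block


def adjust_for_cached_prefix(
    input_token_ids: List[int],
    computed_block_nums: List[int],
) -> Tuple[List[int], List[int], int, int]:
    """Drop tokens already covered by cached blocks and fix up
    positions, query length, and context length (sketch of Step 2.c)."""
    context_len = len(computed_block_nums) * BLOCK_SIZE
    tokens = input_token_ids[context_len:]  # only uncached tokens remain
    positions = list(range(context_len, context_len + len(tokens)))
    query_len = len(tokens)                 # tokens actually computed this step
    return tokens, positions, query_len, context_len
```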
The above inefficiencies come from the fact that we only know which blocks are cached starting from Step 1.f. Thus, we propose the following changes:

1. Improve `can_allocate` in the block manager at Step 1.b to consider prefix caching. For example, although a sequence needs 16 blocks for its prompt, `can_allocate` could still return True even if the block manager only has 4 blocks left, when 12 of the 16 blocks are already cached.

cc @rkooo567 @sighingnow @Juelianqvq @zhuohan123
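The proposed check could be as simple as the following sketch (hypothetical signature; the numbers in the usage line mirror the 16-block/12-cached/4-free example above):

```python
def can_allocate(num_required_blocks: int,
                 num_cached_blocks: int,
                 num_free_blocks: int) -> bool:
    """Prefix-aware admission check: only uncached blocks need free space."""
    return num_required_blocks - num_cached_blocks <= num_free_blocks
```

Here `can_allocate(16, 12, 4)` returns True, whereas the non-prefix-aware check (16 <= 4) would reject the sequence.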