LFX-Proposal: Cloud-edge collaborative speculative decoding for LLM based on KubeEdge-Ianvs #156
Conversation
The CI errors below from Pylint (3.9) need to be fixed before further action; see the CI logs:
Run pylint '/home/runner/work/ianvs/ianvs/core'
core/testenvmanager/dataset/dataset.py:119:4: R0917: Too many positional arguments (8/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:206:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:213:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:246:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:285:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:329:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:368:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
************* Module core.testcasecontroller.algorithm.paradigm.singletask_learning.singletask_learning_active_boost
core/testcasecontroller/algorithm/paradigm/singletask_learning/singletask_learning_active_boost.py:66:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
-----------------------------------
Your code has been rated at 9.95/10
Error: Process completed with exit code 8.
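For reference, a common way to clear R0917 without disabling the check is to make the trailing parameters keyword-only (or bundle them into a config object). The sketch below uses a hypothetical signature for illustration, not the actual one in dataset.py:

```python
# Hypothetical example, not the real signature in core/testenvmanager/dataset/dataset.py.
# Before: eight positional parameters would trigger R0917 (too-many-positional-arguments).
# After: the bare '*' makes everything after it keyword-only, so pylint only counts
# the positional parameters in front of it.

class Dataset:
    def split_dataset(self, data_file, output_dir, *,
                      ratio=0.8, method="default", dataset_types=None,
                      index_file=None, shuffle=False):
        """Split a dataset file into train/eval parts (illustrative stub)."""
        raise NotImplementedError

# Callers then pass the extras by keyword:
# Dataset().split_dataset("train.txt", "/tmp/out", ratio=0.9, shuffle=True)
```

If the existing call sites must stay positional for backward compatibility, an inline `# pylint: disable=too-many-positional-arguments` on the offending definitions is the other option.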
The implementation of cloud-edge collaborative speculative decoding in the proposal needs to be further refined. As we discussed in the regular community meeting, speculative decoding can be implemented in the cloud while still adopting the hard example mining paradigm on the edge side.
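For readers less familiar with the idea, a minimal sketch of that split is given below; all names are illustrative assumptions rather than actual Ianvs interfaces. The edge keeps the hard example mining paradigm and only offloads hard queries, while the cloud answers them with a draft-then-verify speculative decoding loop.

```python
# Illustrative sketch only: the function names and the edge/cloud wiring are
# assumptions for this discussion, not the actual Ianvs API.

def speculative_generate(draft_next, target_next, prompt, k=4, max_new=32):
    """Toy greedy speculative decoding, as it would run in the cloud.

    A small draft model proposes k tokens at a time; the large target model
    verifies them, keeps the longest matching prefix, and contributes one token
    of its own. In a real engine all k proposals are verified in a single
    target forward pass, which is where the latency saving comes from.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        proposed = []
        for _ in range(k):                              # 1) cheap drafting
            proposed.append(draft_next(tokens + proposed))
        accepted = []
        for tok in proposed:                            # 2) verification
            expected = target_next(tokens + accepted)
            accepted.append(tok if tok == expected else expected)
            if tok != expected:
                break
        else:                                           # all accepted: bonus token
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[len(prompt):][:max_new]


def edge_cloud_answer(query, edge_generate, is_hard_example, draft_next, target_next):
    """Hard example mining stays on the edge; speculative decoding runs in the cloud."""
    edge_answer = edge_generate(query)
    if not is_hard_example(query, edge_answer):         # easy query: answer locally
        return edge_answer
    return speculative_generate(draft_next, target_next, prompt=query)
```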
Overall it looks fine to me. You might need to highlight the difference from the OSPP proposal.
It is necessary to highlight, in the motivation section, why we use speculative decoding to accelerate LLM cloud-edge collaborative inference. The differences between this proposal and the OSPP proposal should be further highlighted in the methods section.
Overall it looks fine. As discussed at the routine meeting, there are a few points yet to be addressed.
- The motivation for using the technique is not quite clear. I believe it makes sense to improve the inference time, but the reason for and potential of solving this problem could be further explored, e.g., by adding examples.
- The difference from the existing design needs to be highlighted.
I have refined the proposal.
For the motivation part, I explained the necessity of improving inference speed for the long-context chatbot scenario and the LLM-based agent scenario.
For the difference from Query-Routing, I added a Highlights section and highlighted the new modules in the overall architecture picture.
By the way, as an intuitive illustration, I tested the inference speed of Qwen2-7B-Instruct on different inference frameworks running on a single RTX 4090 GPU. However, this is just a preliminary experiment; I don't think it is appropriate to include in the proposal, so I am posting it here only as a preview of the results.
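For context, a tokens-per-second number of this kind can be obtained with a short script; the sketch below uses plain Hugging Face Transformers as one possible baseline backend (the prompt and decoding settings are placeholders, not the exact configuration of the preliminary test, and the other frameworks each have their own client APIs):

```python
# Rough sketch of a tokens/second measurement with plain Transformers.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", torch_dtype="auto")

prompt = "Explain cloud-edge collaborative inference in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```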
Good job. We see that the revised proposal is much improved: the inference-time saving is discussed and the difference from the existing design is highlighted.
Nevertheless, the quantified results are not included in the proposal, and adjectives alone are not enough to justify the motivation. Is there any clue, e.g., about how much improvement could potentially be provided, or how fast is sufficient for an LLM query?
Thanks for the review.
Are the preliminary test results I mentioned in my previous comment what you want as quantified results? If you think they are okay, I will include them in the proposal.
I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.
Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?
> I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.
>
> On November 6, 2024, OpenAI introduced an innovative feature named Predicted Outputs, which is capable of accelerating GPT-4o generation by 2-4 times while maintaining accuracy. This remarkable capability is rooted in the concept of Speculative Decoding, showcasing its immense potential to enhance the inference speed of Large Language Models (LLMs).
>
> Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?
That would also be quite a good point for the motivation. A proposal backed by techniques used in real-world scenarios will definitely strengthen the motivation.
Besides, the previous quantified results can be further improved with more real-world examples. Showing high throughput does not necessarily mean a productivity boost, unless we show that the current condition with low throughput is not acceptable.
To close the logical loop, we still need an example. That is, given an LLM-produced article with xxx words, the inference latency can drop from yyy seconds to zzz seconds with the proposed technique, showing a xxxxx% improvement.
Signed-off-by: Yu Fan <[email protected]>
doc: modify architecture diagram
Signed-off-by: Yu Fan <[email protected]>
doc: add explanation of necessity for faster inference speed; highlight the difference from query-routing
Signed-off-by: Yu Fan <[email protected]>
doc: add quantification results from preliminary experiments; add OpenAI practice of Speculative Decoding; add a gif of lookahead-decoding
Signed-off-by: Yu Fan <[email protected]>
Sure, I have refined my proposal according to the maintainers' comments at today's regular community meeting.
/lgtm
/lgtm
All concerns from the reviewers have been addressed. Well done!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: MooreZheng.
What type of PR is this?
/kind design
What this PR does / why we need it:
Proposal for LFX Project CNCF - KubeEdge: Cloud-Edge Speculative Decoding for LLM via KubeEdge-Ianvs
Which issue(s) this PR fixes:
Fixes #126