
LFX-Proposal: Cloud-edge collaborative speculative decoding for LLM based on KubeEdge-Ianvs #156

Merged: 1 commit merged into kubeedge:main on Dec 6, 2024

Conversation

FuryMartin (Contributor)

What type of PR is this?
/kind design

What this PR does / why we need it:

Proposal for LFX Project CNCF - KubeEdge: Cloud-Edge Speculative Decoding for LLM via KubeEdge-Ianvs

Which issue(s) this PR fixes:

Fixes #126

@kubeedge-bot added the kind/design label (Categorizes issue or PR as related to design.) on Oct 14, 2024
@kubeedge-bot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Oct 14, 2024
@MooreZheng requested review from hsj576 and removed the review request for jaypume on October 15, 2024, 03:08
@MooreZheng (Collaborator) left a comment


The following Pylint (3.9) CI errors need to be fixed before further action; see the CI logs:

Run pylint '/home/runner/work/ianvs/ianvs/core'
core/testenvmanager/dataset/dataset.py:119:4: R0917: Too many positional arguments (8/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:206:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:213:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:246:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:285:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:329:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:368:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
************* Module core.testcasecontroller.algorithm.paradigm.singletask_learning.singletask_learning_active_boost
core/testcasecontroller/algorithm/paradigm/singletask_learning/singletask_learning_active_boost.py:66:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)

-----------------------------------
Your code has been rated at 9.95/10

Error: Process completed with exit code 8.

@FuryMartin (Contributor, Author)

> The following Pylint (3.9) CI errors need to be fixed before further action; see the CI logs:

This is fixed by #158
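
(For reference, a minimal sketch of one common way to clear Pylint's R0917: keep only a few positional parameters and make the rest keyword-only with a bare `*`. The parameter names below are hypothetical, and this is not necessarily how #158 resolves the warnings.)

```python
# Hypothetical example only; the names do not match the actual Ianvs signatures.

# Before: eight positional parameters trigger R0917 (8/5).
def split_dataset_before(data_file, data_format, ratio, method,
                         dataset_types, output_dir, times, shuffle):
    ...

# After: two positional parameters; everything after the bare `*` must be
# passed by keyword, which satisfies R0917 and makes call sites clearer.
def split_dataset_after(data_file, data_format, *, ratio=0.8, method="default",
                        dataset_types=None, output_dir=".", times=1, shuffle=False):
    ...

# Call sites now name the trailing options explicitly.
split_dataset_after("train.csv", "csv", ratio=0.7, shuffle=True)
```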

@hsj576 (Member) left a comment


The implementation of cloud-edge collaborative speculative decoding in the proposal needs to be further refined. According to what we discussed at the regular community meeting, speculative decoding can be implemented in the cloud while still adopting the hard-example-mining paradigm on the edge side.
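
(A minimal sketch of this division of labor, assuming hypothetical interfaces; it only illustrates the pattern, not the final Ianvs design.)

```python
# Sketch under assumed interfaces: the edge keeps hard-example-mining style
# routing, as in Ianvs' query-routing example, while the cloud endpoint is
# free to serve with a speculative-decoding engine. All names are illustrative.

def is_hard_example(query: str) -> bool:
    # Placeholder mining rule; a real miner could use query length,
    # perplexity, or a small classifier to decide.
    return len(query.split()) > 64

def edge_infer(query: str) -> str:
    # A small edge model answers easy queries locally.
    return f"[edge answer] {query[:32]}"

def cloud_infer(query: str) -> str:
    # The cloud serves hard queries; whether it accelerates generation with
    # speculative decoding (draft model + target model) is transparent to
    # the edge, which only sees an ordinary completion API.
    return f"[cloud answer] {query[:32]}"

def collaborative_infer(query: str) -> str:
    return cloud_infer(query) if is_hard_example(query) else edge_infer(query)

print(collaborative_infer("What is KubeEdge?"))
```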

@kubeedge-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) and removed the size/M label on Nov 7, 2024
@FuryMartin changed the title from "Cloud-edge collaborative speculative decoding for LLM based on KubeEdge-Ianvs" to "LFX-Proposal: Cloud-edge collaborative speculative decoding for LLM based on KubeEdge-Ianvs" on Nov 7, 2024
@MooreZheng (Collaborator) left a comment


Overall it looks fine to me. You might need to highlight the difference from the OSPP proposal.

@FuryMartin force-pushed the lfx-proposal branch 2 times, most recently from 25fcccb to 746c70d on November 28, 2024, 07:44
@hsj576 (Member) commented Nov 28, 2024

It is necessary to highlight, in the motivation section, why we use speculative decoding to accelerate LLM cloud-edge collaborative inference. The differences between this proposal and the OSPP proposal should be further highlighted in the methods section.

@MooreZheng (Collaborator) left a comment


Overall it looks fine. As discussed at the routine meeting, there are a few points still to be addressed.

  1. The motivation for using the technique is not quite clear. I believe it would make sense to improve the inference time, but the reason for and potential of solving this problem could be further explored, e.g., by adding examples.
  2. The difference from the existing design needs to be highlighted.

@FuryMartin (Contributor, Author)

I have refined the proposal.

> 1. The motivation for using the technique is not quite clear. I believe it would make sense to improve the inference time, but the reason for and potential of solving this problem could be further explored, e.g., by adding examples.

For the motivation part, I explained the necessity of improving inference speed for the long-context chatbot scenario and the LLM-based agent scenario.

> 2. The difference from the existing design needs to be highlighted.

For the difference from Query-Routing, I added a Highlights section and highlighted the new modules in the overall architecture diagram.

@FuryMartin (Contributor, Author) commented Dec 2, 2024

By the way, as an intuitive illustration, I tested the inference speed of Qwen2-7B-Instruct on different inference frameworks running on a single RTX 4090 GPU.

However, this is only a preliminary experiment; I don't think it is appropriate to include it in the proposal, so I am posting it here just as a preview of the results.

| Serving Engine | Time to First Token (s) | Internal Token Latency (s) | Throughput (tokens/s) | Speed Up |
| --- | --- | --- | --- | --- |
| transformers | 1.4382 | 0.0202 | 49.60 | 1.00x |
| vLLM | 0.0676 | 0.0168 | 59.54 | 1.20x |
| EAGLE | 0.1918 | 0.0076 | 131.80 | 2.66x |

EAGLE is a speculative decoding framework, tested here with the draft model yuhuili/EAGLE-Qwen2-7B-Instruct.

Notice that EAGLE's throughput is more than 2.6 times that of transformers, showcasing the great potential of speculative decoding.
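
(For intuition about where the speed-up comes from, below is a toy, self-contained sketch of the generic draft-and-verify loop behind speculative decoding. It is not EAGLE's algorithm or API; both "models" are stand-in functions. The point is only that accepted draft tokens amortize a single target-model pass over several output tokens.)

```python
# Toy illustration of speculative decoding's draft-and-verify loop.
# The draft model is cheap and agrees with the target model most of the time,
# so several tokens can be accepted per expensive target-model verification pass.

import random

random.seed(0)
VOCAB = list(range(100))

def target_next(prefix):
    # Deterministic stand-in for the expensive target model.
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft_next(prefix):
    # Cheap stand-in draft model that matches the target ~80% of the time.
    return target_next(prefix) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_generate(prompt, n_tokens, k=4):
    out, target_calls = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        base = list(out)
        # 1) Draft k candidate tokens with the cheap model.
        drafts = []
        for _ in range(k):
            drafts.append(draft_next(base + drafts))
        # 2) Verify them against the target model (conceptually one batched
        #    forward pass); accept the longest matching prefix.
        target_calls += 1
        for i, tok in enumerate(drafts):
            expected = target_next(base + drafts[:i])
            if tok == expected:
                out.append(tok)        # accepted draft token, no extra pass
            else:
                out.append(expected)   # first mismatch: take the target's token
                break
    return out[len(prompt):], target_calls

tokens, calls = speculative_generate([1, 2, 3], n_tokens=32)
print(f"generated {len(tokens)} tokens using {calls} target-model passes")
```

In this toy setting, generating 32 tokens typically takes roughly a third as many target-model passes as tokens, which is qualitatively the same effect behind the throughput gap in the table above.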

@FuryMartin FuryMartin requested a review from MooreZheng December 3, 2024 11:29
@MooreZheng (Collaborator) left a comment


Good job. We see that the revised version of the proposal is much improved: the inference-time savings are discussed and the difference from the existing design is highlighted.

Nevertheless, the quantified results are not included in the proposal, and adjectives alone are not enough to justify the motivation. Is there any clue, e.g., about how much improvement could potentially be provided, or how fast is sufficient for an LLM query?

@FuryMartin (Contributor, Author) commented Dec 4, 2024

> Good job. We see that the revised version of the proposal is much improved: the inference-time savings are discussed and the difference from the existing design is highlighted.
>
> Nevertheless, the quantified results are not included in the proposal, and adjectives alone are not enough to justify the motivation. Is there any clue, e.g., about how much improvement could potentially be provided, or how fast is sufficient for an LLM query?

Thanks for the review.

| Serving Engine | Time to First Token (s) | Internal Token Latency (s) | Throughput (tokens/s) | Speed Up |
| --- | --- | --- | --- | --- |
| transformers | 1.4382 | 0.0202 | 49.60 | 1.00x |
| vLLM | 0.0676 | 0.0168 | 59.54 | 1.20x |
| EAGLE | 0.1918 | 0.0076 | 131.80 | 2.66x |

Are the preliminary test results I mentioned in my previous comment the kind of quantified results you are looking for?

If you think they are okay, I will include them in the proposal.

@FuryMartin (Contributor, Author)

I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.

> On November 6, 2024, OpenAI introduced an innovative feature named Predicted Outputs, which is capable of accelerating GPT-4o generation by 2-4 times while maintaining accuracy. This remarkable capability is rooted in the concept of Speculative Decoding, showcasing its immense potential to enhance the inference speed of Large Language Models (LLMs).

Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?

@MooreZheng (Collaborator) left a comment


> I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.
>
> > On November 6, 2024, OpenAI introduced an innovative feature named Predicted Outputs, which is capable of accelerating GPT-4o generation by 2-4 times while maintaining accuracy. This remarkable capability is rooted in the concept of Speculative Decoding, showcasing its immense potential to enhance the inference speed of Large Language Models (LLMs).
>
> Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?

That would also be quite a good point for the motivation. A proposal with techniques used in real-world scenarios will definitely strengthen the motivation.

Besides, the previous quantified results can be further improved with more real-world examples. Showing high throughput does not necessarily mean a productivity boost, unless we show that the current low-throughput condition is unacceptable.

To finally close the loop of the logical chain, we still need an example: given an LLM-produced article of xxx words, the inference latency can drop from yyy seconds to zzz seconds with the proposed technique, a xxxxx% improvement.

Signed-off-by: Yu Fan <[email protected]>

doc: modify architecture diagram

Signed-off-by: Yu Fan <[email protected]>

doc: add explanation of necessity for faster inference speed; highlight the difference from query-routing

Signed-off-by: Yu Fan <[email protected]>

doc: add quantifacation results from preliminary experiments; add OpenAI practice of Speculative Decoding; add a gif of lookahead-decoding

Signed-off-by: Yu Fan <[email protected]>
@FuryMartin (Contributor, Author)

> > I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.
> >
> > > On November 6, 2024, OpenAI introduced an innovative feature named Predicted Outputs, which is capable of accelerating GPT-4o generation by 2-4 times while maintaining accuracy. This remarkable capability is rooted in the concept of Speculative Decoding, showcasing its immense potential to enhance the inference speed of Large Language Models (LLMs).
> >
> > Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?
>
> That would also be quite a good point for the motivation. A proposal with techniques used in real-world scenarios will definitely strengthen the motivation.
>
> Besides, the previous quantified results can be further improved with more real-world examples. Showing high throughput does not necessarily mean a productivity boost, unless we show that the current low-throughput condition is unacceptable.
>
> To finally close the loop of the logical chain, we still need an example: given an LLM-produced article of xxx words, the inference latency can drop from yyy seconds to zzz seconds with the proposed technique, a xxxxx% improvement.

Sure, I have refined my proposal according to the maintainers' comments at today's regular community meeting.

@FuryMartin requested a review from MooreZheng on December 5, 2024, 10:01
@MooreZheng (Collaborator)

/lgtm

@kubeedge-bot added the lgtm label (Indicates that a PR is ready to be merged.) on Dec 5, 2024
@hsj576 (Member) commented Dec 6, 2024

/lgtm

@MooreZheng (Collaborator) left a comment


All concerns from reviewers are tackled. Well done!

@MooreZheng (Collaborator)

/approve

@kubeedge-bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Dec 6, 2024
@kubeedge-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MooreZheng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot merged commit aefdbeb into kubeedge:main on Dec 6, 2024
13 checks passed