
LFX-Proposal: Cloud-edge collaborative speculative decoding for LLM based on KubeEdge-Ianvs #156

Merged: 1 commit merged into kubeedge:main on Dec 6, 2024

Conversation

FuryMartin (Contributor)

What type of PR is this?
/kind design

What this PR does / why we need it:

Proposal for LFX Project CNCF - KubeEdge: Cloud-Edge Speculative Decoding for LLM via KubeEdge-Ianvs

Which issue(s) this PR fixes:

Fixes #126

@kubeedge-bot added the kind/design label (Categorizes issue or PR as related to design.) on Oct 14, 2024
@kubeedge-bot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Oct 14, 2024
@MooreZheng requested review from hsj576 and removed the review request for jaypume on October 15, 2024, 03:08
@MooreZheng (Collaborator) left a comment


The following Pylint (3.9) CI errors need to be fixed before further action; see the CI logs:

Run pylint '/home/runner/work/ianvs/ianvs/core'
core/testenvmanager/dataset/dataset.py:119:4: R0917: Too many positional arguments (8/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:206:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:213:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:246:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:285:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:329:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)
core/testenvmanager/dataset/dataset.py:368:4: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
************* Module core.testcasecontroller.algorithm.paradigm.singletask_learning.singletask_learning_active_boost
core/testcasecontroller/algorithm/paradigm/singletask_learning/singletask_learning_active_boost.py:66:4: R0917: Too many positional arguments (7/5) (too-many-positional-arguments)

-----------------------------------
Your code has been rated at 9.95/10

Error: Process completed with exit code 8.

@FuryMartin (Contributor, Author)

> The following Pylint (3.9) CI errors need to be fixed before further action; see the CI logs:

This is fixed by #158
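
(For reference, a minimal sketch of one common way to clear Pylint's R0917: keep only a few positional parameters and make the rest keyword-only with a bare `*`. The parameter names below are hypothetical, and this is not necessarily how #158 resolves the warnings.)

```python
# Hypothetical example only; the names do not match the actual Ianvs signatures.

# Before: eight positional parameters trigger R0917 (8/5).
def split_dataset_before(data_file, data_format, ratio, method,
                         dataset_types, output_dir, times, shuffle):
    ...

# After: two positional parameters; everything after the bare `*` must be
# passed by keyword, which satisfies R0917 and makes call sites clearer.
def split_dataset_after(data_file, data_format, *, ratio=0.8, method="default",
                        dataset_types=None, output_dir=".", times=1, shuffle=False):
    ...

# Call sites now name the trailing options explicitly.
split_dataset_after("train.csv", "csv", ratio=0.7, shuffle=True)
```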

@hsj576 (Member) left a comment


The implementation of cloud-edge collaborative speculative decoding in the proposal needs to be further refined. According to what we discussed at the regular community meeting, speculative decoding can be implemented in the cloud while still adopting the hard-example-mining paradigm on the edge side.
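
(A minimal sketch of this division of labor, assuming hypothetical interfaces; it only illustrates the pattern, not the final Ianvs design.)

```python
# Sketch under assumed interfaces: the edge keeps hard-example-mining style
# routing, as in Ianvs' query-routing example, while the cloud endpoint is
# free to serve with a speculative-decoding engine. All names are illustrative.

def is_hard_example(query: str) -> bool:
    # Placeholder mining rule; a real miner could use query length,
    # perplexity, or a small classifier to decide.
    return len(query.split()) > 64

def edge_infer(query: str) -> str:
    # A small edge model answers easy queries locally.
    return f"[edge answer] {query[:32]}"

def cloud_infer(query: str) -> str:
    # The cloud serves hard queries; whether it accelerates generation with
    # speculative decoding (draft model + target model) is transparent to
    # the edge, which only sees an ordinary completion API.
    return f"[cloud answer] {query[:32]}"

def collaborative_infer(query: str) -> str:
    return cloud_infer(query) if is_hard_example(query) else edge_infer(query)

print(collaborative_infer("What is KubeEdge?"))
```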

@kubeedge-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) and removed the size/M label on Nov 7, 2024
@FuryMartin changed the title from "Cloud-edge collaborative speculative decoding for LLM based on KubeEdge-Ianvs" to "LFX-Proposal: Cloud-edge collaborative speculative decoding for LLM based on KubeEdge-Ianvs" on Nov 7, 2024
@MooreZheng (Collaborator) left a comment


Overall it looks fine to me. You might need to highlight the difference from the OSPP proposal.

@FuryMartin force-pushed the lfx-proposal branch 2 times, most recently from 25fcccb to 746c70d on November 28, 2024, 07:44
@hsj576 (Member) commented Nov 28, 2024

It is necessary to highlight, in the motivation section, why we use speculative decoding to accelerate LLM cloud-edge collaborative inference. The differences between this proposal and the OSPP proposal should be further highlighted in the methods section.

@MooreZheng (Collaborator) left a comment


Overall it looks fine. As discussed at the routine meeting, there are a few points still to be addressed.

  1. The motivation for using the technique is not quite clear. I believe it would make sense to improve the inference time, but the reason for and potential of solving this problem could be further explored, e.g., by adding examples.
  2. The difference from the existing design needs to be highlighted.

@FuryMartin (Contributor, Author)

I have refined the proposal.

> 1. The motivation for using the technique is not quite clear. I believe it would make sense to improve the inference time, but the reason for and potential of solving this problem could be further explored, e.g., by adding examples.

For the motivation part, I explained the necessity of improving inference speed for the long-context chatbot scenario and the LLM-based agent scenario.

> 2. The difference from the existing design needs to be highlighted.

For the difference from Query-Routing, I added a Highlights section and highlighted the new modules in the overall architecture diagram.

@FuryMartin (Contributor, Author) commented Dec 2, 2024

By the way, as an intuitive illustration, I tested the inference speed of Qwen2-7B-Instruct on different inference frameworks running on a single RTX 4090 GPU.

However, this is only a preliminary experiment; I don't think it is appropriate to include it in the proposal, so I am posting it here just as a preview of the results.

| Serving Engine | Time to First Token (s) | Internal Token Latency (s) | Throughput (tokens/s) | Speed Up |
| --- | --- | --- | --- | --- |
| transformers | 1.4382 | 0.0202 | 49.60 | 1.00x |
| vLLM | 0.0676 | 0.0168 | 59.54 | 1.20x |
| EAGLE | 0.1918 | 0.0076 | 131.80 | 2.66x |

EAGLE is a speculative decoding framework, tested here with the draft model yuhuili/EAGLE-Qwen2-7B-Instruct.

Notice that EAGLE's throughput is more than 2.6 times that of transformers, showcasing the great potential of speculative decoding.
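
(For intuition about where the speed-up comes from, below is a toy, self-contained sketch of the generic draft-and-verify loop behind speculative decoding. It is not EAGLE's algorithm or API; both "models" are stand-in functions. The point is only that accepted draft tokens amortize a single target-model pass over several output tokens.)

```python
# Toy illustration of speculative decoding's draft-and-verify loop.
# The draft model is cheap and agrees with the target model most of the time,
# so several tokens can be accepted per expensive target-model verification pass.

import random

random.seed(0)
VOCAB = list(range(100))

def target_next(prefix):
    # Deterministic stand-in for the expensive target model.
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft_next(prefix):
    # Cheap stand-in draft model that matches the target ~80% of the time.
    return target_next(prefix) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_generate(prompt, n_tokens, k=4):
    out, target_calls = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        base = list(out)
        # 1) Draft k candidate tokens with the cheap model.
        drafts = []
        for _ in range(k):
            drafts.append(draft_next(base + drafts))
        # 2) Verify them against the target model (conceptually one batched
        #    forward pass); accept the longest matching prefix.
        target_calls += 1
        for i, tok in enumerate(drafts):
            expected = target_next(base + drafts[:i])
            if tok == expected:
                out.append(tok)        # accepted draft token, no extra pass
            else:
                out.append(expected)   # first mismatch: take the target's token
                break
    return out[len(prompt):], target_calls

tokens, calls = speculative_generate([1, 2, 3], n_tokens=32)
print(f"generated {len(tokens)} tokens using {calls} target-model passes")
```

In this toy setting, generating 32 tokens typically takes roughly a third as many target-model passes as tokens, which is qualitatively the same effect behind the throughput gap in the table above.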

@FuryMartin FuryMartin requested a review from MooreZheng December 3, 2024 11:29
@MooreZheng (Collaborator) left a comment


Good job. We see that the revised version of the proposal is much improved: the inference-time savings are discussed and the difference from the existing design is highlighted.

Nevertheless, the quantified results are not included in the proposal, and adjectives alone are not enough to justify the motivation. Is there any clue, e.g., about how much improvement could potentially be provided, or how fast is sufficient for an LLM query?

@FuryMartin (Contributor, Author) commented Dec 4, 2024

> Good job. We see that the revised version of the proposal is much improved: the inference-time savings are discussed and the difference from the existing design is highlighted.
>
> Nevertheless, the quantified results are not included in the proposal, and adjectives alone are not enough to justify the motivation. Is there any clue, e.g., about how much improvement could potentially be provided, or how fast is sufficient for an LLM query?

Thanks for the review.

| Serving Engine | Time to First Token (s) | Internal Token Latency (s) | Throughput (tokens/s) | Speed Up |
| --- | --- | --- | --- | --- |
| transformers | 1.4382 | 0.0202 | 49.60 | 1.00x |
| vLLM | 0.0676 | 0.0168 | 59.54 | 1.20x |
| EAGLE | 0.1918 | 0.0076 | 131.80 | 2.66x |

Are the preliminary test results I mentioned in my previous comment the kind of quantified results you are looking for?

If you think they are okay, I will include them in the proposal.

@FuryMartin (Contributor, Author)

I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.

> On November 6, 2024, OpenAI introduced an innovative feature named Predicted Outputs, which is capable of accelerating GPT-4o generation by 2-4 times while maintaining accuracy. This remarkable capability is rooted in the concept of Speculative Decoding, showcasing its immense potential to enhance the inference speed of Large Language Models (LLMs).

Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?

@MooreZheng (Collaborator) left a comment


> I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.
>
> > On November 6, 2024, OpenAI introduced an innovative feature named Predicted Outputs, which is capable of accelerating GPT-4o generation by 2-4 times while maintaining accuracy. This remarkable capability is rooted in the concept of Speculative Decoding, showcasing its immense potential to enhance the inference speed of Large Language Models (LLMs).
>
> Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?

That would also be quite a good point for the motivation. A proposal with techniques used in real-world scenarios will definitely strengthen the motivation.

Besides, the previous quantified results can be further improved with more real-world examples. Showing high throughput does not necessarily mean a productivity boost, unless we show that the current low-throughput condition is unacceptable.

To finally close the loop of the logical chain, we still need an example: given an LLM-produced article of xxx words, the inference latency can drop from yyy seconds to zzz seconds with the proposed technique, a xxxxx% improvement.

Signed-off-by: Yu Fan <[email protected]>

doc: modify architecture diagram

Signed-off-by: Yu Fan <[email protected]>

doc: add explanation of necessity for faster inference speed; highlight the difference from query-routing

Signed-off-by: Yu Fan <[email protected]>

doc: add quantifacation results from preliminary experiments; add OpenAI practice of Speculative Decoding; add a gif of lookahead-decoding

Signed-off-by: Yu Fan <[email protected]>
@FuryMartin (Contributor, Author)

> > I just added the following paragraph to the proposal, which showcases the practice and effects of the Speculative Decoding technique in OpenAI's products.
> >
> > > On November 6, 2024, OpenAI introduced an innovative feature named Predicted Outputs, which is capable of accelerating GPT-4o generation by 2-4 times while maintaining accuracy. This remarkable capability is rooted in the concept of Speculative Decoding, showcasing its immense potential to enhance the inference speed of Large Language Models (LLMs).
> >
> > Is this sufficient support for the motivation, thereby justifying the necessity of Speculative Decoding?
>
> That would also be quite a good point for the motivation. A proposal with techniques used in real-world scenarios will definitely strengthen the motivation.
>
> Besides, the previous quantified results can be further improved with more real-world examples. Showing high throughput does not necessarily mean a productivity boost, unless we show that the current low-throughput condition is unacceptable.
>
> To finally close the loop of the logical chain, we still need an example: given an LLM-produced article of xxx words, the inference latency can drop from yyy seconds to zzz seconds with the proposed technique, a xxxxx% improvement.

Sure, I have refined my proposal according to the maintainers' comments at today's regular community meeting.

@FuryMartin requested a review from MooreZheng on December 5, 2024, 10:01
@MooreZheng (Collaborator)

/lgtm

@kubeedge-bot added the lgtm label (Indicates that a PR is ready to be merged.) on Dec 5, 2024
@hsj576 (Member) commented Dec 6, 2024

/lgtm

@MooreZheng (Collaborator) left a comment


All concerns from reviewers are tackled. Well done!

@MooreZheng (Collaborator)

/approve

@kubeedge-bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Dec 6, 2024
@kubeedge-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MooreZheng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot merged commit aefdbeb into kubeedge:main on Dec 6, 2024
13 checks passed