[Roadmap] vLLM Roadmap Q4 2024 #9006

simon-mo · 2024-10-01T17:39:50Z

IsaacRe · 2024-10-02T19:33:07Z

Support for KV cache compression

upstream https://github.com/IsaacRe/vllm-kvcompress/tree/main - related issues (3532, 5751)

ksjadeja · 2024-10-04T17:01:03Z

Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it from Q4 onwards.

sylviayangyy · 2024-10-12T06:39:51Z

Hi, is there a follow-up issue or Slack channel for the "KV cache offload to CPU and disk" task? Our team has previously explored some KV cache offload work based on vLLM, and we'd be happy to join any relevant discussion or contribute to the development if there is a chance.

Personally, I am also looking forward to learning more about the "More control in prefix caching, and scheduler policies" part 😊.

zeroorhero · 2024-10-12T06:41:34Z

@simon-mo Hi, regarding the topic "KV cache offload to CPU and disk": I previously implemented a version that stores the KV cache in a local file (#8018). I also added the relevant abstractions, so other storage media can be supported. Is there a Slack channel for this? We could discuss the specific design there. I am also quite interested in this feature.

simon-mo · 2024-10-14T18:16:09Z

@sylviayangyy @zeroorhero thank you for your interest! Yes, @KuntaiDu has created a #feat-kvcache-offloading channel to discuss that.

jeejeelee · 2024-10-16T14:40:16Z

> Do we have plans to support #5540? We are having a production level use case and would really appreciate if someone can look into it for Q4 onwards.

It looks like LoRA is now supported. Are you encountering any issues?

iiLaurens · 2024-10-19T21:27:53Z

Any plans on improving guided decoding? There's a long-standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately, it seems to have been forgotten since.

In particular, I'd love to see it become async (the logit mask or biases can be calculated while the GPU is working on the logits) and to see it fast-forward tokens when the next few tokens are deterministic.
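As a minimal sketch of the overlap described above: the mask is submitted to a worker thread before the forward pass starts, and the two results are combined afterwards. `compute_mask` and `forward_pass` are hypothetical stand-ins for the grammar step and the GPU step; neither is a vLLM API.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def compute_mask(allowed_token_ids, vocab_size):
    """Hypothetical CPU-side step: 0.0 for allowed tokens, -inf for the rest."""
    mask = [-math.inf] * vocab_size
    for tok in allowed_token_ids:
        mask[tok] = 0.0
    return mask


def forward_pass(vocab_size):
    """Stand-in for the GPU forward pass; returns dummy logits."""
    return [float(i) for i in range(vocab_size)]


def guided_step(allowed_token_ids, vocab_size, pool):
    # Submit the mask first so it is computed while the forward pass runs.
    mask_future = pool.submit(compute_mask, allowed_token_ids, vocab_size)
    logits = forward_pass(vocab_size)
    mask = mask_future.result()
    masked = [logit + m for logit, m in zip(logits, mask)]
    # Greedy choice among the tokens the mask left available.
    return max(range(vocab_size), key=masked.__getitem__)


with ThreadPoolExecutor(max_workers=1) as pool:
    next_token = guided_step({1, 3}, vocab_size=5, pool=pool)
```

In pure Python the two steps still contend for the GIL, so the benefit only appears when the forward pass is a GPU call that releases it and the mask computation runs in native code.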

HuYunhai-Alex · 2024-10-19T21:29:38Z

Is there an opportunity to participate in changes related to speculative decoding? I'm working on some practices that could help.

devdev999 · 2024-10-22T06:58:11Z

> Any plans on improving guided decoding? There's a long standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately seems to have been forgotten since.
>
> In particular I'd love to see it become async (logit mask or biases can be calculated while GPU is working on calculating logits) and fast forwarding tokens when the next few tokens are deterministic.

I second this. We are using vLLM to host our production inference servers and all of our downstream applications rely on guided json decoding to ensure that output is parsable. There is a significant performance difference between guided and non-guided decoding and any performance improvements would be helpful to increase throughput.

Harsha-Nori · 2024-10-22T23:33:30Z

> I second this. We are using vLLM to host our production inference servers and all of our downstream applications rely on guided json decoding to ensure that output is parsable. There is a significant performance difference between guided and non-guided decoding and any performance improvements would be helpful to increase throughput.

Hey, I maintain the guidance project and we worked on the first proposal in #6273. It looks like vLLM has changed significantly since then, but if there is appetite from the maintainers for upgraded, more performant guided decoding, we're happy to take another look and investigate a new PR. In particular, guidance (and our high-performance Rust implementation in llguidance) already does async computations on the CPU, calculates fast-forward tokens, etc., and is typically accelerative for JSON schema.

@JC1DA @mmoskal
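To illustrate what "fast-forward tokens" means here, the following is a toy sketch and not how guidance or llguidance implements it: whenever the grammar allows exactly one next token, that token can be appended without a model call. `allowed_next` and the toy grammar are hypothetical names introduced for this example.

```python
def fast_forward(prefix, allowed_next, max_tokens=16):
    """Collect tokens for as long as the grammar allows exactly one continuation."""
    forced = []
    while len(forced) < max_tokens:
        allowed = allowed_next(prefix + forced)
        if len(allowed) != 1:
            break  # the model has a real choice again, so stop here
        forced.append(next(iter(allowed)))
    return forced


# Toy grammar: the output must start with a fixed JSON prefix, then is free-form.
TEMPLATE = ['{', '"name"', ':', '"']
VOCAB = {'{', '"name"', ':', '"', 'Alice', 'Bob', '}'}


def toy_allowed_next(tokens):
    if len(tokens) < len(TEMPLATE):
        return {TEMPLATE[len(tokens)]}
    return VOCAB  # any token is allowed once the fixed prefix is complete


forced_tokens = fast_forward([], toy_allowed_next)
```

With this toy grammar the four template tokens are emitted with no forward passes; a JSON schema with fixed keys produces the same kind of deterministic runs.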

ksjadeja · 2024-10-29T05:31:42Z

> It looks like LoRA is now supported. Are you encountering any issues?

Yes. The class in mixtral_quant.py does not include SupportsLoRA, which means LoRA is not supported for quantized Mixtral, whereas MixtralForCausalLM in mixtral.py does include SupportsLoRA. I have a trained LoRA adapter that I want to use on top of a Mixtral AWQ model without merging, directly as a hot swap. Let me know if you know a better way to tackle this situation.

jeejeelee · 2024-10-29T07:17:54Z

> Yes, if we look at the class in mixtral_quant.py, it does not have SupportsLora which means lora is not supported for quantized Mixtral. but for mixtral.py, we have SupportsLora included in MixtralForCausalLM. I have a LORA adapter trained which I want to use on top of mixtral-awq model without merging, directly as a hot swap. Let me know if you know a better way to tackle this situation

I'm guessing you explicitly set the quantization, right? If so, you can try removing that argument and test it out, like the following script:

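from vllm import LLM

# No explicit quantization argument is passed; it is left for vLLM to detect from the checkpoint.
llm = LLM(
    model="Mixtral-8x7B-Instruct-v0.1-GPTQ",
    trust_remote_code=True,
    gpu_memory_utilization=0.6,
    enable_lora=True,  # enable LoRA adapter support
)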

dbuades · 2024-10-29T21:20:20Z

> Hey, I maintain the guidance project and we worked on the first proposal in #6273. Looks like vLLM has changed significantly since then, but if there is appetite for upgraded/more performant guided decoding work from the maintainers, we're happy to take another look and investigate a new PR. In particular, guidance (and our high performance rust implementation in llguidance) already does async computations on CPU, calculates fast forward tokens, etc. and is typically accelerative for JSON schema.
>
> @JC1DA @mmoskal

Improvements in guided generation performance would be very welcome. There is a helpful comment by @stas00 from last month with a nice summary of where things currently stand.

ksjadeja · 2024-10-30T02:14:01Z

> I'm guessing you explicitly set the quantization, right? If so, you can try removing that argument and test it out, like the following script:
>
> llm = LLM(
>     model="Mixtral-8x7B-Instruct-v0.1-GPTQ",
>     trust_remote_code=True,
>     gpu_memory_utilization=0.6,
>     enable_lora=True,
> )

I tried this, but it does not work; I get the same error. Note that I am using an AWQ-quantized model:
[rank0]: ValueError: Model MixtralForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github.

jeejeelee · 2024-10-30T02:41:57Z

> Tried this, but does not work. I get the same error. Just mentioning that I use awq quantized model
> [rank0]: ValueError: Model MixtralForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github.

Which vLLM version are you using?

According to the code in https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/model_executor/model_loader/utils.py#L30, both the GPTQ and AWQ quantization methods should be compatible when using version 0.6.3.post1.

Edenzzzz · 2024-11-11T03:13:50Z

Any interest in vAttention?
#4675

niuzheng168 · 2024-11-14T03:21:08Z

More and more speech models use an LLM to predict non-text tokens. ChatTTS and FishTTS, for example, both use a Llama model to predict speech tokens.
Unlike Llama for text, a speech Llama uses multiple lm_head layers to predict more than one token in parallel, and therefore sums the n token embeddings when building the LLM input embedding.
I am currently trying to get ChatTTS running with vLLM (see here), but a lot of code needs to be updated and the change seems to break some fundamental design assumptions. So maybe you could consider supporting it officially. It would definitely have a large impact on speech solutions.
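The following toy sketch shows the two differences described above, a summed input embedding and several lm_head projections on one hidden state. All sizes and names are assumptions made for illustration, the transformer layers are omitted, and none of it is taken from ChatTTS, FishTTS, or vLLM.

```python
import random

NUM_HEADS = 2   # parallel speech-token streams (assumed value)
VOCAB_SIZE = 4  # tokens per stream (assumed value)
HIDDEN = 3      # hidden size (assumed value)

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One embedding table and one lm_head per stream.
embeddings = [rand_matrix(VOCAB_SIZE, HIDDEN) for _ in range(NUM_HEADS)]
lm_heads = [rand_matrix(VOCAB_SIZE, HIDDEN) for _ in range(NUM_HEADS)]


def input_embedding(tokens):
    """Sum the per-stream embeddings of the n tokens emitted at one step."""
    vectors = [embeddings[h][tok] for h, tok in enumerate(tokens)]
    return [sum(dims) for dims in zip(*vectors)]


def predict(hidden):
    """Apply every lm_head to the same hidden state and pick one token per head."""
    tokens = []
    for head in lm_heads:
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in head]
        tokens.append(max(range(VOCAB_SIZE), key=logits.__getitem__))
    return tokens


step_input = input_embedding([1, 3])
next_tokens = predict(step_input)  # the hidden state would normally come from the model
```

Each decoding step therefore consumes and produces NUM_HEADS tokens per sequence, which is the part that conflicts with a one-token-per-step sampler.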

kentoym · 2024-11-14T19:49:53Z

> Improvements in guided generation performance would be very welcome. There is a helpful comment by @stas00 from last month with a nice summary of where things currently stand.

Do we have plans to improve concurrency performance for guided decoding?
Enabling guided_json for concurrent requests results in significant throughput and latency degradation. (#3567)

Enhancements in concurrency performance for guided decoding would greatly benefit high-volume, real-time applications.

Harsha-Nori · 2024-11-15T02:08:18Z

Quick update -- we've made an initial PR to support guidance as a backend, which does improve performance over current implementations (#10217). Of course, better support for concurrency in general would also help guidance get significantly faster. Happy to support there and help if we can too!

@JC1DA

wanghongyu2001 · 2024-11-24T12:58:17Z

I am interested in optimizations related to speculative decoding. Is there an opportunity to get involved?

Toubat · 2024-11-24T23:13:52Z

I have a somewhat similar question to @wanghongyu2001: if someone is interested in contributing to a specific aspect of vLLM, what’s the recommended path to get involved? Specifically, are there any suggested learning resources to systematically understand the vLLM codebase and, in particular, the v1 architecture?

In addition to navigating through the codebase, are there other structured ways to ramp up, such as design docs, or suggested youtube videos (in case I miss anything), any important PRs/files worth reading through? Would be thrilled to dive in and contribute to the project. Any guidance would be much appreciated!

JC1DA · 2024-11-26T01:22:24Z

> Do we have plans to improve concurrency performance for guided decoding? Enabling guided_json for concurrent requests results in significant throughput and latency degradation. (#3567)
>
> Enhancements in concurrency performance for guided decoding would greatly benefit high-volume, real-time applications.

We could definitely use a thread pool to process the list of logits in parallel. Because vLLM can run a different number of logits processors for each sequence in a batch, a batched logits processor seems complex to implement.
However, using a thread pool also requires some mandatory changes in the guided decoding libraries:

- They must be thread-safe. From what I have experimented with so far, lm-format-enforcer does not seem to be thread-safe and failed some tests when run with a thread pool.
- PyTorch in-place operations must be removed; again, these ops failed when used in a thread pool.
- They need an efficient implementation that releases the GIL immediately after being called.

Also, I think vLLM is capable of providing multiple output tokens per sequence per step; we can leverage that for fast-forwarded tokens in JSON guided generation (super beneficial for performance).
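A minimal sketch of the thread-pool idea, assuming plain Python lists in place of tensors and a hypothetical allow-list processor: each sequence's processor chain is one task, and processors return new lists rather than mutating their input. This is not vLLM's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def make_allow_list_processor(allowed):
    """Hypothetical processor: keeps allowed token ids and masks the rest."""
    allowed = frozenset(allowed)

    def processor(logits):
        # Build a new list instead of modifying the input (no in-place ops).
        return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

    return processor


def apply_processors(logits, processors):
    for proc in processors:
        logits = proc(logits)
    return logits


def process_batch(batch_logits, batch_processors, pool):
    """Run each sequence's own processor chain as one task in the pool."""
    futures = [
        pool.submit(apply_processors, logits, procs)
        for logits, procs in zip(batch_logits, batch_processors)
    ]
    return [f.result() for f in futures]


batch_logits = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]
batch_processors = [
    [make_allow_list_processor({0, 1}), make_allow_list_processor({1, 2})],
    [],  # a sequence with no processors passes through unchanged
]
with ThreadPoolExecutor(max_workers=2) as pool:
    processed = process_batch(batch_logits, batch_processors, pool)
```

Because every sequence carries its own list of processors, the differing chain lengths need no special handling. Real speedups still depend on the processors releasing the GIL, as noted in the list above.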

gpgn · 2024-11-27T09:37:44Z

Interested in thoughts/plan on EXL2 support: #3203

dongxiaolong · 2024-11-29T02:20:16Z

> Do we have plans to improve concurrency performance for guided decoding? Enabling guided_json for concurrent requests results in significant throughput and latency degradation. (#3567)
>
> Enhancements in concurrency performance for guided decoding would greatly benefit high-volume, real-time applications.

Integrating xgrammar could be a good choice: https://github.com/mlc-ai/xgrammar

jannikstdl · 2024-11-29T15:33:03Z

> [ ] Better kernels (FA3, FlashInfer, FlexAttention, Triton)

Which kernel is vLLM using right now? I am asking in light of #10780.

yumc2573 · 2024-12-18T02:01:16Z

Hello,

I noticed that you have already merged the PR for this bug (function_name: Union[str, None] = current_tool_call.get("name")). Could you please tell me which released image version includes the fix? Additionally, could you share the timeline for releasing the new image? I am currently using vllm/vllm-openai:v0.6.3.post1.

Thank you.

simon-mo changed the title ~~[Roadmap]: vLLM Roadmap Q4 2024~~ [Roadmap] vLLM Roadmap Q4 2024 Oct 1, 2024

simon-mo mentioned this issue Oct 1, 2024

[Roadmap] vLLM Roadmap Q3 2024 #5805

Closed

46 tasks

simon-mo pinned this issue Oct 1, 2024

amd-abhikulk mentioned this issue Oct 4, 2024

[Misc]: Need to understand support for torch.compile in Q4 roadmap #9072

Closed

1 task

russellb mentioned this issue Oct 16, 2024

[Misc]: [Question] vLLM's model loading & instance contract, 1 model per vLLM instance, or multiple models per vLLM instance #9429

Open

1 task


simon-mo commented Oct 1, 2024 (edited by DarkLight1337)

Themes.

Broad Model Support

Hardware Support

Performance Optimizations

Production Features

OSS Community

Extensible Architecture
