[Bugfix] Multiple fixes to tool streaming with hermes and mistral parsers #10782
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Force-pushed from 0b99d2d to 852017c
Almost all of the issues identified are intermittent, but they occur frequently depending on the rate at which tokens are returned. While I believe I've fixed the main edge cases, I remain unhappy with how hacky the overall streaming tool parsing is and the inelegant fixes required to make it work, so I've provided thoughts on refactoring below as well. In addition to the items mentioned in my original bug report, there was inconsistent usage of json.dumps()'s ensure_ascii parameter in the various tool parsers.
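For illustration, this is the difference ensure_ascii makes in isolation (the argument values below are made up, not taken from a real trace):

```python
import json

# Hypothetical tool-call arguments containing non-ASCII text.
args = {"city": "München", "note": "São Paulo office"}

# The default ensure_ascii=True escapes every non-ASCII character, so the
# streamed argument text no longer matches what the model actually generated.
print(json.dumps(args))
# {"city": "M\u00fcnchen", "note": "S\u00e3o Paulo office"}

# ensure_ascii=False keeps the UTF-8 text as-is.
print(json.dumps(args, ensure_ascii=False))
# {"city": "München", "note": "São Paulo office"}
```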
From my testing, the tokens are generally being returned from Qwen2.5 and Mistral-Large in UTF-8, and we should avoid further escaping as part of the tool parsing. Using ensure_ascii=False as the default may require other tool parsers to be updated to better support streaming non-ASCII parameters, but given that they are already affected by at least the first (and likely all three) of the problems mentioned in my bug report, this PR should already improve the results and give others a pattern for fixing the issues I'm seeing.

Refactoring thoughts: A refactor of tool streaming would use the arguments as they come in, with a cursor tracking what has been sent to the client, rather than the constant loads/dumps/diff pattern that is currently in place. We should also not be updating the arrays of what has been sent to the client before data is actually sent to the client, as that pattern (which I would need to change in all tool parsers to fix) is the root cause of the missing final delta.
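A minimal sketch of what the cursor idea could look like (class and method names are hypothetical, not existing vLLM APIs):

```python
class CursorArgumentStreamer:
    """Sketch of cursor-based argument streaming: accumulate the raw
    argument text as it arrives and emit only the suffix that has not
    yet been sent, instead of the loads/dumps/diff round-trips."""

    def __init__(self) -> None:
        self.buffer = ""   # raw argument text received so far
        self.cursor = 0    # index of the first character not yet sent

    def feed(self, delta_text: str) -> str:
        """Append a new model delta and return the unsent suffix."""
        self.buffer += delta_text
        return self.buffer[self.cursor:]

    def mark_sent(self) -> None:
        """Advance the cursor only after the delta actually went out,
        so a final flush can still pick up anything left behind."""
        self.cursor = len(self.buffer)
```

At end of stream, whatever is still past the cursor becomes the final delta, which is exactly the text that gets lost with the current update-before-send pattern.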
cc @K-Mistele
Did some extensive testing this evening, including smaller models that generate at faster speeds. Fixed a few minor existing bugs, but there are two edge cases that will require more involved fixes:
You will currently get empty arguments, as the current logic in the tool parsers does json.loads() when the first delta has been provided. It is too early then to get a valid JSON object, so arguments remains {}. When the second delta is received, it also includes the tool-end token. The arguments array is still {} and the "end" sequence currently doesn't handle this case. The simplest fix would be to add more checks for this situation, reload the arguments, and concatenate the final delta to it, but this is more complicated than it sounds. This is also not a regression, and my sense is that this one waits for a refactor that avoids all of the JSON rewriting in the first place.
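To make the timing issue concrete, here is a toy reproduction (the argument text and tool-end token are illustrative, not taken from a real trace):

```python
import json
from typing import Optional


def try_parse(buffer: str) -> Optional[dict]:
    # Parsing too early (after only the first delta) fails, which is why
    # the recorded arguments stay {} until the buffer is re-parsed.
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


first_delta = '{"location": "San'
assert try_parse(first_delta) is None          # parser keeps arguments = {}

# The second delta both completes the JSON and carries the tool-end token,
# so the end-of-call path has to re-parse the full buffer instead of
# trusting the still-empty arguments it recorded earlier.
second_delta = ' Francisco"}</tool_call>'
full_args = (first_delta + second_delta).removesuffix("</tool_call>")
assert try_parse(full_args) == {"location": "San Francisco"}
```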
Found a fix for tool calls with short arguments when using the Mistral and Hermes parsers. All tests are now passing as well. I believe this PR should be ready. We can open a separate issue on truncation with speculative decoding, which is a pre-existing issue that doesn't seem related to the tool parser (debug logs show the client session being finished while deltas are still being processed). I suspect that we will find other streaming bugs in the other (non-hermes/mistral) tool parsers, but we're not introducing regressions with this PR, and we're providing fix patterns that others can use to resolve similar issues. I also suspect that we'll still find some edge cases depending on chunking, so a refactor to make the tool parsing code less fragile should still be considered at some point. Looking through some other bug reports, it looks like this PR may also fix #10589, but the issue doesn't have enough reproduction code for me to test.
During the startup of the api server, the setup function is called multiple times (every 5s). So the longer the startup time (generally for larger models), the more consumers are contending for the output. This can then lead to a race condition where the order of the answer tokens is wrong. Introduced here: vllm-project#9973 References: vllm-project#10376 vllm-project#10589 vllm-project#10782 Signed-off-by: Jannis Schönleber <[email protected]>
I reviewed the test failure after commit 562e91b. The errors surface in the pythonic tool parser, which does not interact with the hermes or mistral parsers that I updated in the latest commit. The previous commit passes all tests, and my local pytest run also passes with the latest commit. I suspect that the pythonic (and other) parsers may have issues similar to the ones I solved in this PR for mistral/hermes that only surface intermittently, as the logs in this run show that the streaming tool call arguments are truncated. I suggest a separate bug report & fix for the intermittent streaming issues with the pythonic parser rather than continuing to grow this PR.
Force-pushed from 4fdc258 to 0f9e8cf
This pull request has merge conflicts that must be resolved before it can be merged.
Ugh! Too much copy/paste while fixing my missing DCO sign-off after accepting a suggestion. This is going to be fun to undo.
Force-pushed from 0f9e8cf to 1c768fe
Fixes chat completion tool argument streaming when using auto tool choice.
FIX #10781
Adds the unsent delta text when sending the final streaming delta. Tested with Qwen2.5 instruct models for Hermes parsing and Mistral-Large-Instruct-2411 for the Mistral parser. Minimized changes to avoid introducing new issues to other parsers, but they may still have existing issues that have not been reported.
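A rough sketch of the fix pattern (function and variable names here are hypothetical; the actual change lives in the hermes and mistral parser streaming paths):

```python
def remaining_arguments_delta(streamed_args: str, sent_args: str) -> str:
    """Return argument text that was accumulated but never streamed, so it
    can be appended to the final delta instead of being dropped."""
    if streamed_args.startswith(sent_args):
        return streamed_args[len(sent_args):]
    # If the bookkeeping disagrees, send nothing rather than duplicate text;
    # a real implementation would log this inconsistency.
    return ""


# Example: the closing characters were recorded as sent before they
# actually reached the client.
print(remaining_arguments_delta('{"city": "Paris"}', '{"city": "Paris'))  # '"}'
```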
Longer-term, we should refactor the tool streaming code in general, as it creates a lot of temporary strings and has a lot of complicated logic that would be less necessary if we just used a cursor to track what had already been streamed vs. continuously looking for diffs.