[Bugfix] Multiple fixes to tool streaming with hermes and mistral parsers #10782

Closed

Conversation

cedonley
Contributor

@cedonley cedonley commented Nov 29, 2024

Fixes chat completion tool argument streaming when using auto tool choice.

FIX #10781

Adds the unsent delta text when sending the final streaming delta. Tested with Qwen2.5 instruct models for Hermes parsing and Mistral-Large-Instruct-2411 for the Mistral parser. Changes were minimized to avoid introducing new issues in other parsers, though those parsers may still have existing issues that have not been reported.

Longer-term, we should refactor the tool streaming code in general: it creates a lot of temporary strings and has a lot of complicated logic, much of which would be unnecessary if we used a cursor to track what has already been streamed instead of continuously looking for diffs.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the frontend label Nov 29, 2024
@cedonley cedonley force-pushed the fix_toolstream_truncate branch 2 times, most recently from 0b99d2d to 852017c on November 30, 2024 01:52
@cedonley
Contributor Author

Almost all of the issues identified are intermittent, but they occur frequently depending on the rate at which tokens are returned. While I believe I've fixed the main edge cases, I remain unhappy with how hacky the overall streaming tool parsing is and with the inelegant fixes required to make it work, so I've also provided thoughts on refactoring below.

In addition to the items mentioned in my original bug report, there was inconsistent usage of json.dumps()'s ensure_ascii parameter across the tool parsers:

  • Hermes set it to False in one location, but most usages left it at the default (True)
  • Mistral generally used the default (True)

From my testing, tokens are generally returned from Qwen2.5 and Mistral-Large in UTF-8, and we should avoid further escaping as part of tool parsing.

Using ensure_ascii=False as the default may require updates to other tool parsers to better support streaming non-ASCII parameters, but given that they are already affected by at least the first (and likely all three) of the problems mentioned in my bug report, this PR should already improve results and give others a pattern for fixing similar issues.
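
As a quick illustration of the difference (the arguments below are made up for the example, not taken from a real tool call):

```python
import json

# Hypothetical tool arguments containing non-ASCII text.
args = {"city": "東京", "unit": "celsius"}

# Default behaviour (ensure_ascii=True) escapes every non-ASCII character,
# so the client receives "\u6771\u4eac" instead of the original UTF-8 text.
print(json.dumps(args))
# {"city": "\u6771\u4eac", "unit": "celsius"}

# With ensure_ascii=False the UTF-8 characters pass through unchanged,
# matching what the model actually generated.
print(json.dumps(args, ensure_ascii=False))
# {"city": "東京", "unit": "celsius"}
```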

Refactoring thoughts

A refactor of tool streaming would consume the arguments as they arrive, with a cursor tracking what has already been sent to the client, rather than the constant loads/dumps/diff pattern currently in place. We should also stop updating the record of what has been sent to the client before the data is actually sent, as that pattern (which I would need to change in all tool parsers to fix) is the root cause of the missing final delta. A rough sketch of the cursor idea follows.
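
A minimal sketch of what that could look like, assuming the parser can see the full raw argument text so far. The names here (ToolStreamCursor, next_delta) are illustrative, not existing vLLM APIs:

```python
from dataclasses import dataclass


@dataclass
class ToolStreamCursor:
    """Hypothetical helper: remember how much argument text was already streamed."""

    sent: int = 0  # number of argument characters already streamed

    def next_delta(self, full_arguments: str) -> str:
        """Return the not-yet-streamed suffix of the argument text."""
        if self.sent >= len(full_arguments):
            return ""
        delta = full_arguments[self.sent:]
        # Advance the cursor only when the delta is actually handed to the
        # client, so the final flush can never be silently dropped.
        self.sent = len(full_arguments)
        return delta


cursor = ToolStreamCursor()
print(cursor.next_delta('{"pro'))              # -> {"pro
print(cursor.next_delta('{"prompt": "dog"}'))  # -> mpt": "dog"}
```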

@DarkLight1337
Member

cc @K-Mistele

@cedonley
Contributor Author

Did some extensive testing this evening, including smaller models that generate at faster speeds. I fixed a few minor existing bugs, but there are two edge cases that will require more involved fixes:

  1. When using speculative decoding with Qwen2.5, the final delta is still not sent. I have no idea why and am not planning to fix it in this PR. It's not a regression and probably not worth fixing before refactoring the streaming tool code.

  2. If you have a single argument with a very short value and it comes in a sequence like:

Delta 1: {"pro
Delta 2: mpt": "dog"}

You will currently get empty arguments, because the current logic in the tool parsers calls json.loads() as soon as the first delta has been provided. At that point it is too early to get a valid JSON object, so arguments remains {}.

When the second delta is received, it also includes the tool-end token. The arguments are still {} and the "end" handling currently doesn't cover this case. The simplest fix will be to add more checks for this situation, re-parse the arguments, and concatenate the final delta, but this is more complicated than it sounds. It is also not a regression, and my sense is that this one should wait for a refactor that avoids all of the JSON rewriting in the first place.
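
A toy reproduction of the idea behind the eventual fix (not the parser's actual code): keep the raw argument text accumulated across deltas and re-parse it once it forms a complete JSON object, rather than trusting the snapshot taken after the first delta.

```python
import json

# The two deltas from the example above: a single short argument split
# across chunks, with the closing brace arriving only in the second delta.
deltas = ['{"pro', 'mpt": "dog"}']

buffer = ""
parsed: dict = {}
for delta in deltas:
    buffer += delta
    try:
        # Only a complete JSON object parses; the first delta alone raises
        # JSONDecodeError, which is why snapshotting after delta 1 yields {}.
        parsed = json.loads(buffer)
    except json.JSONDecodeError:
        continue

print(parsed)  # {'prompt': 'dog'}
```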

@cedonley
Contributor Author

Found a fix for tool calls with short arguments when using Mistral and Hermes parsers. All tests are now passing as well. I believe this PR should be ready.

We can open a separate issue on truncation with speculative decoding, which is a pre-existing issue that doesn't seem related to the tool parser (debug logs show the client session being finished while deltas are still being processed).

I suspect that we will find other streaming bugs in the other (non-hermes/mistral) tool parsers, but we're not introducing regressions with this PR and we're providing fix patterns that others can use to resolve similar issues. I also suspect that we'll still find some edge cases depending on chunking, so a refactor to make the tool parsing code less fragile should still be considered at some point.

Looking through some other bug reports, it looks like this PR may also fix #10589, but that issue doesn't have enough reproduction code for me to test.

joennlae added a commit to 44ai-labs/vllm that referenced this pull request Dec 1, 2024
During startup of the API server, the setup function is called
multiple times (every 5s), so the longer the startup time
(generally for larger models), the more consumers are contending for the
output. This can then lead to a race condition where the order of the
answer tokens is wrong.

Introduced here: vllm-project#9973

References:
vllm-project#10376
vllm-project#10589
vllm-project#10782

Signed-off-by: Jannis Schönleber <[email protected]>
@cedonley cedonley mentioned this pull request Dec 5, 2024
@cedonley
Contributor Author

cedonley commented Dec 5, 2024

This PR now also resolves additional Mistral inconsistencies between stream=True and stream=False related to the generation of the tool_call_id, and resolves issues #10821 and #10900.

@cedonley cedonley changed the title [Bugfix] Multiple fixes to tool streaming when using auto tool choice. [Bugfix] Multiple fixes to tool streaming with hermes and mistral parsers Dec 5, 2024
@cedonley
Contributor Author

cedonley commented Dec 6, 2024

I reviewed the test failure after commit 562e91b. The errors surface in the pythonic tool parser, which does not interact with the hermes or mistral parsers that I updated in the latest commit. The previous commit passes all tests and my local pytest run also passes with the latest commit.

I suspect that the pythonic (and other) parsers may have issues similar to the ones I solved in this PR for mistral/hermes that only surface intermittently, as the logs in this run show that the streamed tool call arguments are truncated. I suggest a separate bug report and fix for the intermittent streaming issues with the pythonic parser rather than continuing to grow this PR.

@mergify mergify bot added the documentation Improvements or additions to documentation label Dec 7, 2024

mergify bot commented Dec 7, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @cedonley.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@cedonley
Contributor Author

cedonley commented Dec 7, 2024

Ugh! Too much copy/paste while fixing my missing DCO sign-off after accepting a suggestion. This is going to be fun to undo.

@cedonley cedonley closed this Dec 7, 2024
@cedonley cedonley force-pushed the fix_toolstream_truncate branch from 0f9e8cf to 1c768fe on December 7, 2024 17:01
@cedonley cedonley deleted the fix_toolstream_truncate branch December 7, 2024 17:15
joennlae added a commit to 44ai-labs/vllm that referenced this pull request Dec 15, 2024
Labels
ci/build, documentation, frontend, needs-rebase
Development

Successfully merging this pull request may close these issues.

[Bug]: Streaming w/ tool choice auto often truncates the final delta in the streamed arguments
3 participants