[Bugfix] Multiple fixes to tool streaming with hermes and mistral parsers #10782
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Force-pushed from 0b99d2d to 852017c
Almost all of the issues identified are intermittent, but they occur frequently depending on the rate at which tokens are returned. While I believe I've fixed the main edge cases, I remain unhappy with how hacky the overall streaming tool parsing is and the inelegant fixes required to make it work, so I've provided thoughts on refactoring below as well. In addition to the items mentioned in my original bug report, there was inconsistent usage of json.dumps()'s ensure_ascii parameter in the various tool parsers.
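For illustration, this is the difference ensure_ascii makes in isolation (the argument values below are made up, not taken from a real trace):

```python
import json

# Hypothetical tool-call arguments containing non-ASCII text.
args = {"city": "München", "note": "São Paulo office"}

# The default ensure_ascii=True escapes every non-ASCII character, so the
# streamed argument text no longer matches what the model actually generated.
print(json.dumps(args))
# {"city": "M\u00fcnchen", "note": "S\u00e3o Paulo office"}

# ensure_ascii=False keeps the UTF-8 text as-is.
print(json.dumps(args, ensure_ascii=False))
# {"city": "München", "note": "São Paulo office"}
```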
From my testing, the tokens are generally being returned from Qwen2.5 and Mistral-Large in UTF-8, and we should avoid further escaping as part of the tool parsing. Using ensure_ascii=False as the default may require other tool parsers to be updated to better support streaming non-ASCII parameters, but given that they are already affected by at least the first (and likely all three) of the problems mentioned in my bug report, this PR should already improve the results and give others a pattern for fixing the issues I'm seeing.

Refactoring thoughts: A refactor of tool streaming would use the arguments as they come in, with a cursor tracking what has been sent to the client, rather than the constant loads/dumps/diff pattern that is currently in place. We should also not be updating the arrays of what has been sent to the client before data is actually sent to the client, as that pattern (which I would need to change in all tool parsers to fix) is the root cause of the missing final delta.
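A minimal sketch of what the cursor idea could look like (class and method names are hypothetical, not existing vLLM APIs):

```python
class CursorArgumentStreamer:
    """Sketch of cursor-based argument streaming: accumulate the raw
    argument text as it arrives and emit only the suffix that has not
    yet been sent, instead of the loads/dumps/diff round-trips."""

    def __init__(self) -> None:
        self.buffer = ""   # raw argument text received so far
        self.cursor = 0    # index of the first character not yet sent

    def feed(self, delta_text: str) -> str:
        """Append a new model delta and return the unsent suffix."""
        self.buffer += delta_text
        return self.buffer[self.cursor:]

    def mark_sent(self) -> None:
        """Advance the cursor only after the delta actually went out,
        so a final flush can still pick up anything left behind."""
        self.cursor = len(self.buffer)
```

At end of stream, whatever is still past the cursor becomes the final delta, which is exactly the text that gets lost with the current update-before-send pattern.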
cc @K-Mistele
Did some extensive testing this evening, including smaller models that generate at faster speeds. Fixed a few minor existing bugs, but there are two edge cases that will require more involved fixes:
You will currently get empty arguments, as the current logic in the tool parsers does json.loads() when the first delta has been provided. It is too early then to get a valid JSON object, so arguments remains {}. When the second delta is received, it also includes the tool-end token. The arguments array is still {} and the "end" sequence currently doesn't handle this case. The simplest fix would be to add more checks for this situation, reload the arguments, and concatenate the final delta to it, but this is more complicated than it sounds. This is also not a regression, and my sense is that this one waits for a refactor that avoids all of the JSON rewriting in the first place.
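To make the timing issue concrete, here is a toy reproduction (the argument text and tool-end token are illustrative, not taken from a real trace):

```python
import json
from typing import Optional


def try_parse(buffer: str) -> Optional[dict]:
    # Parsing too early (after only the first delta) fails, which is why
    # the recorded arguments stay {} until the buffer is re-parsed.
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


first_delta = '{"location": "San'
assert try_parse(first_delta) is None          # parser keeps arguments = {}

# The second delta both completes the JSON and carries the tool-end token,
# so the end-of-call path has to re-parse the full buffer instead of
# trusting the still-empty arguments it recorded earlier.
second_delta = ' Francisco"}</tool_call>'
full_args = (first_delta + second_delta).removesuffix("</tool_call>")
assert try_parse(full_args) == {"location": "San Francisco"}
```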
Found a fix for tool calls with short arguments when using the Mistral and Hermes parsers. All tests are now passing as well. I believe this PR should be ready. We can open a separate issue on truncation with speculative decoding, which is a pre-existing issue that doesn't seem related to the tool parser (debug logs show the client session being finished while deltas are still being processed). I suspect that we will find other streaming bugs in the other (non-hermes/mistral) tool parsers, but we're not introducing regressions with this PR, and we're providing fix patterns that others can use to resolve similar issues. I also suspect that we'll still find some edge cases depending on chunking, so a refactor to make the tool parsing code less fragile should still be considered at some point. Looking through some other bug reports, it looks like this PR may also fix #10589, but the issue doesn't have enough reproduction code for me to test.
During the startup of the api server, the setup function is called multiple times (every 5s). So the longer the startup time (generally for larger models), the more consumers are contending for the output. This can then lead to a race condition where the order of the answer tokens is wrong. Introduced here: vllm-project#9973 References: vllm-project#10376 vllm-project#10589 vllm-project#10782 Signed-off-by: Jannis Schönleber <[email protected]>
I reviewed the test failure after commit 562e91b. The errors surface in the pythonic tool parser, which does not interact with the hermes or mistral parsers that I updated in the latest commit. The previous commit passes all tests, and my local pytest run also passes with the latest commit. I suspect that the pythonic (and other) parsers may have issues similar to the ones I solved in this PR for mistral/hermes that only surface intermittently, as the logs in this run show that the streaming tool call arguments are truncated. I suggest a separate bug report & fix for the intermittent streaming issues with the pythonic parser rather than continuing to grow this PR.
Force-pushed from 4fdc258 to 0f9e8cf
This pull request has merge conflicts that must be resolved before it can be merged.
Ugh! Too much copy/paste while fixing my missing DCO sign-off after accepting a suggestion. This is going to be fun to undo.
Force-pushed from 0f9e8cf to 1c768fe
Fixes chat completion tool argument streaming when using auto tool choice.
FIX #10781
Adds the unsent delta text when sending the final streaming delta. Tested with Qwen2.5 instruct models for Hermes parsing and Mistral-Large-Instruct-2411 for the Mistral parser. Minimized changes to avoid introducing new issues to other parsers, but they may still have existing issues that have not been reported.
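A rough sketch of the fix pattern (function and variable names here are hypothetical; the actual change lives in the hermes and mistral parser streaming paths):

```python
def remaining_arguments_delta(streamed_args: str, sent_args: str) -> str:
    """Return argument text that was accumulated but never streamed, so it
    can be appended to the final delta instead of being dropped."""
    if streamed_args.startswith(sent_args):
        return streamed_args[len(sent_args):]
    # If the bookkeeping disagrees, send nothing rather than duplicate text;
    # a real implementation would log this inconsistency.
    return ""


# Example: the closing characters were recorded as sent before they
# actually reached the client.
print(remaining_arguments_delta('{"city": "Paris"}', '{"city": "Paris'))  # '"}'
```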
Longer-term, we should refactor the tool streaming code in general, as it creates a lot of temporary strings and has a lot of complicated logic that would be less necessary if we just used a cursor to track what had already been streamed vs. continuously looking for diffs.