[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication #9519
Comments
Thanks again @stas00 for this detailed analysis! You are clearly correct re the (2) bug. I will add that there is a per-request option for this as well. Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer), but it is the case for e.g. the llama and mistral tokenizers.
I actually was trying to test it out already, but the openai's …
Once the dust settles on this Issue, let's document the various ways BOS is handled, including your note above?
Ah, thank you so much Nick, for some reason I was drawing a blank on finding how to use it. Now I was able to tell vllm not to add BOS in the online generate case with:
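The exact snippet is not quoted above. A minimal sketch of what such a per-request override might look like via the OpenAI client's `extra_body`, assuming the option is named `add_special_tokens` (an assumption on my part, not confirmed by this thread):

```python
# Sketch only: pass a server-specific per-request option through extra_body.
# "add_special_tokens" is an assumed name for the switch referred to above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="<|begin_of_text|>Today is",
    max_tokens=8,
    extra_body={"add_special_tokens": False},  # assumed per-request switch
)
print(completion.choices[0].text)
```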
super!
should the …
Possibly it might require disambiguation with:
Your current environment
0.6.3.post1
🐛 4 generation scenarios
There are at least 4 generation use cases in vLLM (a rough sketch of all four follows below):

1. offline `generate`
2. offline `chat`
3. online `client.completion`
4. online `client.chat.completions`

It's up to the user whether they want to handle the chat template themselves and then use (1) or (3), or let vllm do the chat template handling, and then it's (2) or (4).
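For concreteness, a rough sketch of the four call paths (the model name, prompts, and local server URL below are placeholders, not taken from the original report):

```python
from openai import OpenAI
from vllm import LLM, SamplingParams

model = "meta-llama/Meta-Llama-3-8B-Instruct"
prompt = "Today is"
messages = [{"role": "user", "content": "Today is"}]

# offline (LLM class)
llm = LLM(model=model)
llm.generate(prompt, SamplingParams(max_tokens=8))    # (1) user handles any chat template
llm.chat(messages, SamplingParams(max_tokens=8))      # (2) vllm applies the chat template

# online (OpenAI-compatible server)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
client.completions.create(model=model, prompt=prompt, max_tokens=8)            # (3)
client.chat.completions.create(model=model, messages=messages, max_tokens=8)   # (4)
```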
Summary of BOS injection/duplication
I have traced all 4 APIs wrt BOS-injection and here is what I see (0.6.3.post1):
1. `generate` - the client sorts out the chat template - BOS is forced always - so generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o `<|begin_of_text|>`
2. `chat` - BOS is still always forced - so generates 2 BOS tokens if the template already has one - this is a BUG and can't be overcome by a user, other than by passing a custom chat template which has `<|begin_of_text|>` manually removed.
3. `client.completion` - the client sorts out the chat template - BOS is forced always - so generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o `<|begin_of_text|>`
4. `client.chat.completions` - the chat template is applied on the server side: here the BOS isn't added twice - if the template contains `<|begin_of_text|>` it encodes it properly - ending up with a single BOS

Expectations and bugs
So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens in the prompt since a normal prompt is expected here. (normal == pure text w/o any special tokens - as in "Today is")
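A minimal sketch of the kind of user-side stripping this implies, using the HF tokenizer (the helper below is mine, not part of vllm):

```python
# Sketch: strip a leading BOS string from a prompt before handing it to (1) or (3),
# since those paths add BOS themselves.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def strip_leading_bos(prompt: str) -> str:
    # tok.bos_token is e.g. "<|begin_of_text|>" for llama-3 tokenizers
    if tok.bos_token and prompt.startswith(tok.bos_token):
        return prompt[len(tok.bos_token):]
    return prompt

print(strip_leading_bos("<|begin_of_text|>Today is"))  # -> "Today is"
```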
(2) is clearly a bug and it's inconsistent with (4). With `meta-llama/Meta-Llama-3-8B-Instruct` you would see this logged with (2): `{'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ...` where `128000` is the BOS token.

(4) used to have this problem but has been fixed in #4688
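The double `128000` can also be reproduced outside vllm with just the HF tokenizer, which is presumably what the buggy path amounts to (the chat template already emits BOS, and the subsequent encode adds another):

```python
# Sketch: reproduce the double BOS with the HF tokenizer alone.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "system", "content": "You are a helpful assistant."}]

# The chat template already emits <|begin_of_text|> ...
text = tok.apply_chat_template(messages, tokenize=False)

# ... so encoding it again with add_special_tokens=True prepends a second BOS.
print(tok(text, add_special_tokens=True).input_ids[:3])   # [128000, 128000, 128006]

# Encoding with add_special_tokens=False keeps a single BOS.
print(tok(text, add_special_tokens=False).input_ids[:3])  # [128000, 128006, 9125]
```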
Analysis process
The online API already logs the token ids it's about to feed to the model so that was easy. The offline API doesn't do it - so I had to add:
Request: is it possible to codify the above diff - so that the user could debug the offline scenario in the same way the online scenario currently logs:
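As a rough illustration (not the diff or log line referenced above): the offline `generate` already returns `RequestOutput` objects that carry `prompt_token_ids`, so the ids fed to the model can at least be inspected after the fact:

```python
# Rough illustration: inspect the token ids the offline engine actually used,
# via the RequestOutput objects returned by generate().
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
outs = llm.generate(["<|begin_of_text|>Today is"], SamplingParams(max_tokens=8))

for out in outs:
    # With a duplicated BOS this starts with two 128000 ids.
    print({"prompt": out.prompt, "prompt_token_ids": out.prompt_token_ids})
```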
Needed documentation
Wrt (1) and (3) I'd imagine vllm should have clear documentation of when it adds BOS forcefully. That is, ideally the `prompt` doc needs to say that it must not include `tokenizer.bos_token` (e.g. `<|begin_of_text|>` in many tokenizers).

Reproduction
To reproduce I was just using your examples:
etc. but prepended the existing prompt with `<|begin_of_text|>` in the non-chat examples to test.

Thank you!
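A rough sketch of the offline chat case (2) described above, assuming access to `meta-llama/Meta-Llama-3-8B-Instruct` and relying on `RequestOutput.prompt_token_ids` for inspection:

```python
# Sketch of the (2) scenario: llm.chat applies the chat template (which already
# contains <|begin_of_text|>) and, per the report, BOS is then added again.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Today is"}]

outs = llm.chat(messages, SamplingParams(max_tokens=8))
# On 0.6.3.post1 this reportedly starts with two BOS ids: [128000, 128000, ...]
print(outs[0].prompt_token_ids[:5])
```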