
[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication #9519

Open
stas00 opened this issue Oct 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments

stas00 (Contributor) commented Oct 18, 2024

Your current environment

vLLM 0.6.3.post1

4 generation scenarios

There are at least 4 generation use cases in vLLM:

  1. offline generate
  2. offline chat
  3. online completion (similar to 1 but online and a totally different implementation)
  4. online chat completion (similar to 2 but online and a totally different implementation)

It's up to the user whether they want to handle the chat template themselves and use (1) or (3), or let vLLM do the chat template handling, in which case it's (2) or (4).
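
For reference, a minimal sketch of the four entry points (assuming a Llama-3 Instruct model; for the online cases, a server started with vllm serve meta-llama/Meta-Llama-3-8B-Instruct; model and prompt strings are placeholders):

    # (1) + (2): offline
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    params = SamplingParams(max_tokens=32)
    llm.generate(["Today is"], params)                            # (1) offline generate
    llm.chat([{"role": "user", "content": "Today is"}], params)   # (2) offline chat

    # (3) + (4): online, via the OpenAI client against the vLLM server
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        prompt="Today is")                                        # (3) online completion
    client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Today is"}])       # (4) online chat completion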

Summary of BOS injection/duplication

I have traced all 4 APIs wrt BOS injection and here is what I see (0.6.3.post1):

  1. offline generate - the client sorts out the chat template - BOS is forced always - so generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o <|begin_of_text|>
  2. offline chat - BOS is still always forced - so generates 2 BOS tokens if the template already has one - this is a BUG and can't be overcome by a user, other than by passing a custom chat template which has <|begin_of_text|> manually removed.
  3. client.completion - the client sorts out the chat template - BOS is forced always - so generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o <|begin_of_text|>
  4. client.chat.completions - the chat template is applied on the server side: here the BOS isn't added twice - if the template contains <|begin_of_text|> it encodes it properly - ending up with a single BOS

Expectations and bugs

So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens from the prompt, since a normal prompt is expected here (normal == pure text w/o any special tokens, as in "Today is").

(2) is clearly a bug and it's inconsistent with (4). With meta-llama/Meta-Llama-3-8B-Instruct you would see this logged with (2): {'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ... where 128000 is the BOS token.
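
The mechanism is easy to see with just the HF tokenizer (a sketch, assuming access to meta-llama/Meta-Llama-3-8B-Instruct; this mirrors what (2) effectively does - the chat template already emits <|begin_of_text|>, and encoding its output with special tokens enabled prepends a second one):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    messages = [{"role": "user", "content": "Today is"}]

    # the rendered template already starts with <|begin_of_text|> (id 128000)
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    print(tok(text, add_special_tokens=True).input_ids[:3])   # starts [128000, 128000, ...] - duplicated BOS
    print(tok(text, add_special_tokens=False).input_ids[:3])  # starts [128000, ...]         - single BOS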

(4) used to have this problem but has been fixed in #4688

Analysis process

The online API already logs the token ids it's about to feed to the model, so that was easy. The offline API doesn't, so I had to add:

diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 61c21887..71b82baf 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -808,6 +808,9 @@ class LLMEngine:
             lora_request=lora_request,
             prompt_adapter_request=prompt_adapter_request,
         )
+
+        print(preprocessed_inputs)
+
         processed_inputs = self.input_processor(preprocessed_inputs)

         # This is a bit of a hack - copy the mm_processor_kwargs that were

Request: is it possible to codify the above diff - so that the user could debug the offline scenario in the same way the online scenario currently logs:

INFO 10-18 17:53:34 logger.py:37] Received request cmpl-da6d014eca6e48acb5abb2d2cae39182-0: 
prompt: '<|begin_of_text|><|start_header_id|>..
prompt_token_ids: [128000, 128000, 128006, 9125, 128007...
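
In the meantime, the token ids can at least be inspected after the fact via the returned RequestOutput objects, which already carry the prompt token ids (a sketch; model name is a placeholder):

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    outputs = llm.chat([{"role": "user", "content": "Today is"}],
                       SamplingParams(max_tokens=8))
    print(outputs[0].prompt)            # the final prompt string
    print(outputs[0].prompt_token_ids)  # the token ids that were fed to the model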

Needed documentation

Wrt (1) and (3), I'd imagine vLLM should have clear documentation of when it forcefully adds BOS. Ideally the prompt documentation needs to say that the prompt must not include tokenizer.bos_token (e.g. <|begin_of_text|> for Llama-3-style tokenizers).
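
For example, the doc could point users at something like this defensive pattern (a sketch; strip_bos is just an illustrative helper, not an existing API):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    def strip_bos(prompt: str) -> str:
        # drop a leading BOS string so that the forced BOS doesn't get duplicated
        if tok.bos_token and prompt.startswith(tok.bos_token):
            return prompt[len(tok.bos_token):]
        return prompt

    prompt = strip_bos("<|begin_of_text|>Today is")  # -> "Today is"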

Reproduction

To reproduce I was just using your examples:

  1. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html
  2. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference_chat.html

etc., but with <|begin_of_text|> prepended to the existing prompt in the non-chat examples to test.
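
For the online completion case (3), the same check is just (a sketch, assuming a server started with vllm serve meta-llama/Meta-Llama-3-8B-Instruct; the duplication then shows up in the server's "Received request" log line):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        prompt="<|begin_of_text|>Today is",  # deliberately includes BOS
        max_tokens=8,
    )
    # the server log then shows prompt_token_ids starting with [128000, 128000, ...]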

Thank you!

@stas00 stas00 added the bug Something isn't working label Oct 18, 2024
@stas00 stas00 changed the title [Bug]: Multiple inconsistencies wrt BOS (duplication) [Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication Oct 18, 2024
njhill (Member) commented Oct 18, 2024

Thanks again @stas00 for this detailed analysis! You are clearly correct re the (2) bug.

I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) to control whether the BOS token is added. As you found, this defaults to False for the chat case (since it's included in the templates) and to True for the non-chat (completion) case. But you can override it in either case if needed.

Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it is the case for e.g. Llama and Mistral tokenizers.

stas00 (Contributor, Author) commented Oct 18, 2024

I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) which will control whether the BOS token is added

I actually was trying to test it out already, but the OpenAI client's client.chat.completions and client.completions don't have this kwarg - where/how do you add add_special_tokens? I saw that you extended the OpenAI API, but I didn't find documentation on how to use it with the OpenAI client.

njhill (Member) commented Oct 18, 2024

@stas00 you can pass any of the "non-standard" API params in a dict via the extra_body kwarg of the OpenAI client SDK. Ref here.

stas00 (Contributor, Author) commented Oct 18, 2024

Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it's the case for e.g. llama and mistral tokenizers.

Once the dust settles on this Issue, let's document the various ways BOS is handled, including your note above?

stas00 (Contributor, Author) commented Oct 18, 2024

Ah, thank you so much Nick, for some reason I was drawing a blank on finding how to use it.

Now I was able to tell vLLM not to add BOS in the online completion case with:

    completion = client.completions.create(
        model=model,
        prompt=prompt,
[...]
        extra_body=dict(add_special_tokens=False),
    )

super!

stas00 (Contributor, Author) commented Oct 18, 2024

Should the offline LLM API be consistent with the client API and offer add_special_tokens=False? I guess it'd belong in SamplingParams and not LLM.

SamplingParams already has skip_special_tokens, but that refers to the output, so adding add_special_tokens would be confusing: it would be unclear which parameter is input-related and which is output-related.

It might require disambiguation, e.g.:

  • output_skip_special_tokens
  • input_add_special_tokens
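
In the meantime, one offline workaround that I believe already works is to bypass vLLM's tokenization entirely by passing pre-tokenized input (a TokensPrompt-style dict), so the caller decides about BOS explicitly (a sketch; model name is a placeholder):

    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_id)

    # tokenize yourself and decide explicitly whether BOS is included
    ids = tok("Today is", add_special_tokens=True).input_ids  # exactly one BOS

    llm = LLM(model=model_id)
    outputs = llm.generate({"prompt_token_ids": ids}, SamplingParams(max_tokens=16))
    print(outputs[0].prompt_token_ids)  # no tokenization happens, so no extra BOS is injected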
