
[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication #9519

Open
stas00 opened this issue Oct 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments

stas00 (Contributor) commented Oct 18, 2024

Your current environment

vLLM 0.6.3.post1

4 generation scenarios

There are at least 4 generation use cases in vLLM:

  1. offline generate
  2. offline chat
  3. online completion (similar to 1 but online and a totally different implementation)
  4. online chat completion (similar to 2 but online and a totally different implementation)

It's up to the user whether they want to handle the chat template themselves and use (1) or (3), or let vLLM do the chat template handling, in which case it's (2) or (4).
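
For reference, a minimal sketch of the four entry points (assuming a Llama-3 Instruct model; for the online cases, a server started with vllm serve meta-llama/Meta-Llama-3-8B-Instruct; model and prompt strings are placeholders):

    # (1) + (2): offline
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    params = SamplingParams(max_tokens=32)
    llm.generate(["Today is"], params)                            # (1) offline generate
    llm.chat([{"role": "user", "content": "Today is"}], params)   # (2) offline chat

    # (3) + (4): online, via the OpenAI client against the vLLM server
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        prompt="Today is")                                        # (3) online completion
    client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Today is"}])       # (4) online chat completion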

Summary of BOS injection/duplication

I have traced all 4 APIs wrt BOS injection and here is what I see (0.6.3.post1):

  1. offline generate - the client sorts out the chat template - BOS is forced always - so generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o <|begin_of_text|>
  2. offline chat - BOS is still always forced - so generates 2 BOS tokens if the template already has one - this is a BUG and can't be overcome by a user, other than by passing a custom chat template which has <|begin_of_text|> manually removed.
  3. client.completion - the client sorts out the chat template - BOS is forced always - so generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o <|begin_of_text|>
  4. client.chat.completions - the chat template is applied on the server side: here the BOS isn't added twice - if the template contains <|begin_of_text|> it encodes it properly - ending up with a single BOS

Expectations and bugs

So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens from the prompt, since a normal prompt is expected here (normal == pure text w/o any special tokens, as in "Today is").

(2) is clearly a bug and it's inconsistent with (4). With meta-llama/Meta-Llama-3-8B-Instruct you would see this logged with (2): {'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ... where 128000 is the BOS token.
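
The mechanism is easy to see with just the HF tokenizer (a sketch, assuming access to meta-llama/Meta-Llama-3-8B-Instruct; this mirrors what (2) effectively does - the chat template already emits <|begin_of_text|>, and encoding its output with special tokens enabled prepends a second one):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    messages = [{"role": "user", "content": "Today is"}]

    # the rendered template already starts with <|begin_of_text|> (id 128000)
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    print(tok(text, add_special_tokens=True).input_ids[:3])   # starts [128000, 128000, ...] - duplicated BOS
    print(tok(text, add_special_tokens=False).input_ids[:3])  # starts [128000, ...]         - single BOS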

(4) used to have this problem but has been fixed in #4688

Analysis process

The online API already logs the token ids it's about to feed to the model, so that was easy. The offline API doesn't, so I had to add:

diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 61c21887..71b82baf 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -808,6 +808,9 @@ class LLMEngine:
             lora_request=lora_request,
             prompt_adapter_request=prompt_adapter_request,
         )
+
+        print(preprocessed_inputs)
+
         processed_inputs = self.input_processor(preprocessed_inputs)

         # This is a bit of a hack - copy the mm_processor_kwargs that were

Request: is it possible to codify the above diff - so that the user could debug the offline scenario in the same way the online scenario currently logs:

INFO 10-18 17:53:34 logger.py:37] Received request cmpl-da6d014eca6e48acb5abb2d2cae39182-0: 
prompt: '<|begin_of_text|><|start_header_id|>..
prompt_token_ids: [128000, 128000, 128006, 9125, 128007...
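
In the meantime, the token ids can at least be inspected after the fact via the returned RequestOutput objects, which already carry the prompt token ids (a sketch; model name is a placeholder):

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    outputs = llm.chat([{"role": "user", "content": "Today is"}],
                       SamplingParams(max_tokens=8))
    print(outputs[0].prompt)            # the final prompt string
    print(outputs[0].prompt_token_ids)  # the token ids that were fed to the model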

Needed documentation

Wrt (1) and (3), I'd imagine vLLM should have clear documentation of when it forcefully adds BOS. Ideally the prompt documentation needs to say that the prompt must not include tokenizer.bos_token (e.g. <|begin_of_text|> for Llama-3-style tokenizers).
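
For example, the doc could point users at something like this defensive pattern (a sketch; strip_bos is just an illustrative helper, not an existing API):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    def strip_bos(prompt: str) -> str:
        # drop a leading BOS string so that the forced BOS doesn't get duplicated
        if tok.bos_token and prompt.startswith(tok.bos_token):
            return prompt[len(tok.bos_token):]
        return prompt

    prompt = strip_bos("<|begin_of_text|>Today is")  # -> "Today is"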

Reproduction

To reproduce I was just using your examples:

  1. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html
  2. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference_chat.html

etc., but with <|begin_of_text|> prepended to the existing prompt in the non-chat examples to test.
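
For the online completion case (3), the same check is just (a sketch, assuming a server started with vllm serve meta-llama/Meta-Llama-3-8B-Instruct; the duplication then shows up in the server's "Received request" log line):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        prompt="<|begin_of_text|>Today is",  # deliberately includes BOS
        max_tokens=8,
    )
    # the server log then shows prompt_token_ids starting with [128000, 128000, ...]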

Thank you!

@stas00 stas00 added the bug Something isn't working label Oct 18, 2024
@stas00 stas00 changed the title [Bug]: Multiple inconsistencies wrt BOS (duplication) [Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication Oct 18, 2024
njhill (Member) commented Oct 18, 2024

Thanks again @stas00 for this detailed analysis! You are clearly correct re the (2) bug.

I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) to control whether the BOS token is added. As you found, this defaults to False for the chat case (since it's included in the templates) and to True for the non-chat (completion) case. But you can override it in either case if needed.

Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it is the case for e.g. Llama and Mistral tokenizers.

stas00 (Contributor, Author) commented Oct 18, 2024

I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) which will control whether the BOS token is added

I actually was trying to test it out already, but the OpenAI client's client.chat.completions and client.completions don't have this kwarg - where/how do you add add_special_tokens? I saw that you extended the OpenAI API, but I didn't find documentation on how to use it with the OpenAI client.

njhill (Member) commented Oct 18, 2024

@stas00 you can pass any of the "non-standard" API params in a dict via the extra_body kwarg of the OpenAI client SDK. Ref here.

stas00 (Contributor, Author) commented Oct 18, 2024

Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it's the case for e.g. llama and mistral tokenizers.

Once the dust settles on this Issue, let's document the various ways BOS is handled, including your note above?

stas00 (Contributor, Author) commented Oct 18, 2024

Ah, thank you so much Nick, for some reason I was drawing a blank on finding how to use it.

Now I was able to tell vLLM not to add BOS in the online completion case with:

    completion = client.completions.create(
        model=model,
        prompt=prompt,
[...]
        extra_body=dict(add_special_tokens=False),
    )

super!

stas00 (Contributor, Author) commented Oct 18, 2024

Should the offline LLM API be consistent with the client API and offer add_special_tokens=False? I guess it'd belong in SamplingParams and not LLM.

SamplingParams already has skip_special_tokens, but that refers to the output, so adding add_special_tokens would be confusing: it would be unclear which parameter is input-related and which is output-related.

It might require disambiguation, e.g.:

  • output_skip_special_tokens
  • input_add_special_tokens
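
In the meantime, one offline workaround that I believe already works is to bypass vLLM's tokenization entirely by passing pre-tokenized input (a TokensPrompt-style dict), so the caller decides about BOS explicitly (a sketch; model name is a placeholder):

    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_id)

    # tokenize yourself and decide explicitly whether BOS is included
    ids = tok("Today is", add_special_tokens=True).input_ids  # exactly one BOS

    llm = LLM(model=model_id)
    outputs = llm.generate({"prompt_token_ids": ids}, SamplingParams(max_tokens=16))
    print(outputs[0].prompt_token_ids)  # no tokenization happens, so no extra BOS is injected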
