
[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication #9519

Description

@stas00

Your current environment

0.6.3.post1

4 generation scenarios

There are at least 4 generation use cases in vLLM:

  1. offline generate
  2. offline chat
  3. online completion (similar to 1 but online and a totally different implementation)
  4. online chat completion (similar to 2 but online and a totally different implementation)

It's up to the user whether they handle the chat template themselves and use (1) or (3), or let vLLM apply the chat template, in which case it's (2) or (4). All four paths are sketched below.
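
For reference, the four paths roughly map to the following calls (a minimal sketch; the model name, port and sampling parameters are placeholders, and the online examples assume an OpenAI-compatible vLLM server is already running on localhost:8000):

from vllm import LLM, SamplingParams
from openai import OpenAI

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
params = SamplingParams(max_tokens=32)

# (1) offline generate - the caller handles any chat template themselves
llm = LLM(model=MODEL)
llm.generate("Today is", params)

# (2) offline chat - vLLM applies the tokenizer's chat template
llm.chat([{"role": "user", "content": "Today is"}], params)

# (3) online completion - the caller handles the template, the server tokenizes
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
client.completions.create(model=MODEL, prompt="Today is", max_tokens=32)

# (4) online chat completion - the server applies the chat template
client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Today is"}],
    max_tokens=32,
)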

Summary of BOS injection/duplication

I have traced all 4 APIs with respect to BOS injection and here is what I see (0.6.3.post1):

  1. offline generate - the caller sorts out the chat template - BOS is always forced - so it produces 2 BOS tokens if the prompt already contains one - the user therefore has to send a prompt w/o <|begin_of_text|> (the tokenizer sketch after this list shows the duplication)
  2. offline chat - BOS is still always forced - so it produces 2 BOS tokens if the template already adds one - this is a BUG and can't be overcome by the user, other than by passing a custom chat template with <|begin_of_text|> manually removed
  3. client.completions - the caller sorts out the chat template - BOS is always forced - so it produces 2 BOS tokens if the prompt already contains one - the user therefore has to send a prompt w/o <|begin_of_text|>
  4. client.chat.completions - the chat template is applied server-side: here BOS isn't added twice - if the template contains <|begin_of_text|> it is encoded properly - ending up with a single BOS
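
The duplication in (1)-(3) is easy to see outside of vLLM as well, because the HF tokenizer prepends its own BOS on top of any literal one already present in the prompt text, which mirrors the forced BOS the engine adds (a minimal sketch with transformers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# prompt already carries a literal BOS, as in scenarios (1)-(3)
print(tok("<|begin_of_text|>Today is").input_ids)  # [128000, 128000, ...] - duplicated BOS
# plain prompt - exactly one BOS is added
print(tok("Today is").input_ids)                   # [128000, ...]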

Expectations and bugs

So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens from the prompt, since a normal prompt is expected here (normal == pure text w/o any special tokens, as in "Today is").
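
Such a defensive strip could look like this (just an illustration; strip_leading_bos is not an existing vLLM helper):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def strip_leading_bos(prompt: str) -> str:
    # drop a literal BOS string so that vLLM's forced BOS is the only one
    bos = tok.bos_token or ""
    return prompt[len(bos):] if bos and prompt.startswith(bos) else prompt

print(strip_leading_bos("<|begin_of_text|>Today is"))  # -> "Today is"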

(2) is clearly a bug and it's inconsistent with (4). With meta-llama/Meta-Llama-3-8B-Instruct you would see this logged with (2): {'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ... where 128000 is the BOS token.

(4) used to have this problem, but it was fixed in #4688.
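
Until (2) is fixed, the only user-side workaround I see is the custom chat template mentioned in the summary. A rough sketch, assuming LLM.chat accepts a chat_template argument in this version and that the stock template references bos_token in one of the usual ways:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)

# strip the bos_token reference from the stock template; how it appears
# varies between models, so these replacements are best-effort
patched = (tok.chat_template
           .replace("{{ bos_token }}", "")
           .replace("{{- bos_token }}", "")
           .replace("bos_token + ", ""))

llm = LLM(model=MODEL)
out = llm.chat(
    [{"role": "user", "content": "Today is"}],
    SamplingParams(max_tokens=32),
    chat_template=patched,
)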

Analysis process

The online API already logs the token ids it's about to feed to the model, so that was easy. The offline API doesn't, so I had to add:

diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 61c21887..71b82baf 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -808,6 +808,9 @@ class LLMEngine:
             lora_request=lora_request,
             prompt_adapter_request=prompt_adapter_request,
         )
+
+        print(preprocessed_inputs)
+
         processed_inputs = self.input_processor(preprocessed_inputs)

         # This is a bit of a hack - copy the mm_processor_kwargs that were

Request: is it possible to codify the above diff, so that the user can debug the offline scenario the same way the online scenario currently logs it:

INFO 10-18 17:53:34 logger.py:37] Received request cmpl-da6d014eca6e48acb5abb2d2cae39182-0: 
prompt: '<|begin_of_text|><|start_header_id|>..
prompt_token_ids: [128000, 128000, 128006, 9125, 128007...
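
A minimal sketch of what the codified version might look like, simply reusing the module-level logger in place of the print above (whether it should be gated behind a verbosity/debug flag is up to you; the exact wording is an assumption):

# vllm/engine/llm_engine.py - replacing the ad-hoc print(preprocessed_inputs)
logger.info("Preprocessed offline inputs: %s", preprocessed_inputs)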

Needed documentation

Wrt (1) and (3), I'd imagine vLLM should have clear documentation of when it forcefully adds BOS. Ideally the prompt docs would state that the prompt must not include tokenizer.bos_token (e.g. <|begin_of_text|> in Llama 3-style tokenizers).

Reproduction

To reproduce I was just using your examples:

  1. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html
  2. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference_chat.html
    etc., but with <|begin_of_text|> prepended to the existing prompt in the non-chat examples to test (a minimal repro sketch follows).
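
Concretely, the non-chat repro boils down to the following (model switched to Meta-Llama-3-8B-Instruct so the token ids match the ones quoted above):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95)

# same prompt as in the offline_inference example, with a literal BOS prepended
llm.generate("<|begin_of_text|>Hello, my name is", params)
# with the print() patch from the diff above, the logged prompt_token_ids
# start with [128000, 128000, ...] - i.e. a duplicated BOS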

Thank you!
