Description
Your current environment
0.6.3.post1
🐛 4 generation scenarios
There are at least 4 generation use cases in vLLM:
- offline generate
- offline chat
- online completion (similar to (1) but online, with a totally different implementation)
- online chat completion (similar to (2) but online, with a totally different implementation)
It's up to the user whether they want to handle the chat template themselves and use (1) or (3), or let vLLM apply the chat template, in which case it's (2) or (4).
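To make the four entry points concrete, here is a minimal sketch (model name and server URL are just placeholders; the offline and online halves would normally live in separate scripts, with `vllm serve` running for the online ones):

```python
from vllm import LLM, SamplingParams
from openai import OpenAI

params = SamplingParams(max_tokens=32)
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# (1) offline generate - the user handles any chat template themselves
llm.generate("Today is", params)

# (2) offline chat - vLLM applies the model's chat template
llm.chat([{"role": "user", "content": "Today is"}], params)

# (3)/(4) online - assumes a local `vllm serve meta-llama/Meta-Llama-3-8B-Instruct`
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# (3) online completion - the user handles any chat template themselves
client.completions.create(model="meta-llama/Meta-Llama-3-8B-Instruct",
                          prompt="Today is", max_tokens=32)

# (4) online chat completion - the server applies the chat template
client.chat.completions.create(model="meta-llama/Meta-Llama-3-8B-Instruct",
                               messages=[{"role": "user", "content": "Today is"}],
                               max_tokens=32)
```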
Summary of BOS injection/duplication
I have traced all 4 APIs with respect to BOS injection, and here is what I see (0.6.3.post1):
- offline `generate` - the client sorts out the chat template - BOS is always forced - so it generates 2 BOS tokens if the prompt already has one - the user has to send a prompt w/o `<|begin_of_text|>`
- offline `chat` - BOS is still always forced - so it generates 2 BOS tokens if the template already adds one - this is a BUG and can't be overcome by the user, other than by passing a custom chat template with `<|begin_of_text|>` manually removed
- `client.completions` - the client sorts out the chat template - BOS is always forced - so it generates 2 BOS tokens if the prompt already has one - the user has to send a prompt w/o `<|begin_of_text|>`
- `client.chat.completions` - the chat template is applied on the server side: here the BOS isn't added twice - if the template contains `<|begin_of_text|>` it is encoded properly - ending up with a single BOS
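A quick way to see the duplication mechanism outside of vLLM is a small sketch with `transformers` (assuming access to the Llama-3 tokenizer; the ids are what I'd expect from the tokenizer, not a vLLM trace):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The chat template already emits <|begin_of_text|> as text...
templated = tok.apply_chat_template(
    [{"role": "user", "content": "Today is"}],
    tokenize=False,
    add_generation_prompt=True,
)
assert templated.startswith("<|begin_of_text|>")

# ...so encoding that text again with special tokens enabled (which is what
# always forcing BOS amounts to) prepends a second BOS token.
ids = tok(templated).input_ids
print(ids[:2])  # expected: [128000, 128000]
```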
Expectations and bugs
So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens from the prompt, since a normal prompt is expected here (normal == pure text w/o any special tokens, as in "Today is").
(2) is clearly a bug and it's inconsistent with (4). With meta-llama/Meta-Llama-3-8B-Instruct you would see this logged for (2): `{'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ...`, where 128000 is the BOS token.
(4) used to have this problem but has been fixed in #4688
Analysis process
The online API already logs the token ids it's about to feed to the model, so that part was easy. The offline API doesn't, so I had to add:
```diff
diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 61c21887..71b82baf 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -808,6 +808,9 @@ class LLMEngine:
             lora_request=lora_request,
             prompt_adapter_request=prompt_adapter_request,
         )
+
+        print(preprocessed_inputs)
+
         processed_inputs = self.input_processor(preprocessed_inputs)
         # This is a bit of a hack - copy the mm_processor_kwargs that were
```
Request: is it possible to codify the above diff, so that the user can debug the offline scenario the same way the online scenario currently logs:
```
INFO 10-18 17:53:34 logger.py:37] Received request cmpl-da6d014eca6e48acb5abb2d2cae39182-0:
prompt: '<|begin_of_text|><|start_header_id|>..
prompt_token_ids: [128000, 128000, 128006, 9125, 128007...
```
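Until something like that exists, the closest stopgap I can think of (a sketch only, it approximates rather than guarantees what the engine actually feeds the model) is to encode the prompt with the engine's own tokenizer and inspect the ids by hand:

```python
from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
tok = llm.get_tokenizer()

prompt = "<|begin_of_text|>Today is"
# With the default add_special_tokens=True this mirrors the forced-BOS behaviour,
# so the printed ids should start with two 128000s.
print(tok.encode(prompt))
```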
Needed documentation
Wrt (1) and (3), I'd imagine vLLM should clearly document when it forcefully adds BOS. Ideally, the prompt documentation should state that the prompt must not include `tokenizer.bos_token` (e.g. `<|begin_of_text|>` in many tokenizers).
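While the docs are being sorted out, a tiny hypothetical helper (not part of vLLM) that such documentation could point users at for (1) and (3):

```python
def strip_leading_bos(prompt: str, tokenizer) -> str:
    """Drop a leading BOS from a raw prompt, since vLLM forces BOS anyway in (1)/(3)."""
    bos = tokenizer.bos_token  # e.g. "<|begin_of_text|>" for Llama-3
    if bos and prompt.startswith(bos):
        return prompt[len(bos):]
    return prompt
```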
Reproduction
To reproduce I was just using your examples:
- https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html
- https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference_chat.html
etc., but prepended the existing prompt with `<|begin_of_text|>` in the non-chat examples to test.
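Roughly, the non-chat repro boils down to something like this (a sketch based on the offline_inference example, with the model swapped to Llama-3 so the BOS duplication shows up via the diff above):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
outputs = llm.generate(
    ["<|begin_of_text|>Hello, my name is"],  # BOS deliberately included in the prompt
    SamplingParams(temperature=0.8, top_p=0.95),
)
for o in outputs:
    # With the print() added in the diff above, prompt_token_ids starts with two 128000s.
    print(o.prompt, "->", o.outputs[0].text)
```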
Thank you!