Description
Your current environment
0.6.3.post1
🐛 4 generation scenarios
There are at least 4 generation use cases in vLLM:
- offline generate
- offline chat
- online completion (similar to (1) but online, with a totally different implementation)
- online chat completion (similar to (2) but online, with a totally different implementation)
It's up to the user whether they want to handle the chat template themselves and use (1) or (3), or let vLLM apply the chat template, in which case it's (2) or (4).
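To make the four entry points concrete, here is a minimal sketch (model name and server URL are just placeholders; the offline and online halves would normally live in separate scripts, with `vllm serve` running for the online ones):

```python
from vllm import LLM, SamplingParams
from openai import OpenAI

params = SamplingParams(max_tokens=32)
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# (1) offline generate - the user handles any chat template themselves
llm.generate("Today is", params)

# (2) offline chat - vLLM applies the model's chat template
llm.chat([{"role": "user", "content": "Today is"}], params)

# (3)/(4) online - assumes a local `vllm serve meta-llama/Meta-Llama-3-8B-Instruct`
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# (3) online completion - the user handles any chat template themselves
client.completions.create(model="meta-llama/Meta-Llama-3-8B-Instruct",
                          prompt="Today is", max_tokens=32)

# (4) online chat completion - the server applies the chat template
client.chat.completions.create(model="meta-llama/Meta-Llama-3-8B-Instruct",
                               messages=[{"role": "user", "content": "Today is"}],
                               max_tokens=32)
```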
Summary of BOS injection/duplication
I have traced all 4 APIs with respect to BOS injection, and here is what I see (0.6.3.post1):
- offline `generate` - the client sorts out the chat template - BOS is always forced - so it generates 2 BOS tokens if the prompt already has one - the user has to send a prompt w/o `<|begin_of_text|>`
- offline `chat` - BOS is still always forced - so it generates 2 BOS tokens if the template already adds one - this is a BUG and can't be overcome by the user, other than by passing a custom chat template with `<|begin_of_text|>` manually removed
- `client.completions` - the client sorts out the chat template - BOS is always forced - so it generates 2 BOS tokens if the prompt already has one - the user has to send a prompt w/o `<|begin_of_text|>`
- `client.chat.completions` - the chat template is applied on the server side: here the BOS isn't added twice - if the template contains `<|begin_of_text|>` it is encoded properly - ending up with a single BOS
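A quick way to see the duplication mechanism outside of vLLM is a small sketch with `transformers` (assuming access to the Llama-3 tokenizer; the ids are what I'd expect from the tokenizer, not a vLLM trace):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The chat template already emits <|begin_of_text|> as text...
templated = tok.apply_chat_template(
    [{"role": "user", "content": "Today is"}],
    tokenize=False,
    add_generation_prompt=True,
)
assert templated.startswith("<|begin_of_text|>")

# ...so encoding that text again with special tokens enabled (which is what
# always forcing BOS amounts to) prepends a second BOS token.
ids = tok(templated).input_ids
print(ids[:2])  # expected: [128000, 128000]
```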
Expectations and bugs
So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens from the prompt, since a normal prompt is expected here (normal == pure text w/o any special tokens, as in "Today is").
(2) is clearly a bug and it's inconsistent with (4). With meta-llama/Meta-Llama-3-8B-Instruct you would see this logged for (2): `{'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ...`, where 128000 is the BOS token.
(4) used to have this problem but has been fixed in #4688
Analysis process
The online API already logs the token ids it's about to feed to the model, so that part was easy. The offline API doesn't, so I had to add:
```diff
diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 61c21887..71b82baf 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -808,6 +808,9 @@ class LLMEngine:
             lora_request=lora_request,
             prompt_adapter_request=prompt_adapter_request,
         )
+
+        print(preprocessed_inputs)
+
         processed_inputs = self.input_processor(preprocessed_inputs)
         # This is a bit of a hack - copy the mm_processor_kwargs that were
```
Request: is it possible to codify the above diff, so that the user can debug the offline scenario the same way the online scenario currently logs:
```
INFO 10-18 17:53:34 logger.py:37] Received request cmpl-da6d014eca6e48acb5abb2d2cae39182-0:
prompt: '<|begin_of_text|><|start_header_id|>..
prompt_token_ids: [128000, 128000, 128006, 9125, 128007...
```
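Until something like that exists, the closest stopgap I can think of (a sketch only, it approximates rather than guarantees what the engine actually feeds the model) is to encode the prompt with the engine's own tokenizer and inspect the ids by hand:

```python
from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
tok = llm.get_tokenizer()

prompt = "<|begin_of_text|>Today is"
# With the default add_special_tokens=True this mirrors the forced-BOS behaviour,
# so the printed ids should start with two 128000s.
print(tok.encode(prompt))
```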
Needed documentation
Wrt (1) and (3), I'd imagine vLLM should clearly document when it forcefully adds BOS. Ideally, the prompt documentation should state that the prompt must not include `tokenizer.bos_token` (e.g. `<|begin_of_text|>` in many tokenizers).
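While the docs are being sorted out, a tiny hypothetical helper (not part of vLLM) that such documentation could point users at for (1) and (3):

```python
def strip_leading_bos(prompt: str, tokenizer) -> str:
    """Drop a leading BOS from a raw prompt, since vLLM forces BOS anyway in (1)/(3)."""
    bos = tokenizer.bos_token  # e.g. "<|begin_of_text|>" for Llama-3
    if bos and prompt.startswith(bos):
        return prompt[len(bos):]
    return prompt
```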
Reproduction
To reproduce I was just using your examples:
- https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html
- https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference_chat.html
etc., but prepended the existing prompt with `<|begin_of_text|>` in the non-chat examples to test.
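Roughly, the non-chat repro boils down to something like this (a sketch based on the offline_inference example, with the model swapped to Llama-3 so the BOS duplication shows up via the diff above):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
outputs = llm.generate(
    ["<|begin_of_text|>Hello, my name is"],  # BOS deliberately included in the prompt
    SamplingParams(temperature=0.8, top_p=0.95),
)
for o in outputs:
    # With the print() added in the diff above, prompt_token_ids starts with two 128000s.
    print(o.prompt, "->", o.outputs[0].text)
```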
Thank you!