
Conversation

@toslunar
Contributor

@toslunar toslunar commented Dec 23, 2024

This PR updates the chat template examples to align with the latest (>=0.4.3) /v1/chat/completions behavior regarding BOS.

Some chat templates don't contain bos_token, especially those written before v0.4.3. According to the excellent investigation in #9519, chat serving does not add BOS again, while some awkwardness remains in the offline chat interface (to be fixed).
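
For illustration, here is a minimal sketch of what "the template supplies BOS itself" means in practice; the template string and tokens below are made up for the example, not one of the actual files in examples/:

```python
# Minimal sketch (assumed template string, not an actual examples/*.jinja file).
# If serving tokenizes the rendered prompt with add_special_tokens=False,
# BOS has to come from the template itself.
import jinja2

template = jinja2.Template(
    "{{ bos_token }}{% for m in messages %}"
    "<|{{ m.role }}|>{{ m.content }}<|end|>{% endfor %}"
)
prompt = template.render(
    bos_token="<s>",
    messages=[{"role": "user", "content": "Hello"}],
)
print(prompt)  # starts with <s>, so serving does not need to add BOS again
```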

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337
Member

Thanks for fixing this. Can you add some unit tests to ensure that the BOS token isn't added twice in this case?

@toslunar
Contributor Author

Added a test.

Is it OK to place it in a new directory (tests/examples_tests)? This seems to be the first time we test files under examples. I named it examples_tests so it won't be confused with data directories like tests/data and tests/prompts.
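
For reference, a minimal sketch of what such a test could check; the paths, model name, and the add_special_tokens=False assumption are illustrative, not the exact code added in this PR:

```python
# Hypothetical sketch, not the actual test file from this PR.
from pathlib import Path

import jinja2
import pytest
from transformers import AutoTokenizer

EXAMPLES_DIR = Path(__file__).parents[2] / "examples"
CHAT_TEMPLATES = sorted(EXAMPLES_DIR.glob("*.jinja"))


@pytest.mark.parametrize("template_path", CHAT_TEMPLATES, ids=lambda p: p.name)
def test_bos_not_duplicated(template_path):
    # Assumed tokenizer; any model whose tokenizer defines a BOS token works.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    messages = [{"role": "user", "content": "Hello"}]

    # Render the example template roughly the way the OpenAI-compatible server does.
    template = jinja2.Template(template_path.read_text())
    prompt = template.render(
        messages=messages,
        bos_token=tokenizer.bos_token,
        eos_token=tokenizer.eos_token,
        add_generation_prompt=True,
    )

    # Serving encodes with add_special_tokens=False, so BOS must come from the
    # template -- exactly once, at the very beginning.
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    assert ids[:1] == [tokenizer.bos_token_id]
    assert ids[1:2] != [tokenizer.bos_token_id]
```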

@DarkLight1337
Member

I would just name the directory tests/examples. Can you add the tests to CI by editing this file?

Signed-off-by: Toshiki Kataoka <[email protected]>
Signed-off-by: Toshiki Kataoka <[email protected]>
@mergify mergify bot added the ci/build label Dec 24, 2024
Signed-off-by: Toshiki Kataoka <[email protected]>
@DarkLight1337
Member

Please update the failing tests to consider the BOS token.

@toslunar
Contributor Author

Oops, I didn't expect this. I'll fix how the Jinja files are enumerated.

https://buildkite.com/vllm/fastcheck/builds/10539#0193fc47-aa1b-4af4-8c53-a27c0af8deb1

[2024-12-25T06:56:26Z] ============================= test session starts ==============================
[2024-12-25T06:56:26Z] platform linux -- Python 3.12.8, pytest-8.3.3, pluggy-1.5.0 -- /usr/bin/python3
[2024-12-25T06:56:26Z] cachedir: .pytest_cache
[2024-12-25T06:56:26Z] rootdir: /vllm-workspace
[2024-12-25T06:56:26Z] configfile: pyproject.toml
[2024-12-25T06:56:26Z] plugins: anyio-4.6.2.post1, buildkite-test-collector-0.1.9, asyncio-0.24.0, forked-1.6.0, rerunfailures-14.0, shard-0.1.2
[2024-12-25T06:56:26Z] asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 2 items                                                              
[2024-12-25T06:56:26Z] Running 2 items in this shard: tests/examples/test_jinja.py::test_bos[1-path0], tests/examples/test_jinja.py::test_bos[3-path0]
[2024-12-25T06:56:26Z]
[2024-12-25T06:56:26Z] examples/test_jinja.py::test_bos[1-path0] SKIPPED (got empty paramet...)
[2024-12-25T06:56:26Z] examples/test_jinja.py::test_bos[3-path0] SKIPPED (got empty paramet...)
[2024-12-25T06:56:26Z]
[2024-12-25T06:56:26Z] ============================== 2 skipped in 0.03s ==============================

Signed-off-by: Toshiki Kataoka <[email protected]>
@DarkLight1337
Member

I think you also need to update the entrypoints tests to consider the BOS token.

@DarkLight1337
Member

Actually now that I think more about it... wouldn't it be better to instead update LLM.chat to automatically add BOS tokens? That way our chat templates would remain consistent with HF.

Signed-off-by: Toshiki Kataoka <[email protected]>
@mergify

mergify bot commented Dec 25, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @toslunar.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 25, 2024
@mergify mergify bot removed the needs-rebase label Dec 25, 2024
Signed-off-by: Toshiki Kataoka <[email protected]>
@toslunar
Contributor Author

I suppose this PR could be incremental, resolving only the inconsistency among the example templates.

Actually now that I think more about it... wouldn't it be better to instead update LLM.chat to automatically add BOS tokens? That way our chat templates would remain consistent with HF.

Sorry, I didn't get the concrete suggestion. Do you mean making LLM.chat add a BOS token if there isn't one (for backward compatibility)?

@DarkLight1337
Member

DarkLight1337 commented Dec 26, 2024

I suppose this PR could be incremental, resolving only the inconsistency among the example templates.

Actually now that I think more about it... wouldn't it be better to instead update LLM.chat to automatically add BOS tokens? That way our chat templates would remain consistent with HF.

Sorry, I didn't get the concrete suggestion. Do you mean making LLM.chat add a BOS token if there isn't one (for backward compatibility)?

I mean that we can remove the BOS token before passing the prompt text into LLM.generate inside LLM.chat, so that we don't get duplicate BOS tokens if the chat template already has one. In other words, we update LLM.chat to behave like online inference, rather than changing the chat templates themselves.
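
A rough sketch of that idea (an assumed helper, not the actual LLM.chat implementation):

```python
# Assumed helper, not vLLM's actual code: strip a leading BOS from the rendered
# prompt so that later tokenization with add_special_tokens=True does not
# produce a duplicate.
def render_chat_prompt(tokenizer, messages, chat_template=None):
    prompt = tokenizer.apply_chat_template(
        messages,
        chat_template=chat_template,
        tokenize=False,
        add_generation_prompt=True,
    )
    bos = tokenizer.bos_token
    if bos is not None and prompt.startswith(bos):
        prompt = prompt[len(bos):]
    return prompt
```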

BREAKING CHANGE: chat_template usually needs BOS now.

Signed-off-by: Toshiki Kataoka <[email protected]>
@toslunar
Contributor Author

Let me run tests with apply_hf_chat_template(..., tokenize=True).
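
Roughly, the comparison would look like this (a sketch of the idea with an assumed model, not the PR's exact test code):

```python
from transformers import AutoTokenizer

# Assumed model; the point is only to compare the two encodings.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
messages = [{"role": "user", "content": "Hello"}]

# HF's default path: apply_chat_template(..., tokenize=True) renders the
# template and encodes it without adding special tokens on top.
hf_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# The render-then-encode path should match, as long as the template emits BOS
# itself and the later encode uses add_special_tokens=False.
text = tokenizer.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
ids = tokenizer(text, add_special_tokens=False).input_ids
assert hf_ids == ids
```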

@mergify mergify bot added the frontend label Dec 26, 2024
@toslunar
Contributor Author

toslunar commented Jan 7, 2025

It seems the current unit tests have no opinion on BOS in LLM.chat, since CI passes. I'm sorry to say that I'm not invested enough in the specification of LLM.chat to be the decision-maker here.

My concerns are what algorithm should decide whether BOS is needed for backward compatibility, and how to communicate that choice to users. I'd like to ask for comments; my rough thoughts are below.

For users who trust the HF chat behavior, I'd like to see at least a warning message whenever vLLM encodes inputs differently from transformers.PreTrainedTokenizerBase.apply_chat_template, whose default is tokenize=True. Besides the current LLM.chat being incompatible, a possible update with the BOS fallback could conditionally emit "BOS is automatically added for backward compatibility because the chat template didn't produce BOS at the beginning." In my experience, I have compared settings that place a system prompt before or after BOS, and such a fallback would change the result.

For users who trust vLLM's existing behavior, the fallback BOS is necessary, and I'd like to be notified in the release notes. I suppose it's too noisy to emit one warning message when the fallback occurs and another when it doesn't. In another case, I had to rerun a bunch of API chat generations because I naively updated the vLLM version and the change from #4688 was almost silent.
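
For concreteness, the fallback being discussed could look roughly like this (a hypothetical sketch, not an implemented vLLM API):

```python
import logging

logger = logging.getLogger(__name__)


def ensure_bos(prompt_ids, bos_token_id):
    """Hypothetical fallback: prepend BOS only if the chat template didn't emit one."""
    if bos_token_id is None or (prompt_ids and prompt_ids[0] == bos_token_id):
        return list(prompt_ids)
    logger.warning(
        "BOS is automatically added for backward compatibility because the "
        "chat template didn't produce BOS at the beginning.")
    return [bos_token_id] + list(prompt_ids)
```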

@DarkLight1337
Member

DarkLight1337 commented Jan 7, 2025

Personally, I prefer aligning with HF's behavior since that's what most people are used to. If we find that BOS would be missing with the original code, we can emit a warning and note it in the release notes as you suggested.

@robertgshaw2-neuralmagic @njhill any thoughts on this? To unify the tokenization code, I think we should actually tokenize the prompt inside Preprocessor instead of OpenAIServing, taking into account the requirements of request.max_completion_tokens/request.max_tokens, along with a flag to truncate the tokens instead of raising an error if max_model_len is exceeded.
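
A rough sketch of that direction (function name and the truncation flag are assumptions, not vLLM's actual Preprocessor API):

```python
# Hypothetical sketch: tokenize inside a preprocessor, budgeting for the
# requested completion length and optionally truncating instead of raising.
def preprocess_prompt(tokenizer, prompt, max_model_len,
                      max_tokens=None, truncate_prompt=False):
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    budget = max_model_len - (max_tokens or 0)
    if len(ids) > budget:
        if not truncate_prompt:
            raise ValueError(
                f"Prompt ({len(ids)} tokens) plus max_tokens exceeds "
                f"max_model_len ({max_model_len}).")
        ids = ids[-budget:]  # keep the most recent context
    return ids
```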

@ywang96
Member

ywang96 commented Jan 9, 2025

It seems the current unit tests have no opinion on BOS in LLM.chat, since CI passes. I'm sorry to say that I'm not invested enough in the specification of LLM.chat to be the decision-maker here.

My concerns are what algorithm should decide whether BOS is needed for backward compatibility, and how to communicate that choice to users. I'd like to ask for comments; my rough thoughts are below.

For users who trust the HF chat behavior, I'd like to see at least a warning message whenever vLLM encodes inputs differently from transformers.PreTrainedTokenizerBase.apply_chat_template, whose default is tokenize=True. Besides the current LLM.chat being incompatible, a possible update with the BOS fallback could conditionally emit "BOS is automatically added for backward compatibility because the chat template didn't produce BOS at the beginning." In my experience, I have compared settings that place a system prompt before or after BOS, and such a fallback would change the result.

For users who trust vLLM's existing behavior, the fallback BOS is necessary, and I'd like to be notified in the release notes. I suppose it's too noisy to emit one warning message when the fallback occurs and another when it doesn't. In another case, I had to rerun a bunch of API chat generations because I naively updated the vLLM version and the change from #4688 was almost silent.

@toslunar To clarify, are you talking about LLM.chat or the OpenAI chat API?

@tranhd95

Hi, I would appreciate it if there were at least an add_special_tokens parameter for LLM.chat (or SamplingParams) so we don't have to manually modify the chat templates. Adding special tokens in the template seems to be the preferred way according to HF's chat template docs.

This inconsistency between LLM.chat and vLLM's OpenAI Chat Completions API caught me off guard, and I personally think there are a bunch of people getting suboptimal results from Llamas (and other LLMs with special tokens) because of duplicated BOS tokens.

Ideally I'd stick to Andrej's suggestion to strictly disable every "convenience" shortcut that HF tokenizers can do. In this case, add_special_tokens=True.
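
To make the duplication concrete, here is a small sketch (the model is an assumption; the point is that the chat template already starts with BOS, and encoding with add_special_tokens=True prepends a second one):

```python
from transformers import AutoTokenizer

# Assumed model whose chat template already emits BOS.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
text = tok.apply_chat_template([{"role": "user", "content": "Hi"}],
                               tokenize=False)

# The rendered text already begins with the BOS string; encoding it with
# add_special_tokens=True prepends another BOS id on top of it.
ids = tok(text, add_special_tokens=True).input_ids
print(ids[:2])  # two BOS ids in a row
```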

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 28, 2025
@mergify

mergify bot commented Feb 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @toslunar.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Jul 26, 2025
@github-actions

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!


Labels

ci/build · documentation · frontend · needs-rebase · stale · tool-calling
