[V1] Add num_cached_tokens stats for request output #17519
Conversation
Signed-off-by: simon-mo <[email protected]>
…okens Signed-off-by: simon-mo <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: simon-mo <[email protected]>
One caveat is that
Signed-off-by: simon-mo <[email protected]>
btw why do we want to have this feature?
self._all_token_ids: list[int] = self.prompt_token_ids.copy()
self.spec_token_ids: list[int] = []
self.num_computed_tokens = 0
self.num_cached_tokens = 0
I think this is confusing.
@simon-mo Please add a comment. I think people will be confused between num_computed_tokens and num_cached_tokens otw.
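To make the distinction concrete, here is a minimal sketch of how the two counters could be documented, assuming the semantics discussed in this thread (the comment wording is illustrative, not the merged code):

```python
class Request:
    """Sketch of the relevant Request attributes (simplified)."""

    def __init__(self, prompt_token_ids: list[int]) -> None:
        self.prompt_token_ids = prompt_token_ids
        self._all_token_ids: list[int] = self.prompt_token_ids.copy()
        self.spec_token_ids: list[int] = []

        # Tokens whose KV-cache entries have been computed so far for this
        # request; advances as the scheduler runs prefill and decode steps.
        self.num_computed_tokens = 0

        # Prompt tokens that were served from the prefix cache when the
        # request was first scheduled; set once and reported in the output.
        self.num_cached_tokens = 0
```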
I think this exposes implementation details to the API, which is not recommended unless we have a clear use case.
This needs to be piped to the API as part of the protocol here.
@simon-mo Oh I see, I didn't know that the prompt caching api had this.
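For reference, the OpenAI chat completions usage schema exposes this as prompt_tokens_details.cached_tokens. A minimal sketch of what piping the count into an OpenAI-style usage block could look like (the pydantic models below are illustrative stand-ins, not vLLM's actual protocol classes):

```python
from pydantic import BaseModel


class PromptTokensDetails(BaseModel):
    # Number of prompt tokens served from the prefix cache.
    cached_tokens: int = 0


class Usage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    prompt_tokens_details: PromptTokensDetails | None = None


# Example payload: 448 of the 512 prompt tokens were cache hits.
usage = Usage(
    prompt_tokens=512,
    completion_tokens=32,
    total_tokens=544,
    prompt_tokens_details=PromptTokensDetails(cached_tokens=448),
)
print(usage.model_dump_json(indent=2))
```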
This pull request has merge conflicts that must be resolved before it can be merged.
@simon-mo Also, I think we can initialize
I didn't notice this fix. I also submitted a PR to address this issue. #18192 😅
Closing as superseded by #18149
V1 never supported this field in the request output, so output.num_cached_tokens from LLM.generate is always None. This PR adds support for it.
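For illustration, a minimal offline example of how a user would observe the field once it is populated (model name and prompt are arbitrary; prefix caching must be enabled for the second call to report a cache hit):

```python
from vllm import LLM, SamplingParams

# Prefix caching must be on for num_cached_tokens to be non-trivial.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(max_tokens=16)

prompt = "The quick brown fox jumps over the lazy dog. " * 8

# The first call warms the prefix cache; the second call reuses it, so
# num_cached_tokens should now be an integer instead of the old None.
for _ in range(2):
    (output,) = llm.generate([prompt], params)
    print("num_cached_tokens:", output.num_cached_tokens)
```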