[Metrics] Deprecate TPOT in favor of ITL #24110
Conversation
The only case where we don't want to assert the existence of a metric is when it is deprecated and we're not showing hidden deprecated metrics. Signed-off-by: Mark McLoughlin <[email protected]>
Code Review
This pull request correctly deprecates the vllm:time_per_output_token_seconds (TPOT) metric in favor of the more accurately named vllm:inter_token_latency_seconds (ITL). The changes are consistently applied across the codebase, including metrics definitions, logging, tests, and the Grafana dashboard example. The deprecation strategy of retaining the old metric for backward compatibility while introducing the new one is sound. I've found one minor issue with the documentation of the new metric, which appears to be a copy-paste error.
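The deprecation strategy described above (keep the old metric around, but only export it when hidden deprecated metrics are shown) can be sketched in plain Python. This is a hypothetical illustration, not vLLM's actual classes; the metric names come from the PR, but `Metric`, `export`, and the `show_hidden_deprecated` flag are assumptions for the sake of the example.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    """Minimal stand-in for a Prometheus-style metric definition."""
    name: str
    documentation: str
    deprecated: bool = False
    samples: list = field(default_factory=list)

    def observe(self, value: float) -> None:
        self.samples.append(value)

def export(metrics, show_hidden_deprecated=False):
    """Return the metric names that would be exposed to scrapers.

    Deprecated metrics are hidden unless the operator opts back in,
    mirroring the 'not showing hidden deprecated metrics' case above.
    """
    return [
        m.name
        for m in metrics
        if not m.deprecated or show_hidden_deprecated
    ]

itl = Metric("vllm:inter_token_latency_seconds",
             "Histogram of inter-token latency in seconds.")
tpot = Metric("vllm:time_per_output_token_seconds",
              "DEPRECATED: use vllm:inter_token_latency_seconds instead.",
              deprecated=True)

print(export([itl, tpot]))
print(export([itl, tpot], show_hidden_deprecated=True))
```

Retaining the old series this way means existing dashboards and alerts keep working through the deprecation window, while new deployments see only the accurately named metric by default.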
As per vllm-project#24015, what we currently call TPOT should instead be called ITL, since what we are actually measuring is the time between iterations, and a single iteration can produce multiple tokens. Signed-off-by: Mark McLoughlin <[email protected]>
Force-pushed from b176439 to 09dbc43
LGTM, thanks for updating
Signed-off-by: Mark McLoughlin <[email protected]>
As per #24015, what we currently call TPOT should instead be called ITL, since what we are actually measuring is the time between iterations, and a single iteration can produce multiple tokens.
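The distinction can be shown with a small worked example. The numbers here are made up for illustration: four engine iterations of 0.05 s each, where one iteration (e.g. via speculative decoding) emits three tokens at once. "Time per output token" divides by tokens produced, while inter-token latency is the time between successive iterations, so the two diverge as soon as any iteration emits more than one token.

```python
# Hypothetical timings: seconds per engine step and tokens emitted per step.
iteration_times = [0.05, 0.05, 0.05, 0.05]
tokens_per_iteration = [1, 1, 3, 1]

total_time = sum(iteration_times)      # 0.2 s
total_tokens = sum(tokens_per_iteration)  # 6 tokens

# "TPOT" as previously computed: total time divided by tokens produced.
tpot = total_time / total_tokens

# ITL: average time between iterations, regardless of how many tokens
# each iteration produced.
itl = total_time / len(iteration_times)

print(f"TPOT: {tpot:.4f}s, ITL: {itl:.4f}s")
```

With multi-token iterations, TPOT (≈0.033 s here) understates the gap a client actually observes between iterations (0.05 s), which is what the ITL name captures.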
I'm flagging the TPOT metric as deprecated from 0.11. Even if this gets released in a 0.10.x release, I think the deprecation period should only start from when it lands in a new minor 0.N.0 release.
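The "deprecation period starts at the next minor release" rule amounts to a simple version comparison. A minimal sketch, assuming version tuples and a `deprecation_started` helper that are illustrative, not vLLM's implementation:

```python
def deprecation_started(deprecated_since, current_release):
    """True once the running release is at or past the minor release
    (0.N.0) in which the metric was marked deprecated.

    Versions are (major, minor) tuples; tuple comparison gives the
    right ordering.
    """
    return current_release >= deprecated_since

# Metric flagged as deprecated from 0.11:
print(deprecation_started((0, 11), (0, 10)))  # still on 0.10.x: not started
print(deprecation_started((0, 11), (0, 11)))  # 0.11.0 shipped: window begins
```

This keeps the clock from starting early if the change happens to ship in a 0.10.x patch release, as the comment above argues.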