Closed
233 commits
6d7840c
add fused fp8 bmm
k50112113 Jul 25, 2025
92e134a
add envs
k50112113 Jul 26, 2025
9433b84
api fix for upstream compatibility
divakar-amd Aug 8, 2025
245f2eb
improve env switch. reformat lint
divakar-amd Aug 12, 2025
6fd99d2
nit: formatting and direct aiter fxn call
divakar-amd Aug 12, 2025
c219220
fit fp8 dtype selection
divakar-amd Aug 12, 2025
8017d7d
rm kernel warmup
divakar-amd Aug 14, 2025
3eed848
[Kernel][AMD] Avoid D2H copy and cumsum kernel (#22683)
mxz297 Aug 12, 2025
380030f
[CI][Nixl] Check kv cache layout during handshake (#22745)
NickLucche Aug 12, 2025
d453c1c
Fix torch version check for SM100 mxfp4 (#22535)
zifeitong Aug 12, 2025
e56ec0d
[Misc] parametrize 'dtype' in test_flash_mla (#22641)
RUTHLESS-BOT Aug 12, 2025
5572b49
[Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues (#22606)
frankwang28 Aug 12, 2025
194a9fa
[Docs] Hide the navigation and toc sidebars on home page (#22749)
hmellor Aug 13, 2025
b4bcf2b
Fix Transformers backend tensor parallel for multimodal models (#22673)
hmellor Aug 13, 2025
3eca03b
[Model] Decouple glm4v (#22751)
jeejeelee Aug 13, 2025
e8b1986
Add hardware plugins to installation doc (#22732)
mgoin Aug 13, 2025
a48314c
[V0 Deprecation] Remove multi-step scheduling (#22138)
WoosukKwon Aug 13, 2025
1187e50
[Misc] Remove tests/multi_step/__init__.py (#22778)
WoosukKwon Aug 13, 2025
753f655
[V0 Deprecation] Remove args for multi-step scheduling (#22779)
WoosukKwon Aug 13, 2025
19891dc
Fix cuda illegal mem access with Llama4 TP8 + rms_norm custom op (#22…
nvpohanh Aug 13, 2025
61419e9
[Bugfix] Fix default enable for CUTLASS MLA on SM100 (#22738)
mgoin Aug 13, 2025
61b6648
Force TRTLLM attention for gpt-oss on SM100 (#22678)
mgoin Aug 13, 2025
f776e11
Remove unneeded ROCm platform import when using CUDA (#22765)
mgoin Aug 13, 2025
50bd033
[Bug] Fix Unexpected Keyword Argument 'w1_bias' (#22757)
yewentao256 Aug 13, 2025
cbb5508
[Perf] Support topk softmax fused kernel for broader num_experts (#22…
shixianc Aug 13, 2025
dd5c246
[gpt-oss] upgrade gpt-oss to v0.0.3 and add version check (#22768)
heheda12345 Aug 13, 2025
f362240
[Model] Add option to run Step3VisionEncoder in DP (#22697)
zzh142857 Aug 13, 2025
ee22b08
[Model] Add missing prefix to glm4_1v (#22716)
zRzRzRzRzRzRzR Aug 13, 2025
66c5b95
[Bugfix] Fix Nemotron VL image processing (#22739)
ducviet00 Aug 13, 2025
96ddae4
[Doc] Add max_lora_rank configuration guide (#22782)
chi2liu Aug 13, 2025
24fddcf
[V1] Add tree drafting tests for eagle spec decoding (#22705)
TheEpicDolphin Aug 13, 2025
4acdadb
[Platform] Custom ops support for FusedMoe (#22509)
wangxiyuan Aug 13, 2025
3821bba
[Frontend] Add chunked processing to handle long inputs in embedding …
x22x22 Aug 13, 2025
1ddd5e7
[FEATURE] support custom vllm tuned config path for fused moe triton …
vermouth1992 Aug 13, 2025
00f1ba7
[Nixl][CI] Fix tests (#22806)
NickLucche Aug 13, 2025
657eac2
[Bugfix][mamba] Fix type annotation of Mamba2Metadata (#22787)
heheda12345 Aug 13, 2025
dea3291
Remove unnecessary CUDA sync of qwen image and video preprocess (#22792)
cyyever Aug 13, 2025
0b36a38
Fix GGUF loader for Qwen3 MoE. (#22785)
Gh0u1L5 Aug 13, 2025
4112a09
[Frontend] Multithreaded async multimodal load_bytes (#22710)
milesial Aug 13, 2025
8dd4c5a
[Core] Use individual MM items in P0/P1 cache and model runner (#22570)
DarkLight1337 Aug 13, 2025
550ede2
[Misc] clear and separate error messages for input too long and input…
Aug 13, 2025
ba5e94a
[Bugfix] Fix MiniCPMV Image input inference failed (#22813)
jio-H Aug 13, 2025
e2b838c
[CI/Build] Update VLM common tests (#22841)
DarkLight1337 Aug 13, 2025
3f3dd9e
[CI] Fix `tests/v1/e2e/test_kv_sharing_fast_prefill.py` import on tes…
NickLucche Aug 13, 2025
5a1b862
[CI/Build] Fix param mismatch in `test_eagle_correctness` (#22847)
DarkLight1337 Aug 13, 2025
caec847
[CI/Build] Skip gpt_big model test because of broken HF model (#22848)
Isotr0py Aug 13, 2025
1e28f21
[ROCm][Bugfix] Fix compilation error in topk softmax fused kernel (#2…
kliuae Aug 13, 2025
8e2680d
Move checklist in PR template (#22852)
ProExpertProg Aug 13, 2025
72327ed
[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP…
Jialin Aug 13, 2025
61088be
[CI/Build] Increase pooling tolerance to pass CI (#22844)
DarkLight1337 Aug 13, 2025
a2755cb
[CI][Entrypoints]: add filter to generation to filter out invalid too…
wseaton Aug 14, 2025
db3d9a8
[CI] Fix `tests/distributed/test_ca_buffer_sharing.py` (#22849)
ilmarkov Aug 14, 2025
062888c
[CI] remove flaky v0 test (#22864)
robertgshaw2-redhat Aug 14, 2025
929002e
vLLM Benchmark suite improvement (#22119)
louie-tsai Aug 14, 2025
84229b4
[Bugfix] Fix `PixtralHFImagePixelInputs` dynamic shape check (#22827)
Isotr0py Aug 14, 2025
188855d
[BugFix] Threadsafe close async zmq sockets (#22877)
njhill Aug 14, 2025
6283540
Remove Phi 4 Flash configuration workaround (#22723)
hmellor Aug 14, 2025
01e9a4b
[Bugfix] Add reset prefix cache for online serving (#22726)
iAmir97 Aug 14, 2025
a3328c2
[Doc] fix dead link (#22898)
dtrifiro Aug 14, 2025
39ade99
[CI] Re-enable transcriptions `test_long_audio_request` (#22890)
NickLucche Aug 14, 2025
6e0178e
[Perf] Dont create unnecessary pooling params (#22876)
LucasWilkinson Aug 14, 2025
5c73a22
[Model] Modify the gate implementation of glm4_moe (#22832)
jeejeelee Aug 14, 2025
d2f8a6a
[Bugfix] Replace custom Encoding class with BatchEncoding in MistralT…
ZJY0516 Aug 14, 2025
25a3132
[Bugfix] Fix parsing of `--disable-mm-preprocessor-cache` (#22909)
DarkLight1337 Aug 14, 2025
30056c0
[CI] [Hybrid] Bump min transformers version for Bamba and Jamba (#22…
tdoublep Aug 14, 2025
0b484cb
[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel …
jinzhen-lin Aug 14, 2025
6631e0b
docs: update fastsafetensors usage instructions (#22891)
NirLevy98 Aug 14, 2025
d3eedd7
[CI] Temporarily disable flaky test (#22930)
LucasWilkinson Aug 14, 2025
9d16e61
[Kernel] Add nvfp4 gemm flashinfer backends (#22346)
nvjullin Aug 14, 2025
a3354b4
[Quantization]: Support compressed-tensors mixed-precision model load…
dsikka Aug 14, 2025
796195b
[Core] Return final response for aborted requests from `AsyncLLM.gene…
njhill Aug 14, 2025
328f344
[BugFix] Fix initial DP request load imbalance (#22910)
njhill Aug 14, 2025
0a8da7e
[Bugfix] use flash attn on sm90 (#22933)
zyongye Aug 14, 2025
1c51e0b
[Kernel] Add cuda kernel for gpt_oss activation (#22538)
jeejeelee Aug 15, 2025
ab5727a
Revert "[Kernel] Add cuda kernel for gpt_oss activation" (#22948)
simon-mo Aug 15, 2025
613b55b
[BugFix][KVConn] Fix use of `get_required_kvcache_layout` (#22734)
njhill Aug 15, 2025
822efc4
[BugFix] Fix port lookup in internal DP LB tests (#22252)
njhill Aug 15, 2025
0955fd8
[CI Perf] Prune tests in `tests/kernels/quantization/` (#22942)
mgoin Aug 15, 2025
77266f1
[CI Perf] Prune tests in `tests/kernels/moe/` (#22939)
mgoin Aug 15, 2025
bf1e1b7
[CI Perf] Prune tests in `tests/kernels/attention/` (#22936)
mgoin Aug 15, 2025
f3ed2f8
refactor: Change scaling factors calculation for flashinfer FusedMoE …
amirkl94 Aug 15, 2025
91a8efc
[Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughp…
yewentao256 Aug 15, 2025
6ee5a05
[Mamba] - refactor: Renamed mamba_attn to mamba2_attn (#22818)
Josephasafg Aug 15, 2025
3e6dfbb
Revert "[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Modul…
tjtanaa Aug 15, 2025
c20e948
[P/D]Provide bucket algorithm rate limiter for proxy_server (#22643)
frankie-ys Aug 15, 2025
5c1e8ce
[CI] Pooling models mteb test uses enforce_eager (#22878)
noooop Aug 15, 2025
676f6b7
[V1] - Split Prefill and Decode for Mamba1 models (#22653)
amirai21 Aug 15, 2025
bceccfa
[Bugfix] Unquote file uri before reading image (#22912)
sayandipdutta Aug 15, 2025
5e58ff1
[Bugfix] fix cuda 12.6 and 11.8 build (#22952)
jinzhen-lin Aug 15, 2025
c6ff233
[MM] Allow skipping memory profiling for multimodal models. (#22950)
Aug 15, 2025
68ce862
Improve multimodal hasher performance for re-used Image prompts (#22825)
p88h Aug 15, 2025
badff24
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba…
tdoublep Aug 15, 2025
3825e75
[Misc] Ignore ep_kernels_workspace (#22807)
jeejeelee Aug 15, 2025
e36ea57
[CI] Remove duplicated docs build from buildkite (#22924)
hmellor Aug 15, 2025
e1a5c03
[Frontend] Expose do_log_stats interval to env (#22905)
Csrayz Aug 15, 2025
dbef6b7
[Core] Allow full cudagraph with separate attention routines and orth…
fhl2000 Aug 15, 2025
00771a0
[V0 Deprecation] Remove advance_step (#22969)
WoosukKwon Aug 15, 2025
9fb7223
[BugFix] Skip the Q component for QKVParallelLinear in the case of QK…
sstamenk Aug 15, 2025
c4671db
[FIXBUG] Correctly Apply Grammar Bitmask in Mixed Batches (#22896)
JartX Aug 15, 2025
23d917b
[Benchmarks] Include image data when ShareGPT4V dataset is used. (#22…
huachenheli Aug 15, 2025
f2535c9
[Structured Output] Make the output of structured output example more…
shen-shanshan Aug 15, 2025
6a8c9b1
[Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remo…
bnellnm Aug 15, 2025
26dc380
[Model] Granite-4 support loading quantized checkpoint (#22925)
cyang49 Aug 15, 2025
1d15d00
[Log] Debug Once for Randomizing dummy data for DP Rank (#22860)
yewentao256 Aug 15, 2025
0a09281
[Core] direct indexing on self.block_table_np in compute_slot_mapping…
linzebing Aug 15, 2025
b2e05e3
[Bugfix] Added more env vars to hash (#22449)
nvjullin Aug 15, 2025
a55302a
Use regex in convert-results-json-to-markdown.py (#22989)
mgoin Aug 15, 2025
8f4c570
[CI] Speed up Whisper tests by reusing server (#22859)
mgoin Aug 15, 2025
244f50a
[Fix] enable swap_ab for pplx problem size computation (#22991)
shixianc Aug 15, 2025
ad60819
Add PrefixRepetitionRandomDataset to `vllm bench serve` datasets (#20…
eicherseiji Aug 15, 2025
a5ef63d
minor: zero workspace buffer init for flashinfer trtllm-gen attn (#22…
yyihuang Aug 15, 2025
d07caa6
[Attention] FA3 Attention Sinks Perf Boost (#22478)
LucasWilkinson Aug 15, 2025
1a8dba4
[BugFix] Fix regression caused by mamba state dtype PR (#22998)
tdoublep Aug 15, 2025
0bb7e0b
ci: Add CUDA + arm64 release builds (#21201)
seemethere Aug 15, 2025
c17354f
[Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask …
rishitdholakia13 Aug 15, 2025
c587b1b
[BugFix] Handle case where async utility call is cancelled (#22996)
njhill Aug 15, 2025
49a48d0
[v1] Move block_hashes from KVCacheManager to Request.block_hashes (#…
orozery Aug 15, 2025
6267ed1
Support multiple attention groups for KV sharing (#22672)
sarckk Aug 15, 2025
706e57c
[BugFix] Make `run_once` thread-safe (#22978)
oraluben Aug 15, 2025
fac3fcb
[Misc] Support passing multiple request ids at once to `AsyncLLM.abor…
njhill Aug 16, 2025
e6968c3
[Kernel] Simplify `get_kv_cache_layout` and cache `use_trtllm_attenti…
NickLucche Aug 16, 2025
d0dd871
[Bugfix] Fix DeepSeek MTP (#22934)
benchislett Aug 16, 2025
e161da7
[Frontend] Avoid list copies in `serving_chat.py` (#22947)
njhill Aug 16, 2025
b385101
[V1] support min_tokens for detokener (#22014)
calvin0327 Aug 16, 2025
2181fd8
[misc] nsys profile output kernel classifier and visualizer (#22971)
gracehonv Aug 16, 2025
a94dddd
[XPU]avoid circular import during XPU init (#23017)
jikunshang Aug 16, 2025
f4e73d0
[Build] Env var to disable sccache (#22968)
LucasWilkinson Aug 16, 2025
71d2a2b
[BugFix] Add support for loading prompt embeds tensors serialized on …
qthequartermasterman Aug 16, 2025
252a427
[Misc] Add --save-dir option to benchmark_moe (#23020)
jeejeelee Aug 16, 2025
b29cac8
[Multimodal] Update Tensor schema test to cover arbitrary shape mm in…
Isotr0py Aug 16, 2025
7c7c0fd
[Core] Make cudagraph check cuda platform only (#23005)
yaochengji Aug 16, 2025
0fc05d2
[CI][Bugfix] Skip Ovis2 generation test because of broken remote code…
Isotr0py Aug 16, 2025
1a2838c
Add docs for PrefixRepetitionDataset + enable usage with `vllm bench …
eicherseiji Aug 16, 2025
91ba5e6
[Refactor] Allow optional MultiModalKwargsItem in IPC (#23022)
DarkLight1337 Aug 16, 2025
a7ec9e0
[New Model]mBART model (#22883)
princepride Aug 16, 2025
37b5459
Fix handling of `max_num_batched_tokens` for pooling tasks (#23004)
maxdebayser Aug 16, 2025
0addb6c
[Frontend] Added support for HermesToolParser for models without spec…
minpeter Aug 16, 2025
111f5ee
[Bugfix gpt-oss] Fix float32 convert for flashinfer sink support (#23…
mgoin Aug 16, 2025
98e9357
[Flaky CI] Increase timeout tolerance for test_mp_crash_detection+tes…
mgoin Aug 16, 2025
417a255
[Kernel/Quant] Remove AQLM (#22943)
mgoin Aug 16, 2025
8ff4603
[V1] Logits processors extensibility (#19912)
afeldman-nm Aug 16, 2025
7a3b649
[Bugfix] fix qwen3 moe fp8 accuracy issue (#23031)
jinzhen-lin Aug 17, 2025
023ec4d
[UX] Separate marlin moe config logic from triton moe (#23006)
mgoin Aug 17, 2025
ef4e620
[Refactor] Defer tensor data construction in MultiModalKwargs (#23030)
DarkLight1337 Aug 17, 2025
f11a7fc
[Misc] method name typo fix (#23042)
andyxning Aug 17, 2025
2f80578
[Kernel] Add cuda kernel for gpt_oss activation (#22951)
jeejeelee Aug 17, 2025
7466a82
[Bugfix] should use stack instead of concat (#22972)
947132885 Aug 17, 2025
1b10e1f
[Misc] fix typo in the multimodal doc (#23051)
KevinZeng08 Aug 17, 2025
dd065c4
[BugFix] Fix for IMA in FA3 varlen combine (#22967)
LucasWilkinson Aug 17, 2025
bf9e37e
[Misc] Remove dead return (#23061)
WoosukKwon Aug 17, 2025
d969012
[Misc] Convert use_structured_output property into constant (#23060)
WoosukKwon Aug 17, 2025
24c5ed4
[XPU] fix xpu to set cudagraph batch sizes (#23044)
calvin0327 Aug 17, 2025
0f94eec
fix: gptq marlin weight loading failure (#23066)
simon-mo Aug 17, 2025
d2add76
[Misc] Minor code cleanup for _get_prompt_logprobs_dict (#23064)
WoosukKwon Aug 18, 2025
de83bef
[Misc] enhance static type hint (#23059)
andyxning Aug 18, 2025
6acd301
[Bugfix] fix Qwen2.5-Omni processor output mapping (#23058)
DoubleVII Aug 18, 2025
5812d25
[Bugfix][CI] Machete kernels: deterministic ordering for more cache h…
andylolu2 Aug 18, 2025
46d4061
[Misc] refactor function name (#23029)
andyxning Aug 18, 2025
96ebb74
[Misc] Fix backward compatibility from #23030 (#23070)
ywang96 Aug 18, 2025
dfbf77b
[XPU] Fix compile size for xpu (#23069)
jikunshang Aug 18, 2025
b2ca8af
[XPU][CI]add xpu env vars in CI scripts (#22946)
jikunshang Aug 18, 2025
09871a7
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwarg…
DarkLight1337 Aug 18, 2025
a661738
[Bugfix] fix IntermediateTensors equal method (#23027)
andyxning Aug 18, 2025
a775386
[Refactor] Get prompt updates earlier (#23097)
DarkLight1337 Aug 18, 2025
dd903b7
chore: remove unnecessary patch_padding_side for the chatglm model (#…
carlory Aug 18, 2025
c2eef8f
[Bugfix] Support compile for Transformers multimodal (#23095)
zucchini-nlp Aug 18, 2025
b80575a
[CI Bugfix] Pin `openai<1.100` to unblock CI (#23118)
mgoin Aug 18, 2025
3f93171
fix: OpenAI SDK compat (ResponseTextConfig) (#23126)
h-brenoskuk Aug 18, 2025
9c4f6b3
Use Blackwell FlashInfer MXFP4 MoE by default if available (#23008)
mgoin Aug 18, 2025
2c29786
Install tpu_info==0.4.0 to fix core dump for TPU (#23135)
xiangxu-google Aug 18, 2025
d4356b2
[Misc] Minor refactoring for prepare_inputs (#23116)
WoosukKwon Aug 18, 2025
f4aa6ed
[Spec Decode] Make `propose_draft_token_ids` non-blocking for lower T…
WoosukKwon Aug 19, 2025
ba75cb1
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-atten…
tdoublep Aug 19, 2025
3600fb9
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Cachi…
robertgshaw2-redhat Aug 19, 2025
531a03c
[V0 Deprecation] Remove V0 FlashInfer attention backend (#22776)
WoosukKwon Aug 19, 2025
e11cc33
chore: disable enable_cpp_symbolic_shape_guards (#23048)
xiszishu Aug 19, 2025
b18be81
[TPU] make ptxla not imported when using tpu_commons (#23081)
yaochengji Aug 19, 2025
7bdd05d
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes (#22725)
nikheal2 Aug 19, 2025
ceed16d
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema …
bbeckca Aug 19, 2025
4bb9728
[Log] Warning Once for Cutlass MLA (#23137)
yewentao256 Aug 19, 2025
7ab9673
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Think…
ZJY0516 Aug 19, 2025
fadc4ff
[misc] split engine_model into json file for nsys profile tool (#23117)
gracehonv Aug 19, 2025
838152c
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_t…
pliops-daniels Aug 19, 2025
34fbb33
Fix GLM-4.5V-FP8 numerical issue (#22949)
zixi-qi Aug 19, 2025
fa0e55e
[Misc] Add request_id into benchmark_serve.py (#23065)
hustxiayang Aug 19, 2025
04450e0
[Bugfix] Fix broken Minimax-01-VL model (#22116)
Isotr0py Aug 19, 2025
66bfe25
[bug fix] Fix llama4 spec decoding (#22691)
zixi-qi Aug 19, 2025
c0e72a4
[Misc] Avoid accessing req_ids inside a loop (#23159)
WoosukKwon Aug 19, 2025
67c9cf8
[Doc] use power of 2 (#23172)
Tialo Aug 19, 2025
db4ed3e
[Misc] Fix seq_lens for graph capture (#23175)
WoosukKwon Aug 19, 2025
a1ac4ba
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21…
elvischenv Aug 19, 2025
c90e5ec
[Model] Add multi_label_classification support (#23173)
noooop Aug 19, 2025
33fe9b6
[Model] support new model ovis2.5 (#23084)
myselvess Aug 19, 2025
c316a3f
[Bugfix] Fix benchmark_moe.py (#23177)
jeejeelee Aug 19, 2025
fd502c6
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL (#22742)
tjtanaa Aug 19, 2025
1e5610f
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBl…
yiz-liu Aug 19, 2025
427704a
Add return_token_ids parameter to OpenAI API endpoints (#22587)
ultmaster Aug 19, 2025
19f6bb7
Migrate LlavaOnevisionMultiInputs to TensorSchema (#21844)
bbeckca Aug 19, 2025
f857b82
[CI/Build] Update transformers to v4.55.2 (#23093)
Isotr0py Aug 19, 2025
8a0342f
[Misc] Fix the benchmark's README and improve the error messages for …
tanruixiang Aug 19, 2025
88c6bc0
[Frontend] Add `/collective_rpc` API endpoint (#23075)
22quinn Aug 19, 2025
e80cc3f
[Misc] Enable yapf for FlashInfer backend (#23193)
WoosukKwon Aug 19, 2025
dbaa4b4
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 a…
bnellnm Aug 19, 2025
fc9dfb2
fix: use cache_salt for gpt-oss (#23186)
dr75 Aug 19, 2025
62ae27d
[Misc] Minor refactoring for FlashInfer backend (#23147)
WoosukKwon Aug 19, 2025
6013d5d
[CI/Build] Add support for Python 3.13 (#13164)
mgoin Aug 19, 2025
0e5d352
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend (#22357)
amirkl94 Aug 19, 2025
e874de0
[CI/Build] Replace lm-eval gsm8k tests with faster implementation (#2…
mgoin Aug 19, 2025
661df59
[BugFix] fix CUTLASS MLA full cudagraph (#23200)
LucasWilkinson Aug 19, 2025
3c95c62
[Benchmarks] Add video inputs to ShareGPTDataset. (#23199)
huachenheli Aug 19, 2025
1288ca2
[Quantization] Bump Compressed Tensors Version (#23202)
kylesayrs Aug 20, 2025
42b5326
[Core] Optimize scheduler request removal for single completions (#21…
chi2liu Aug 20, 2025
876a74a
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce…
mgoin Aug 20, 2025
190d370
[Core] Add torch profiler CPU traces for AsyncLLM. (#21794)
huachenheli Aug 20, 2025
ea1c0aa
[Doc] Update V1 status of various pooling models (#23189)
DarkLight1337 Aug 20, 2025
e85d1a2
[Attention] Optimize make_local_attention_virtual_batches for Flash A…
linzebing Aug 20, 2025
622f95f
Fix a performance comparison issue in Benchmark Suite (#23047)
louie-tsai Aug 20, 2025
4bd618e
chore: support pytorch format in lora (#22790)
KilJaeeun Aug 20, 2025
3ccea69
[CI/Build] Also check DP in benchmarks throughput script (#23038)
zhewenl Aug 20, 2025
dc1351a
[CI/Build] Sync multimodal tests (#23181)
DarkLight1337 Aug 20, 2025
69e35c0
[BugFix] Fix stuck stats/metrics after requests are aborted (#22995)
njhill Aug 20, 2025
507232a
fix cuda graph (#22721)
fsx950223 Aug 20, 2025
431b380
[Model] use autoWeightsLoader for gptoss (#22446)
calvin0327 Aug 20, 2025
8a60206
Fix missing quotes (#23242)
wzshiming Aug 20, 2025
f5d48f8
[Model] Support deepseek with eagle (#21086)
xyang16 Aug 20, 2025
2f31e73
[Bugfix] Ensure correctness of Cohere2Vision processing (#23245)
DarkLight1337 Aug 20, 2025
d9d4f40
Update to flashinfer-python==0.2.12 and disable AOT compile for non-r…
mgoin Aug 20, 2025
86b2c91
[Model][V1] Support Ernie MTP (#22169)
xyxinyang Aug 20, 2025
5cbe9c2
[Model] Improve olmo and olmo2 (#23228)
jeejeelee Aug 20, 2025
11b0bcd
[Fix] fix offline env use local mode path (#22526)
lengrongfu Aug 20, 2025
922b71b
[Bugfix] Ensure correctness of HCXVision processing (#23254)
DarkLight1337 Aug 20, 2025
5e59970
add envs
k50112113 Jul 26, 2025
64c11d6
improve env switch. reformat lint
divakar-amd Aug 12, 2025
32 changes: 12 additions & 20 deletions .buildkite/nightly-benchmarks/README.md
@@ -7,7 +7,7 @@ This directory contains two sets of benchmarks for vLLM.
- Performance benchmark: benchmarks vLLM's performance under various workloads, for **developers** to gain clarity on whether their PR improves/degrades vLLM's performance
- Nightly benchmark: compares vLLM's performance against alternatives (TGI, TRT-LLM, and LMDeploy), for **the public** to know when to choose vLLM.

See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
See [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.

## Performance benchmark quick overview

@@ -138,28 +138,20 @@ The raw benchmarking results (in the format of JSON files) are in the `Artifacts

The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, the benchmark script generates results under the `benchmark/results` folder, along with `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
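The single-file TP/PP split can be sketched as below — a minimal pandas sketch of the grouping idea, not the script itself; the rows and column names (`TP Size`, `PP Size`) are assumed to match the benchmark JSON schema:

```python
import pandas as pd

# Hypothetical rows mimicking benchmark_results.json entries.
rows = [
    {"Test name": "serving_llama8B_tp1", "TP Size": 1, "PP Size": 1, "Output Tput (tok/s)": 142.6},
    {"Test name": "serving_llama8B_tp2", "TP Size": 2, "PP Size": 1, "Output Tput (tok/s)": 241.6},
]
df = pd.DataFrame(rows)

# One group per (TP, PP) configuration; the real script writes each group
# to its own <output_root>/tp{TP}_pp{PP}/benchmark_results.json file.
for (tp, pp), group in df.groupby(["TP Size", "PP Size"]):
    print(f"tp{tp}_pp{pp}: {len(group)} row(s)")
```

Each resulting folder can then be fed back to the script as a separate `-f` argument.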

Here is an example using the script to compare result_a and result_b without detailed test names.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json --ignore_test_name`

| | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|----------------------------------------|----------------------------------------|----------|
| 0 | 142.633982 | 156.526018 | 1.097396 |
| 1 | 241.620334 | 294.018783 | 1.216863 |
| 2 | 218.298905 | 262.664916 | 1.203235 |
| 3 | 242.743860 | 299.816190 | 1.235113 |

Here is an example using the script to compare result_a and result_b with detail test name.
Here is an example using the script to compare result_a and result_b with Model, Dataset Name, input/output length, max concurrency, and QPS.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
| 1 | serving_llama8B_tp1_sharegpt_qps_16 | 241.620334 | serving_llama8B_tp1_sharegpt_qps_16 | 294.018783 | 1.216863 |
| 2 | serving_llama8B_tp1_sharegpt_qps_4 | 218.298905 | serving_llama8B_tp1_sharegpt_qps_4 | 262.664916 | 1.203235 |
| 3 | serving_llama8B_tp1_sharegpt_qps_inf | 242.743860 | serving_llama8B_tp1_sharegpt_qps_inf | 299.816190 | 1.235113 |
| 4 | serving_llama8B_tp2_random_1024_128_qps_1 | 96.613390 | serving_llama8B_tp4_random_1024_128_qps_1 | 108.404853 | 1.122048 |
| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 |

A comparison diagram will be generated below the table.
Here is an example comparing 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3:
<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />

## Nightly test details

291 changes: 266 additions & 25 deletions .buildkite/nightly-benchmarks/scripts/compare-json-results.py
@@ -1,33 +1,202 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import json
import os
from importlib import util

import pandas as pd

plotly_found = util.find_spec("plotly.express") is not None


def compare_data_columns(
files, name_column, data_column, drop_column, ignore_test_name=False
files, name_column, data_column, info_cols, drop_column, debug=False
):
print("\ncompare_data_column: " + data_column)
"""
Align concatenation by keys derived from info_cols instead of row order.
- Pick one canonical key list: subset of info_cols present in ALL files.
- For each file: set index to those keys, aggregate duplicates
- (mean for metric, first for names).
- Concat along axis=1 (indexes align), then reset_index so callers can
- group by columns.
- If --debug, add a <file_label>_name column per file.
"""
print("\ncompare_data_column:", data_column)

frames = []
raw_data_cols = []
compare_frames = []

# 1) choose a canonical key list from info_cols that exists in ALL files
cols_per_file = []
for f in files:
try:
df_tmp = pd.read_json(f, orient="records")
except Exception as err:
raise ValueError(f"Failed to read {f}") from err
cols_per_file.append(set(df_tmp.columns))

key_cols = [c for c in info_cols if all(c in cset for cset in cols_per_file)]
if not key_cols:
# soft fallback: use any info_cols present in the first file
key_cols = [c for c in info_cols if c in list(cols_per_file[0])]
if not key_cols:
raise ValueError(
"No common key columns found from info_cols across the input files."
)

# 2) build a single "meta" block (keys as columns) once, aligned by the key index
meta_added = False

for file in files:
data_df = pd.read_json(file)
serving_df = data_df.dropna(subset=[drop_column], ignore_index=True)
if ignore_test_name is False:
serving_df = serving_df.rename(columns={name_column: file + "_name"})
frames.append(serving_df[file + "_name"])
serving_df = serving_df.rename(columns={data_column: file})
frames.append(serving_df[file])
compare_frames.append(serving_df[file])
df = pd.read_json(file, orient="records")

# Keep rows that actually have the compared metric (same as original behavior)
if drop_column in df.columns:
df = df.dropna(subset=[drop_column], ignore_index=True)

# Stabilize numeric key columns (harmless if missing)
for c in (
"Input Len",
"Output Len",
"TP Size",
"PP Size",
"# of max concurrency.",
"qps",
):
if c in df.columns:
df[c] = pd.to_numeric(df[c], errors="coerce")

# Ensure all key columns exist
for c in key_cols:
if c not in df.columns:
df[c] = pd.NA

# Set index = key_cols and aggregate duplicates → unique MultiIndex
df_idx = df.set_index(key_cols, drop=False)

# meta (key columns), unique per key
meta = df_idx[key_cols]
if not meta.index.is_unique:
meta = meta.groupby(level=key_cols, dropna=False).first()

# metric series for this file, aggregated to one row per key
file_label = "/".join(file.split("/")[:-1]) or os.path.basename(file)
s = df_idx[data_column]
if not s.index.is_unique:
s = s.groupby(level=key_cols, dropna=False).mean()
s.name = file_label # column label like original

# add meta once (from first file) so keys are the leftmost columns
if not meta_added:
frames.append(meta)
meta_added = True

# (NEW) debug: aligned test-name column per file
if debug and name_column in df_idx.columns:
name_s = df_idx[name_column]
if not name_s.index.is_unique:
name_s = name_s.groupby(level=key_cols, dropna=False).first()
name_s.name = f"{file_label}_name"
frames.append(name_s)

frames.append(s)
raw_data_cols.append(file_label)
compare_frames.append(s)

# Generalize ratio: for any file N>=2, add ratio (fileN / file1)
if len(compare_frames) >= 2:
# Compare numbers among two files
ratio_df = compare_frames[1] / compare_frames[0]
frames.append(ratio_df)
compare_frames.pop(1)
base = compare_frames[0]
current = compare_frames[-1]
ratio = current / base
ratio = ratio.mask(base == 0) # avoid inf when baseline is 0
ratio.name = f"Ratio 1 vs {len(compare_frames)}"
frames.append(ratio)

# 4) concat on columns with aligned MultiIndex;
# then reset_index to return keys as columns
concat_df = pd.concat(frames, axis=1)
return concat_df
concat_df = concat_df.reset_index(drop=True).reset_index()
if "index" in concat_df.columns:
concat_df = concat_df.drop(columns=["index"])

# Ensure key/info columns appear first (in your info_cols order)
front = [c for c in info_cols if c in concat_df.columns]
rest = [c for c in concat_df.columns if c not in front]
concat_df = concat_df[front + rest]

print(raw_data_cols)
return concat_df, raw_data_cols


def split_json_by_tp_pp(
input_file: str = "benchmark_results.json", output_root: str = "."
) -> list[str]:
"""
Split a benchmark JSON into separate folders by (TP Size, PP Size).

Creates: <output_root>/tp{TP}_pp{PP}/benchmark_results.json
Returns: list of file paths written.
"""
# Load JSON data into DataFrame
with open(input_file, encoding="utf-8") as f:
data = json.load(f)

# If the JSON is a dict with a list under common keys, use that list
if isinstance(data, dict):
for key in ("results", "serving_results", "benchmarks", "data"):
if isinstance(data.get(key), list):
data = data[key]
break

df = pd.DataFrame(data)

# Keep only "serving" tests
name_col = next(
(c for c in ["Test name", "test_name", "Test Name"] if c in df.columns), None
)
if name_col:
df = df[
df[name_col].astype(str).str.contains(r"serving", case=False, na=False)
].copy()

# Handle alias column names
rename_map = {
"tp_size": "TP Size",
"tensor_parallel_size": "TP Size",
"pp_size": "PP Size",
"pipeline_parallel_size": "PP Size",
}
df.rename(
columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True
)

# Ensure TP/PP columns exist (default to 1 if missing)
if "TP Size" not in df.columns:
df["TP Size"] = 1
if "PP Size" not in df.columns:
df["PP Size"] = 1

# make sure TP/PP are numeric ints with no NaN
df["TP Size"] = (
pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int)
)
df["PP Size"] = (
pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int)
)

# Split into separate folders
saved_paths: list[str] = []
for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}")
os.makedirs(folder_name, exist_ok=True)
filepath = os.path.join(folder_name, "benchmark_results.json")
group_df.to_json(filepath, orient="records", indent=2, force_ascii=False)
print(f"Saved: {filepath}")
saved_paths.append(filepath)

return saved_paths


if __name__ == "__main__":
@@ -36,31 +205,103 @@ def compare_data_columns(
"-f", "--file", action="append", type=str, help="input file name"
)
parser.add_argument(
"--ignore_test_name", action="store_true", help="ignore_test_name or not"
"--debug", action="store_true", help="show all information for debugging"
)
parser.add_argument(
"--plot",
action=argparse.BooleanOptionalAction,
default=True,
help="plot perf diagrams or not --no-plot --plot",
)
parser.add_argument(
"-x",
"--xaxis",
type=str,
default="# of max concurrency.",
help="column name to use as X Axis in comparision graph",
)
args = parser.parse_args()
files = args.file
print("comparing : " + ", ".join(files))

drop_column = "P99"
name_column = "Test name"
info_cols = [
"Model",
"Dataset Name",
"Input Len",
"Output Len",
"TP Size",
"PP Size",
"# of max concurrency.",
"qps",
]
data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median TPOT (ms)"]
html_msgs_for_data_cols = [
"Compare Output Tokens /n",
"Median TTFT /n",
"Median TPOT /n",
]
ignore_test_name = args.ignore_test_name

if len(args.file) == 1:
files = split_json_by_tp_pp(args.file[0], output_root="splits")
info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")]
else:
files = args.file
print("comparing : " + ", ".join(files))
debug = args.debug
plot = args.plot
# For Plot feature, assign y axis from one of info_cols
y_axis_index = info_cols.index(args.xaxis) if args.xaxis in info_cols else 6
with open("perf_comparison.html", "w") as text_file:
for i in range(len(data_cols_to_compare)):
output_df = compare_data_columns(
output_df, raw_data_cols = compare_data_columns(
files,
name_column,
data_cols_to_compare[i],
info_cols,
drop_column,
ignore_test_name=ignore_test_name,
debug=debug,
)
print(output_df)
html = output_df.to_html()
text_file.write(html_msgs_for_data_cols[i])
text_file.write(html)

# For Plot feature, insert y axis from one of info_cols
raw_data_cols.insert(0, info_cols[y_axis_index])

filtered_info_cols = info_cols[:-2]
existing_group_cols = [
c for c in filtered_info_cols if c in output_df.columns
]
if not existing_group_cols:
raise ValueError(
f"No valid group-by columns "
f"Expected subset: {filtered_info_cols}, "
f"but DataFrame has: {list(output_df.columns)}"
)
output_df_sorted = output_df.sort_values(by=existing_group_cols)
output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
for name, group in output_groups:
html = group.to_html()
text_file.write(html_msgs_for_data_cols[i])
text_file.write(html)

if plot and plotly_found:
import plotly.express as px

df = group[raw_data_cols]
df_sorted = df.sort_values(by=info_cols[y_axis_index])
# Melt DataFrame for plotting
df_melted = df_sorted.melt(
id_vars=info_cols[y_axis_index],
var_name="Configuration",
value_name=data_cols_to_compare[i],
)
title = data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
# Create Plotly line chart
fig = px.line(
df_melted,
x=info_cols[y_axis_index],
y=data_cols_to_compare[i],
color="Configuration",
title=title,
markers=True,
)
# Export to HTML
text_file.write(fig.to_html(full_html=True, include_plotlyjs="cdn"))