Merged
226 commits
f4b308f
Add gfx950 to the attention archs
jpvillam-amd Apr 3, 2025
e201e58
Linter
jpvillam-amd Apr 10, 2025
ae144d6
custom all-reduce, gfx950
seungrokj Apr 24, 2025
0bd7f8f
Bump Transformers to 4.51.3 (#17116)
hmellor Apr 25, 2025
423e9f1
Use Transformers helper `get_text_config()` instead of checking for `…
hmellor Apr 25, 2025
df5c879
[doc] update wrong hf model links (#17184)
reidliu41 Apr 25, 2025
9d98ab5
[Misc] Inline Molmo requirements (#17190)
DarkLight1337 Apr 25, 2025
a5450f1
[Security] Use safe serialization and fix zmq setup for mooncake pipe…
russellb Apr 25, 2025
48cb210
[V1] Move usage stats to worker and start logging TPU hardware (#16211)
dyli-google Apr 25, 2025
43faa04
[Bugfix] Fix hybrid model tests (#17182)
DarkLight1337 Apr 25, 2025
65e262b
Fix Python packaging edge cases (#17159)
tiran Apr 25, 2025
7011645
[BugFix][Frontend] Fix `LLM.chat()` tokenization (#16081)
njhill Apr 25, 2025
a0e619e
[V1][Spec Decode] EAGLE-3 Support (#16937)
benchislett Apr 25, 2025
c53e073
[Misc] Refine ray_serve_deepseek example (#17204)
ruisearch42 Apr 25, 2025
8de2901
[Bugfix] gemma[2,3] interleaved attention when sliding window is disa…
heheda12345 Apr 26, 2025
68af5f6
[AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it i…
rasmith Apr 26, 2025
5e83a72
[v1] [P/D] Adding LMCache KV connector for v1 (#16625)
ApostaC Apr 26, 2025
a6e72e1
[Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env (#17142)
jamesjwu Apr 26, 2025
c8e5be3
[MISC][AMD] Add unused annotation to rocm kernel file (#17097)
houseroad Apr 26, 2025
537d5ee
[doc] add Anything LLM integration (#17216)
reidliu41 Apr 26, 2025
1cf0719
[Minor][Spec Decode] Add use_eagle to SpeculativeConfig (#17213)
WoosukKwon Apr 26, 2025
7bd0c77
[Doc] Minor fix for the vLLM TPU setup page (#17206)
yarongmu-google Apr 26, 2025
b278911
[Minor][Models] Fix Return Types of Llama & Eagle (#17220)
WoosukKwon Apr 26, 2025
9e96f56
Allocate kv_cache with stride order (#16605)
wenscarl Apr 26, 2025
54271bb
[ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. (#17011)
charlifu Apr 26, 2025
53e8cf5
[V1][Metrics] Allow V1 AsyncLLM to use custom logger (#14661)
liuzijing2014 Apr 26, 2025
b07bf83
[BugFix] Avoid race conditions in zero-copy tensor transmission (#17203)
njhill Apr 26, 2025
513f074
[CI/test] Fix Eagle Correctness Test (#17209)
WoosukKwon Apr 26, 2025
df6f3ce
[Core] Remove prompt string from engine core data structures (#17214)
njhill Apr 26, 2025
8c1c926
[Bugfix] Fix missing int type for `-n` in multi-image example (#17223)
Isotr0py Apr 26, 2025
909fdaf
[Bugfix] Fix standard models tests (#17217)
DarkLight1337 Apr 26, 2025
c48334d
[Hardware][Intel-Gaudi] Update hpu-extension and update bucketing sys…
adobrzyn Apr 26, 2025
f8acd01
[V1] Add `structural_tag` support using xgrammar (#17085)
russellb Apr 26, 2025
dc2ceca
[BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set (#…
andyxning Apr 26, 2025
e782e0a
[Chore] added stubs for `vllm_flash_attn` during development mode (#1…
aarnphm Apr 26, 2025
52b4f4a
[Docs] Update structured output doc for V1 (#17135)
russellb Apr 26, 2025
10fd1d7
[Bugfix] fix error due to an uninitialized tokenizer when using `skip…
junstar92 Apr 26, 2025
4d17e20
Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACH…
houseroad Apr 26, 2025
fd11a32
[MISC] rename interval to max_recent_requests (#14285)
andyxning Apr 26, 2025
de7eb10
[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation (#16878)
imkero Apr 26, 2025
43eea29
[Minor] Fix lint error in main branch (#17233)
WoosukKwon Apr 26, 2025
3642c59
[CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh (#16271)
reidliu41 Apr 26, 2025
9869453
Update test_flash_attn.py (#17102)
ShuaibinLi Apr 26, 2025
8e4b351
[Kernel][Triton][FP8] Adding fp8 and variable length sequence support…
rasmith Apr 27, 2025
93a126f
[Misc] Make cached tokenizer pickle-compatible (#17048)
DarkLight1337 Apr 27, 2025
4283a28
[Bugfix] Fix QWen2 VL multimodal mapping (#17240)
jeejeelee Apr 27, 2025
838ceda
[Bugfix] Get a specific type of layer from forward context (#17222)
heheda12345 Apr 27, 2025
30215ca
[MISC] Use string annotation types for class definitions (#17244)
jianzs Apr 27, 2025
18445ed
[Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32,…
sfc-gh-zhwang Apr 27, 2025
756848e
[Bugfix] Fix Lora Name Parsing (#17196)
alex-jw-brooks Apr 27, 2025
ed7a29d
[NVIDIA] Support Cutlass MLA for Blackwell GPUs (#16032)
kaixih Apr 27, 2025
690fe01
[Feature] support sequence parallelism using compilation pass (#16155)
cascade812 Apr 27, 2025
d92879b
[doc] Add feature status legend (#17257)
reidliu41 Apr 27, 2025
4213475
[Metrics] Fix minor inconsistencies in bucket progression (#17262)
DarkLight1337 Apr 27, 2025
20e489e
[V1][Spec Decode] Make eagle compatible with prefix caching. (#17137)
LiuXiaoxuanPKU Apr 27, 2025
d8bccde
[BugFix] Fix vllm_flash_attn install issues (#17267)
LucasWilkinson Apr 28, 2025
d1aeea7
[Bugfix] Fix missing ARG in Dockerfile for arm64 platforms (#17261)
lkm-schulz Apr 28, 2025
c12df53
[Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c…
Ther-LF Apr 28, 2025
cb3f2d8
[Bugfix] Fix Mistral3 spatial merge error (#17270)
mgoin Apr 28, 2025
9053d0b
[Doc] Fix wrong github link in LMCache examples (#17274)
KuntaiDu Apr 28, 2025
f211331
[Doc] small fix (#17277)
reidliu41 Apr 28, 2025
8262a3e
[Misc] Validate `stop_token_ids` contents (#17268)
njhill Apr 28, 2025
7fcc422
[Minor][Models] Pass partial_rotary_factor parameter to rope (#17266)
Eviannn Apr 28, 2025
aec9674
[Core] Remove legacy input mapper/processor from V0 (#15686)
DarkLight1337 Apr 28, 2025
fa93cd9
[Model] Add Granite Speech Support (#16246)
alex-jw-brooks Apr 28, 2025
72c5b97
Update tpu_worker.py 's typo (#17288)
idouba Apr 28, 2025
fb1c933
Add missing class docstring for `PromptAdapterConfig` (#17302)
hmellor Apr 28, 2025
344e193
[Bugfix] Add missing `get_language_model` to new MLLMs (#17300)
DarkLight1337 Apr 28, 2025
3ad986c
[doc] update wrong model id (#17287)
reidliu41 Apr 28, 2025
889ebb2
[Misc] Minor typo/grammar in `platforms/interface.py` (#17307)
NickLucche Apr 28, 2025
8b464d9
[Misc] Clean up Qwen2.5-Omni code (#17301)
DarkLight1337 Apr 28, 2025
72dfe4c
[Docs] Add a security guide (#17230)
russellb Apr 28, 2025
f948869
Improve conversion from dataclass configs to argparse arguments (#17303)
hmellor Apr 28, 2025
b6dd32a
Make name of `compressed-tensors` quant method consistent across vLLM…
hmellor Apr 28, 2025
c7941cc
Explicitly explain quant method override ordering and ensure all over…
hmellor Apr 28, 2025
a0304dc
[Security] Don't bind tcp zmq socket to all interfaces (#17197)
russellb Apr 28, 2025
328b04d
Merge branch 'main' into jpvillam/fa_gfx950
jpvillam-amd Apr 28, 2025
2c89cd9
[Chore] cleanup license indicators in light of SPDX (#17259)
aarnphm Apr 28, 2025
cc5befb
[BugFix] Fix cascade attention - RuntimeError: scheduler_metadata mus…
LucasWilkinson Apr 28, 2025
ed24620
[Bugfix] Fix moe weight losing all extra attrs after `process_weights…
charlifu Apr 28, 2025
dcbac4c
[Model] Qwen3 Dense FP8 Compat Fixes (#17318)
simon-mo Apr 28, 2025
550b072
Update rocm.py
jpvillam-amd Apr 28, 2025
ad806ba
Linter
jpvillam-amd Apr 28, 2025
dc6c46b
lint
gshtras Apr 28, 2025
6e74fd4
Support loading transformers models with named parameters (#16868)
Apr 28, 2025
8fc88d6
[Model] Add tuned triton fused_moe configs for Qwen3Moe (#17328)
mgoin Apr 28, 2025
cfe4532
[Benchmark] Add single turn MTBench to Serving Bench (#17202)
ekagra-ranjan Apr 28, 2025
506475d
[Optim] Compute multimodal hash only once per item (#17314)
DarkLight1337 Apr 29, 2025
86d9fc2
implement Structural Tag with Guidance backend (#17333)
mmoskal Apr 29, 2025
e136000
[V1][Spec Decode] Make Eagle model arch config driven (#17323)
ekagra-ranjan Apr 29, 2025
b4ac4fa
[model] make llama4 compatible with pure dense layers (#17315)
luccafong Apr 29, 2025
d6da8a8
[Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_…
r-barnes Apr 29, 2025
165cb56
Ignore `'<string>'` filepath (#17330)
zou3519 Apr 29, 2025
17eb306
[Bugfix] Add contiguous call inside rope kernel wrapper (#17091)
timzsu Apr 29, 2025
96e06e3
[Misc] Add a Jinja template to support Mistral3 function calling (#17…
chaunceyjiang Apr 29, 2025
cde384c
[Model] support MiniMax-VL-01 model (#16328)
qscqesze Apr 29, 2025
ebb3930
[Misc] Move config fields to MultiModalConfig (#17343)
DarkLight1337 Apr 29, 2025
bdb2cdd
[Misc]Use a platform independent interface to obtain the device attri…
jiangpeng36 Apr 29, 2025
193e78e
[Fix] Documentation spacing in compilation config help text (#17342)
Zerohertz Apr 29, 2025
4464109
[Build][Bugfix] Restrict setuptools version to <80 (#17320)
gshtras Apr 29, 2025
97cc872
[Model] Ignore rotary embed load for Cohere model (#17319)
ekagra-ranjan Apr 29, 2025
4a5e131
Update docs requirements (#17379)
hmellor Apr 29, 2025
890f104
[Doc] Fix QWen3MOE info (#17381)
jeejeelee Apr 29, 2025
00ee37e
[Bugfix] Clean up MiniMax-VL and fix processing (#17354)
DarkLight1337 Apr 29, 2025
40896bd
`pre-commit autoupdate` (#17380)
hmellor Apr 29, 2025
88ad9ec
[Frontend] Support `chat_template_kwargs` in `LLM.chat` (#17356)
DarkLight1337 Apr 29, 2025
900edfa
Transformers backend tweaks (#17365)
hmellor Apr 29, 2025
0ed27ef
Fix: Spelling of inference (#17387)
a2q1p Apr 29, 2025
2ef5d10
Improve literal dataclass field conversion to argparse argument (#17391)
hmellor Apr 29, 2025
24e6ad3
[V1] Remove num_input_tokens from attn_metadata (#17193)
heheda12345 Apr 29, 2025
a39203f
[Bugfix] add qwen3 reasoning-parser fix content is None when disable …
mofanke Apr 29, 2025
d3cf61b
fix gemma3 results all zero (#17364)
mayuyuace Apr 29, 2025
06ffc7e
[Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build (#17289)
tywuAMD Apr 29, 2025
608968b
Enabling multi-group kernel tests. (#17115)
Alexei-V-Ivanov-AMD Apr 29, 2025
56d64fb
[Docs] Propose a deprecation policy for the project (#17063)
russellb Apr 29, 2025
0c1c788
[Doc][Typo] Fixing label in new model requests link in overview.md (#…
casinca Apr 29, 2025
792595b
[TPU][V1][CI] Replace `python3 setup.py develop` with standard `pip i…
NickLucche Apr 29, 2025
b37685a
[CI] Uses Python 3.11 for TPU (#17359)
aarnphm Apr 29, 2025
08e15de
[CI/Build] Add retry mechanism for add-apt-repository (#17107)
reidliu41 Apr 29, 2025
2fa2a50
[Bugfix] Fix Minicpm-O-int4 GPTQ model inference (#17397)
Isotr0py Apr 29, 2025
a6977db
Simplify (and fix) passing of guided decoding backend options (#17008)
hmellor Apr 29, 2025
0350809
Remove Falcon3 2x7B from CI (#17404)
hmellor Apr 29, 2025
e8766c6
Merge remote-tracking branch 'upstream/main'
gshtras Apr 29, 2025
c9c1b59
Fix: Python package installation for opentelmetry (#17049)
dilipgb Apr 29, 2025
70788bd
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE (#17211)
luyuzhe111 Apr 29, 2025
7489ec0
Remove Bamba 9B from CI (#17407)
hmellor Apr 29, 2025
34120f5
[V1][Feature] Enable Speculative Decoding with Structured Outputs (#1…
benchislett Apr 30, 2025
4055130
[release] Always git fetch all to get latest tag on TPU release (#17322)
khluu Apr 30, 2025
1c2bc7e
Truncation control for embedding models (#14776)
gmarinho2 Apr 30, 2025
2c4f59a
Update PyTorch to 2.7.0 (#16859)
huydhn Apr 30, 2025
13698db
Improve configs - `ModelConfig` (#17130)
hmellor Apr 30, 2025
d1f569b
Fix call to `logger.info_once` (#17416)
hmellor Apr 30, 2025
88fcf00
Fix some speculative decode tests with tl.dot (#17371)
huydhn Apr 30, 2025
a44c4f1
Support LoRA for Mistral3 (#17428)
mgoin Apr 30, 2025
6ed9f60
[Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue (#17298)
jikunshang Apr 30, 2025
ed6cfb9
[Hardware][Intel GPU] Upgrade to torch 2.7 (#17444)
jikunshang Apr 30, 2025
be633fb
[Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_…
chaunceyjiang Apr 30, 2025
54072f3
[MODEL ADDITION] Ovis2 Model Addition (#15826)
mlinmg Apr 30, 2025
ece5a8b
Make the _apply_rotary_emb compatible with dynamo (#17435)
houseroad Apr 30, 2025
1534d38
[Misc] Remove deprecated files (#17447)
chaunceyjiang Apr 30, 2025
d803786
[V1][Bugfix]: vllm v1 verison metric num_gpu_blocks is None (#15755)
lengrongfu Apr 30, 2025
a7d5b01
[TPU][V1][CI] Update regression test baseline for v6 CI (#17064)
NickLucche Apr 30, 2025
77073c7
[Core] Prevent side-channel attacks via cache salting (#17045)
dr75 Apr 30, 2025
0be6d05
[V1][Metrics] add support for kv event publishing (#16750)
alec-flowers Apr 30, 2025
2990cee
[Feature] The Qwen3 reasoning parser supports guided decoding (#17466)
chaunceyjiang Apr 30, 2025
39317cf
[Docs] Add command for running mypy tests from CI (#17475)
russellb Apr 30, 2025
da4e768
[Fix] Support passing args to logger (#17425)
aarnphm Apr 30, 2025
739e03b
[Bugfix] Fixed mistral tokenizer path when pointing to file (#17457)
psav Apr 30, 2025
947f2f5
[V1] Allow turning off pickle fallback in vllm.v1.serial_utils (#17427)
russellb Apr 30, 2025
0b7e701
[Docs] Update optimization.md doc (#17482)
mgoin Apr 30, 2025
d586ddc
[BugFix] Fix authorization of openai_transcription_client.py (#17321)
hhy3 Apr 30, 2025
584f5fb
[Bugfix][ROCm] Restrict ray version due to a breaking release (#17480)
gshtras Apr 30, 2025
2ac74d0
[doc] add install tips (#17373)
reidliu41 Apr 30, 2025
42d9a2c
doc: fix bug report Github template formatting (#17486)
davidxia Apr 30, 2025
8e45f88
Merge remote-tracking branch 'upstream/main' into upstream_merge_2025…
gshtras Apr 30, 2025
81ecf42
[v1][Spec Decode] Make sliding window compatible with eagle prefix ca…
heheda12345 Apr 30, 2025
285ac51
Merge remote-tracking branch 'upstream/main' into jpvillam/fa_gfx950
gshtras Apr 30, 2025
0bc1d7c
No vllm.vllm_flash_attn.layers.rotary on ROCm
gshtras Apr 30, 2025
134d285
Merge remote-tracking branch 'origin/rocm_fix' into upstream_merge_20…
gshtras Apr 30, 2025
2921150
Merge remote-tracking branch 'origin/jpvillam/fa_gfx950' into upstrea…
gshtras Apr 30, 2025
f3a5bf0
Restore the function that is used elsewhere
gshtras Apr 30, 2025
8334e54
Merge remote-tracking branch 'origin/jpvillam/fa_gfx950' into upstrea…
gshtras Apr 30, 2025
200bbf9
Bump Compressed Tensors version to 0.9.4 (#17478)
rahul-tuli Apr 30, 2025
02bd654
[Misc] Rename Audios -> Audio in Qwen2audio Processing (#17507)
alex-jw-brooks May 1, 2025
dbc18e7
[CI][TPU] Skip Multimodal test (#17488)
lsy323 May 1, 2025
08fb558
[Bugfix][ROCm] Fix import error on ROCm (#17495)
gshtras May 1, 2025
1144a8e
[Bugfix] Temporarily disable gptq_bitblas on ROCm (#17411)
nlzy May 1, 2025
17b4d85
[CI][TPU] Skip structured outputs+spec decode tests on TPU (#17510)
mgoin May 1, 2025
aa4502e
[CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg (#17…
mgoin May 1, 2025
afb4429
[CI/Build] Reorganize models tests (#17459)
DarkLight1337 May 1, 2025
7ab643e
FIxing the AMD test failures caused by PR#16457 (#17511)
Alexei-V-Ivanov-AMD May 1, 2025
7a0a146
[Build] Require setuptools >= 77.0.3 for PEP 639 (#17389)
russellb May 1, 2025
90d0a54
[ROCm] Effort to reduce the number of environment variables in comman…
hongxiayang May 1, 2025
13cf6b6
[BugFix] fix speculative decoding memory leak when speculation is dis…
noah-yoshida May 1, 2025
3c3d767
[BugFix] Fix mla cpu - missing 3 required positional arguments (#17494)
LucasWilkinson May 1, 2025
26bc4bb
Avoid overwriting vllm_compile_cache.py (#17418)
youngkent May 1, 2025
fbefc8a
[Core] Enable IPv6 with vllm.utils.make_zmq_socket() (#16506)
russellb May 1, 2025
015069b
[Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content (…
chaunceyjiang May 1, 2025
a257d9b
Improve configs - `ObservabilityConfig` (#17453)
hmellor May 1, 2025
86a1f67
[Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to sele…
tishizaki May 1, 2025
1903c0b
[Frontend] Show progress bar for adding requests (#17525)
DarkLight1337 May 1, 2025
48e925f
[Misc] Clean up test docstrings and names (#17521)
DarkLight1337 May 1, 2025
2007d4d
[FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X…
tjtanaa May 1, 2025
b74d888
Fix more broken speculative decode tests (#17450)
huydhn May 1, 2025
7169f87
[doc] add streamlit integration (#17522)
reidliu41 May 1, 2025
f5a3c65
[FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe conf…
tjtanaa May 1, 2025
98060b0
[Feature][Frontend]: Deprecate --enable-reasoning (#17452)
chaunceyjiang May 1, 2025
28566d7
[ROCm] remove unsupported archs from rocm triton flash-attention supp…
hongxiayang May 1, 2025
460a2b1
[torch.compile] Add torch inductor pass for fusing silu_and_mul with …
SageMoore May 1, 2025
7423cf0
[Misc] refactor example - cpu_offload_lmcache (#17460)
reidliu41 May 1, 2025
f2e7af9
[CI/Build] Remove `awscli` dependency (#17532)
DarkLight1337 May 1, 2025
6768ff4
Move the last arguments in `arg_utils.py` to be in their final groups…
hmellor May 1, 2025
88c8304
[Model] Refactor Ovis2 to support original tokenizer (#17537)
Isotr0py May 1, 2025
4acfa33
[ROCm] update installation guide to include build aiter from source i…
hongxiayang May 1, 2025
61c299f
[Misc]add configurable cuda graph size (#17201)
CXIAAAAA May 1, 2025
c1cb05e
Fix Quark API use
gshtras May 1, 2025
9b1769d
[Bugfix] Fix lint error (#17547)
DarkLight1337 May 1, 2025
811a6c0
[ROCM] Add gfx950 to the custom attention archs (#16034)
jpvillam-amd May 1, 2025
04f2cfc
Remove duplicate code from dbrx.py (#17550)
sstamenk May 1, 2025
173daac
[Bug]change the position of cuda_graph_sizes in dataclasses (#17548)
CXIAAAAA May 1, 2025
9b70e2b
[Misc][Tools][Benchmark] Publish script to auto tune server parameter…
Chenyaaang May 1, 2025
39c0813
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 (#17504)
zixi-qi May 1, 2025
24aebae
[Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 (#17541)
mgoin May 2, 2025
afb12e4
[Doc] note that not all unit tests pass on CPU platforms (#17554)
davidxia May 2, 2025
afcb3f8
[Attention] MLA move o_proj q_proj into cuda-graph region (#17484)
LucasWilkinson May 2, 2025
292fc59
[CI] Actually run tests/kv_transfer/test_disagg.py in CI (#17555)
mgoin May 2, 2025
b4003d1
Check if bitblas is installed during support check (#17572)
mgoin May 2, 2025
f89d0e1
[Misc] Continue refactoring model tests (#17573)
DarkLight1337 May 2, 2025
f192ca9
Fix PixtralHF missing spatial_merge_size (#17571)
mgoin May 2, 2025
109e15a
Add `pt_load_map_location` to allow loading to cuda (#16869)
jerryzh168 May 2, 2025
9e2de9b
[Bugifx] Remove TritonPlaceholder from sys.modules (#17317)
Isotr0py May 2, 2025
cc2a77d
[Core] [Bugfix] Add Input Embeddings (#15428)
qthequartermasterman May 2, 2025
c777df7
[BugFix] Fix Memory Leak (#17567)
robertgshaw2-redhat May 2, 2025
d754386
[Misc] Rename assets for testing (#17575)
DarkLight1337 May 2, 2025
b8b0859
add more pytorch related tests for torch nightly (#17422)
yangw-dev May 2, 2025
6d1479c
[doc] add the print result (#17584)
reidliu41 May 2, 2025
785d75a
Automatically tell users that dict args must be valid JSON in CLI (#1…
hmellor May 2, 2025
99404f5
[Security] Fix image hash collision (#17378)
DarkLight1337 May 2, 2025
868c546
Support W8A8 INT8 MoE for compressed-tensors (#16745)
mgoin May 2, 2025
3a500cd
[doc] miss result (#17589)
reidliu41 May 2, 2025
cb23495
[Misc] Clean up input processing (#17582)
DarkLight1337 May 2, 2025
2c68ff9
Merge branch 'main' into upstream_merge_2025_04_29
gshtras May 2, 2025
4c33d67
[Bugfix] fix tmp_out and exp_sums dimensions (#17438)
hliuca May 2, 2025
29241ca
Merge remote-tracking branch 'upstream/main' into upstream_merge_2025…
gshtras May 2, 2025
0b8eaec
Re-fix Quark API
gshtras May 2, 2025
f3f620a
Using the right torch API
gshtras May 2, 2025
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
 model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
 tasks:
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
 model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
 tasks:
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
 model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
 tasks:
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
 tasks:
 - name: "gsm8k"
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
 model_name: "HandH1998/QQQ-Llama-3-8b-g128"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
 model_name: "mgoin/Minitron-4B-Base-FP8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
 model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
 model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
 tasks:
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5
 model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
 tasks:
 - name: "gsm8k"
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 -b auto -l 1319 -f 5 -t 1
 model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
 model_name: "Qwen/Qwen2-57B-A14B-Instruct"
 tasks:
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
 model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
 tasks:
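
Each baseline config above gains a one-line comment recording which harness script it belongs to. For reference, the two invocation styles the new comments distinguish — commands copied verbatim from the configs above; the vllm script takes -t for tensor parallel size, the hf script does not:

# vllm baseline: -t sets tensor parallel size
bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh \
  -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2

# hf baseline: no -t option
bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh \
  -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5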
11 changes: 6 additions & 5 deletions .buildkite/release-pipeline.yaml
@@ -1,20 +1,20 @@
 steps:
-  - label: "Build wheel - CUDA 12.4"
+  - label: "Build wheel - CUDA 12.8"
     agents:
       queue: cpu_queue_postmerge
     commands:
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
       - "mkdir artifacts"
      - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
       - "bash .buildkite/scripts/upload-wheels.sh"
     env:
       DOCKER_BUILDKIT: "1"

-  - label: "Build wheel - CUDA 12.1"
+  - label: "Build wheel - CUDA 12.6"
     agents:
       queue: cpu_queue_postmerge
     commands:
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.6.3 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
       - "mkdir artifacts"
       - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
       - "bash .buildkite/scripts/upload-wheels.sh"
@@ -48,7 +48,7 @@ steps:
       queue: cpu_queue_postmerge
     commands:
       - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
       - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

   - label: "Build and publish TPU release image"
@@ -57,6 +57,7 @@ steps:
     agents:
       queue: tpu_queue_postmerge
     commands:
+      - "git fetch --all"
       - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f docker/Dockerfile.tpu ."
       - "docker push vllm/vllm-tpu:nightly"
       - "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
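
Flattened out of the YAML, the updated CUDA 12.8 wheel step amounts to the following shell sequence (every flag copied from the step above; nothing here is new):

DOCKER_BUILDKIT=1 docker build \
  --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 \
  --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 \
  --tag vllm-ci:build-image --target build --progress plain \
  -f docker/Dockerfile .
mkdir artifacts
docker run --rm -v "$(pwd)/artifacts:/artifacts_host" vllm-ci:build-image \
  bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'
bash .buildkite/scripts/upload-wheels.sh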
74 changes: 44 additions & 30 deletions .buildkite/scripts/hardware_ci/run-amd-test.sh
@@ -75,37 +75,51 @@ HF_MOUNT="/root/.cache/huggingface"
 commands=$@
 echo "Commands:$commands"
 #ignore certain kernels tests
-if [[ $commands == *" kernels "* ]]; then
+if [[ $commands == *" kernels/core"* ]]; then
   commands="${commands} \
-  --ignore=kernels/test_attention_selector.py \
-  --ignore=kernels/test_blocksparse_attention.py \
-  --ignore=kernels/test_causal_conv1d.py \
-  --ignore=kernels/test_cutlass.py \
-  --ignore=kernels/test_encoder_decoder_attn.py \
-  --ignore=kernels/test_flash_attn.py \
-  --ignore=kernels/test_flashinfer.py \
-  --ignore=kernels/test_int8_quant.py \
-  --ignore=kernels/test_machete_gemm.py \
-  --ignore=kernels/test_mamba_ssm.py \
-  --ignore=kernels/test_marlin_gemm.py \
-  --ignore=kernels/test_moe.py \
-  --ignore=kernels/test_prefix_prefill.py \
-  --ignore=kernels/test_rand.py \
-  --ignore=kernels/test_sampler.py \
-  --ignore=kernels/test_cascade_flash_attn.py \
-  --ignore=kernels/test_mamba_mixer2.py \
-  --ignore=kernels/test_aqlm.py \
-  --ignore=kernels/test_machete_mm.py \
-  --ignore=kernels/test_mha_attn.py \
-  --ignore=kernels/test_block_fp8.py \
-  --ignore=kernels/test_cutlass_moe.py \
-  --ignore=kernels/test_mamba_ssm_ssd.py \
-  --ignore=kernels/test_attention.py \
-  --ignore=kernels/test_block_int8.py \
-  --ignore=kernels/test_fused_quant_layernorm.py \
-  --ignore=kernels/test_int8_kernel.py \
-  --ignore=kernels/test_triton_moe_ptpc_fp8.py \
-  --ignore=kernels/test_permute_cols.py"
+  --ignore=kernels/core/test_fused_quant_layernorm.py \
+  --ignore=kernels/core/test_permute_cols.py"
 fi
+
+if [[ $commands == *" kernels/attention"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/attention/stest_attention_selector.py \
+  --ignore=kernels/attention/test_blocksparse_attention.py \
+  --ignore=kernels/attention/test_encoder_decoder_attn.py \
+  --ignore=kernels/attention/test_attention_selector.py \
+  --ignore=kernels/attention/test_flash_attn.py \
+  --ignore=kernels/attention/test_flashinfer.py \
+  --ignore=kernels/attention/test_prefix_prefill.py \
+  --ignore=kernels/attention/test_cascade_flash_attn.py \
+  --ignore=kernels/attention/test_mha_attn.py \
+  --ignore=kernels/attention/test_lightning_attn.py \
+  --ignore=kernels/attention/test_attention.py"
+fi
+
+if [[ $commands == *" kernels/quantization"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/quantization/test_int8_quant.py \
+  --ignore=kernels/quantization/test_aqlm.py \
+  --ignore=kernels/quantization/test_machete_mm.py \
+  --ignore=kernels/quantization/test_block_fp8.py \
+  --ignore=kernels/quantization/test_block_int8.py \
+  --ignore=kernels/quantization/test_marlin_gemm.py \
+  --ignore=kernels/quantization/test_cutlass_scaled_mm.py \
+  --ignore=kernels/quantization/test_int8_kernel.py"
+fi
+
+if [[ $commands == *" kernels/mamba"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/mamba/test_mamba_mixer2.py \
+  --ignore=kernels/mamba/test_causal_conv1d.py \
+  --ignore=kernels/mamba/test_mamba_ssm_ssd.py"
+fi
+
+if [[ $commands == *" kernels/moe"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/moe/test_moe.py \
+  --ignore=kernels/moe/test_cutlass_moe.py \
+  --ignore=kernels/moe/test_triton_moe_ptpc_fp8.py"
+fi

 #ignore certain Entrypoints/openai tests
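
Each new block follows the same pattern: when the incoming pytest command targets a given kernels subdirectory, append --ignore flags for the tests skipped on this hardware. A minimal sketch of that pattern with a hypothetical command string (the variable contents below are illustrative, not from this PR; in the real script they arrive via "$@"):

#!/bin/bash
# Hypothetical input resembling what the CI harness passes in.
commands="pytest -v -s kernels/core"

# Same composition trick as the script above: the backslash-newline
# inside double quotes continues the string across lines.
if [[ $commands == *" kernels/core"* ]]; then
  commands="${commands} \
  --ignore=kernels/core/test_permute_cols.py"
fi

echo "$commands"
# -> pytest -v -s kernels/core   --ignore=kernels/core/test_permute_cols.py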
18 changes: 9 additions & 9 deletions .buildkite/scripts/upload-wheels.sh
@@ -50,11 +50,11 @@ aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
 if [[ $normal_wheel == *"cu118"* ]]; then
     # if $normal_wheel matches cu118, do not upload the index.html
     echo "Skipping index files for cu118 wheels"
-elif [[ $normal_wheel == *"cu121"* ]]; then
-    # if $normal_wheel matches cu121, do not upload the index.html
-    echo "Skipping index files for cu121 wheels"
+elif [[ $normal_wheel == *"cu126"* ]]; then
+    # if $normal_wheel matches cu126, do not upload the index.html
+    echo "Skipping index files for cu126 wheels"
 else
-    # only upload index.html for cu124 wheels (default wheels)
+    # only upload index.html for cu128 wheels (default wheels)
     aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
     aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
 fi
@@ -66,12 +66,12 @@ aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
 if [[ $normal_wheel == *"cu118"* ]]; then
     # if $normal_wheel matches cu118, do not upload the index.html
     echo "Skipping index files for cu118 wheels"
-elif [[ $normal_wheel == *"cu121"* ]]; then
-    # if $normal_wheel matches cu121, do not upload the index.html
-    echo "Skipping index files for cu121 wheels"
+elif [[ $normal_wheel == *"cu126"* ]]; then
+    # if $normal_wheel matches cu126, do not upload the index.html
+    echo "Skipping index files for cu126 wheels"
 else
-    # only upload index.html for cu124 wheels (default wheels)
+    # only upload index.html for cu128 wheels (default wheels)
     aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
 fi

-aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
+aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
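
Traced against the updated branches above with a few hypothetical wheel filenames (names are illustrative only; just the cu118/cu126 substrings matter), the index-upload gating behaves as follows:

for normal_wheel in "vllm+cu118.whl" "vllm+cu126.whl" "vllm.whl"; do
  if [[ $normal_wheel == *"cu118"* ]]; then
    echo "$normal_wheel: skipping index files"
  elif [[ $normal_wheel == *"cu126"* ]]; then
    echo "$normal_wheel: skipping index files"
  else
    echo "$normal_wheel: index.html uploaded (cu128 default wheel)"
  fi
done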