Merged

73 commits
c40692b
[Misc] Add parallel state `node_count` function (#20045)
njhill Jun 25, 2025
4e0db57
Fix the path to the testing script. (#20082)
QiliangCui Jun 25, 2025
9f0608f
[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine (…
izhuhaoran Jun 25, 2025
2cc2069
[TPU][Bugfix] fix kv cache padding (#20048)
yaochengji Jun 25, 2025
55c65ab
[P/D] Avoid stranding blocks in P when aborted in D's waiting queue (…
njhill Jun 25, 2025
2d7620c
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN (#19919)
Chenyaaang Jun 25, 2025
296ce95
[CI] Add SM120 to the Dockerfile (#19794)
mgoin Jun 25, 2025
754b00e
[Bugfix] Fix Mistral tool-parser regex for nested JSON (#20093)
mgoin Jun 26, 2025
2582683
[PD] Skip `tp_size` exchange with rank0 (#19413)
NickLucche Jun 26, 2025
9502c38
[Benchmark][Bug] Fix multiple bugs in bench and add args to spec_deco…
ekagra-ranjan Jun 26, 2025
65397e4
[Bugfix] Allow `CUDA_VISIBLE_DEVICES=''` in `Platform.device_id_to_ph…
eicherseiji Jun 26, 2025
1d7c29f
[Doc] Update docs for New Model Implementation (#20115)
DarkLight1337 Jun 26, 2025
d188913
[Refactor] Remove unused library (#20099)
yewentao256 Jun 26, 2025
0567c82
[CPU] Fix torch version in x86 CPU backend (#19258)
bigPYJ1151 Jun 26, 2025
167aca4
[Misc] Use collapsible blocks for benchmark examples. (#20017)
reidliu41 Jun 26, 2025
84c260c
[Docs] Improve frameworks/helm.md (#20113)
windsonsea Jun 26, 2025
27c065d
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break a…
tjtanaa Jun 26, 2025
1f5d178
Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 …
mgoin Jun 26, 2025
c894c5d
[Bug Fix] Fix address/port already in use error for deep_ep test (#20…
yewentao256 Jun 26, 2025
0907d50
[Doc] Automatically signed-off by PyCharm (#20120)
noooop Jun 26, 2025
6393b03
[Doc] Auto sign-off for VSCode (#20132)
DarkLight1337 Jun 26, 2025
34878a0
[Doc] Rename page titles (#20130)
DarkLight1337 Jun 26, 2025
0bceac9
Spam folks if config.py changes (#20131)
tlrmchlsmth Jun 26, 2025
b69781f
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention b…
jikunshang Jun 26, 2025
04e1642
[TPU] add kv cache update kernel (#19928)
yaochengji Jun 26, 2025
5623088
[Refactor] Rename commnication utils (#20091)
yewentao256 Jun 26, 2025
07b8fae
[Doc] correct LoRA capitalization (#20135)
kyolebu Jun 26, 2025
e9fd658
[Feature] Expert Parallelism Load Balancer (EPLB) (#18343)
abmfy Jun 26, 2025
71799fd
[CI Failure] Fix OOM with test_oot_registration_embedding (#20144)
mgoin Jun 27, 2025
a57d57f
[Quantization] Bump to use latest `compressed-tensors` (#20033)
dsikka Jun 27, 2025
2d7779f
[Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler (#20071)
ilmarkov Jun 27, 2025
44d2e6a
[Bugfix] Build moe_data for both sm100 and sm90 (#20086)
mgoin Jun 27, 2025
0740e29
[Feature] add quick all reduce (#19744)
lihaoyang-amd Jun 27, 2025
8b64c89
[CI] Sync test dependency with test.in for torch nightly (#19632)
yangw-dev Jun 27, 2025
e110930
[Fix] Fix gemma CI test failing on main (#20124)
tdoublep Jun 27, 2025
cd4cfee
[Model][1/N] Automatic conversion of CrossEncoding model (#20012)
noooop Jun 27, 2025
6e244ae
[Perf][Frontend] eliminate api_key and x_request_id headers middlewar…
Yazan-Sharaya Jun 27, 2025
dec197e
Quick Fix by adding conditional import for flash_attn_varlen_func in …
xuechendi Jun 27, 2025
d1c956d
Gemma3n (Text-only) (#20134)
robertgshaw2-redhat Jun 27, 2025
4ab3ac2
[Bugfix] Fix flaky failure when getting DP ports (#20151)
mgoin Jun 27, 2025
aa0dc77
[Perf] Improved perf for resolve_chat_template_content_format (#20065)
ilyal-cerebras Jun 27, 2025
94a55c7
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12 (#…
hyoon1 Jun 27, 2025
aafabaa
[Fix][torch.compile] Enable custom ops by default when Inductor off (…
ProExpertProg Jun 27, 2025
c6c9830
[Bugfix] Mark 'hidden_states' as mutable in moe_forward registration.…
bnellnm Jun 27, 2025
e8c3bd2
[Bugfix] Fix some narrowing conversion warnings (#20141)
tlrmchlsmth Jun 27, 2025
3c545c0
[CI/Build] Allow hermetic builds (#18064)
fabiendupont Jun 27, 2025
c329cec
[CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to rev…
mgoin Jun 28, 2025
e53be6f
[Misc] Add type assertion of request_id for LLMEngine.add_request (#1…
SHA-4096 Jun 28, 2025
a29e62e
Fix num_token_padding support for static per-tensor scaled_fp8_quant …
mgoin Jun 28, 2025
d45417b
fix ci issue distributed 4 gpu test (#20204)
yewentao256 Jun 28, 2025
f719772
[Bugfix] Properly reject requests with empty list guided_choice (#20195)
mgoin Jun 28, 2025
7b460c2
[BugFix] Fix the incorrect func name in the comments. (config.py) (#2…
1195343015 Jun 28, 2025
8615d97
[CI/Build] Add new CI job to validate Hybrid Models for every PR (#2…
tdoublep Jun 28, 2025
daceac5
[Frontend] Generalize `v1/audio/transcriptions` endpoint (#20179)
NickLucche Jun 28, 2025
daec9de
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel exec…
s3woz Jun 28, 2025
4d36693
[Refactor] Create a function util and cache the results for `has_deep…
yewentao256 Jun 28, 2025
7b1895e
[CI Fix] Try fixing eagle e2e test OOM by reducing block allocation (…
mgoin Jun 29, 2025
6f2f53a
[Quantization] Add compressed-tensors NVFP4 MoE Support (#19990)
dsikka Jun 29, 2025
6c9837a
Fix cuda_archs_loose_intersection when handling sm_*a (#20207)
huydhn Jun 29, 2025
65b1cbb
[Model] support dots1 (#18254)
redmoe-moutain Jun 30, 2025
5a52f38
[BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized ass…
xuechendi Jun 30, 2025
19108ef
[Misc] Fix import (#20233)
WoosukKwon Jun 30, 2025
022c58b
[doc] Add Slack and Forum to the top navigation (#20208)
reidliu41 Jun 30, 2025
f5dfa07
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model (…
noiji Jun 30, 2025
e936e40
[Bugfix] Fix processor initialization in transformers 4.53.0 (#20244)
Isotr0py Jun 30, 2025
8fe7fc8
[Quantization] Improve BitsAndBytesModelLoader (#20242)
jeejeelee Jun 30, 2025
3ee56e2
[Docs] Fix 1-2-3 list in v1/prefix_caching.md (#20243)
windsonsea Jun 30, 2025
1c50e10
[Bugfix] fix quark ptpc (#20251)
lihaoyang-amd Jun 30, 2025
2062c07
[Spec Decode] Refactor spec decoding into a separate function (#20238)
WoosukKwon Jun 30, 2025
2965c99
[Spec Decode] Clean up spec decode example (#20240)
WoosukKwon Jun 30, 2025
2863bef
[Optimization] Use Shared `CachedRequestData` Instance Across All Req…
WoosukKwon Jun 30, 2025
551ef16
[Unit Test] Add unit test for deep gemm (#20090)
yewentao256 Jun 30, 2025
d171777
Merge remote-tracking branch 'upstream/main'
gshtras Jun 30, 2025
Files changed

.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh (2 additions, 0 deletions)

@@ -159,6 +159,8 @@ run_and_track_test 14 "test_tpu_qkv_linear.py" \
 "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
 run_and_track_test 15 "test_spmd_model_weight_loading.py" \
 "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
+run_and_track_test 16 "test_kv_cache_update_kernel.py" \
+"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"
 
 # After all tests have been attempted, exit with the overall status.
 if [ "$overall_script_exit_code" -ne 0 ]; then

.buildkite/scripts/hardware_ci/run-xpu-test.sh (1 addition, 0 deletions)

@@ -28,4 +28,5 @@ docker run \
 sh -c '
 VLLM_USE_V1=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m
 VLLM_USE_V1=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m -tp 2
+VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
 '

.buildkite/scripts/tpu/docker_run_bm.sh (1 addition, 1 deletion)

@@ -68,7 +68,7 @@ docker run \
 
 echo "run script..."
 echo
-docker exec "$CONTAINER_NAME" /bin/bash -c ".buildkite/scripts/hardware_ci/run_bm.sh"
+docker exec "$CONTAINER_NAME" /bin/bash -c ".buildkite/scripts/tpu/run_bm.sh"
 
 echo "copy result back..."
 VLLM_LOG="$LOG_ROOT/$TEST_NAME"_vllm_log.txt

.buildkite/test-pipeline.yaml (42 additions, 2 deletions)
@@ -41,6 +41,16 @@
   # TODO: add `--strict` once warnings in docstrings are fixed
   - mkdocs build
 
+- label: Pytorch Nightly Dependency Override Check # 2min
+  # if this test fails, it means the nightly torch version is not compatible with some
+  # of the dependencies. Please check the error message and add the package to whitelist
+  # in /vllm/tools/generate_nightly_torch_test.py
+  soft_fail: true
+  source_file_dependencies:
+  - requirements/nightly_torch_test.txt
+  commands:
+  - bash standalone_tests/pytorch_nightly_dependency.sh
+
 - label: Async Engine, Inputs, Utils, Worker Test # 24min
   mirror_hardwares: [amdexperimental]
   source_file_dependencies:
@@ -168,6 +178,23 @@
   - VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
   - popd
 
+- label: EPLB Algorithm Test
+  working_dir: "/vllm-workspace/tests"
+  source_file_dependencies:
+  - vllm/distributed/eplb
+  - tests/distributed/test_eplb_algo.py
+  commands:
+  - pytest -v -s distributed/test_eplb_algo.py
+
+- label: EPLB Execution Test # 5min
+  working_dir: "/vllm-workspace/tests"
+  num_gpus: 4
+  source_file_dependencies:
+  - vllm/distributed/eplb
+  - tests/distributed/test_eplb_execute.py
+  commands:
+  - pytest -v -s distributed/test_eplb_execute.py
+
 - label: Metrics, Tracing Test # 10min
   mirror_hardwares: [amdexperimental, amdproduction]
   num_gpus: 2
@@ -509,6 +536,17 @@
   - pip freeze | grep -E 'torch'
   - pytest -v -s models/language -m core_model
 
+- label: Language Models Test (Hybrid) # 35 min
+  mirror_hardwares: [amdexperimental]
+  torch_nightly: true
+  source_file_dependencies:
+  - vllm/
+  - tests/models/language/generation
+  commands:
+  # Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
+  - pip install 'git+https://github.com/Dao-AILab/[email protected]'
+  - pytest -v -s models/language/generation -m hybrid_model
+
 - label: Language Models Test (Extended Generation) # 1hr20min
   mirror_hardwares: [amdexperimental]
   optional: true
@@ -518,7 +556,7 @@
   commands:
   # Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
   - pip install 'git+https://github.com/Dao-AILab/[email protected]'
-  - pytest -v -s models/language/generation -m 'not core_model'
+  - pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
 
 - label: Language Models Test (Extended Pooling) # 36min
   mirror_hardwares: [amdexperimental]
@@ -619,11 +657,13 @@
   commands:
   - # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
   - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
+  - NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
   - python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
   - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
   - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
   - # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
   - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
+  - NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
   - python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
 
 - label: Distributed Tests (2 GPUs) # 40min
@@ -748,7 +788,7 @@
   - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt
 
 - label: Weight Loading Multiple GPU Test - Large Models # optional
-  mirror_hardwares: [amdexperimental]
+  mirror_hardwares: [amdexperimental]
   working_dir: "/vllm-workspace/tests"
   num_gpus: 2
   gpu: a100

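The two NUM_NODES=2 commands added to the 2-node block above run distributed/test_node_count.py and exercise the parallel-state node_count helper introduced in commit c40692b. Below is a minimal sketch of such a check, assuming a torchrun launch on each node and a gloo process group; the count_nodes helper here is illustrative only, not vLLM's actual implementation.

# Minimal sketch of a node-count check (illustrative; the real
# tests/distributed/test_node_count.py and node_count() helper may differ).
import os
import socket

import torch.distributed as dist


def count_nodes() -> int:
    """Count distinct hosts in the default process group by gathering hostnames."""
    hostnames = [None] * dist.get_world_size()
    dist.all_gather_object(hostnames, socket.gethostname())
    return len(set(hostnames))


if __name__ == "__main__":
    # Launched via torchrun on every node, as in the pipeline commands above.
    dist.init_process_group(backend="gloo")
    expected = int(os.environ["NUM_NODES"])
    found = count_nodes()
    assert found == expected, f"expected {expected} nodes, got {found}"
    print("Node count test passed")  # each node's pipeline command greps for this line
    dist.destroy_process_group()
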
.pre-commit-config.yaml (5 additions, 0 deletions)

@@ -53,6 +53,11 @@
     files: ^requirements/test\.(in|txt)$
 - repo: local
   hooks:
+  - id: format-torch-nightly-test
+    name: reformat nightly_torch_test.txt to be in sync with test.in
+    language: python
+    entry: python tools/generate_nightly_torch_test.py
+    files: ^requirements/test\.(in|txt)$
   - id: mypy-local
     name: Run mypy for local Python installation
     entry: tools/mypy.sh 0 "local"

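The format-torch-nightly-test hook above, together with the whitelist note in the new Pytorch Nightly Dependency Override Check step in .buildkite/test-pipeline.yaml, keeps requirements/nightly_torch_test.txt derived from requirements/test.in. A hypothetical sketch of that kind of sync script follows; the whitelist contents, function name, and matching logic are assumptions, not the behavior of the real tools/generate_nightly_torch_test.py.

# Hypothetical sketch of the sync idea behind the new pre-commit hook.
from pathlib import Path

WHITELIST = {"torch", "torchvision", "torchaudio"}  # assumed example entries


def generate(src: str = "requirements/test.in",
             dst: str = "requirements/nightly_torch_test.txt") -> None:
    """Copy test.in, dropping packages whose pins would clash with nightly torch."""
    kept = []
    for line in Path(src).read_text().splitlines():
        name = line.split("==")[0].split(">=")[0].strip().lower()
        if name not in WHITELIST:
            kept.append(line)
    Path(dst).write_text("\n".join(kept) + "\n")


if __name__ == "__main__":
    generate()
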
CMakeLists.txt (31 additions, 3 deletions)

@@ -513,6 +513,7 @@
 CUDA_ARCHS "${FP4_ARCHS}")
 list(APPEND VLLM_EXT_SRC "${SRCS}")
 list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4=1")
+list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
 message(STATUS "Building NVFP4 for archs: ${FP4_ARCHS}")
 else()
 message(STATUS "Not building NVFP4 as no compatible archs were found.")

@@ -547,8 +548,7 @@
 # if it's possible to compile MoE kernels that use its output.
 cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a" "${CUDA_ARCHS}")
 if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS)
-set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu"
-"csrc/quantization/cutlass_w8a8/moe/moe_data.cu")
+set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu")
 set_gencode_flags_for_srcs(
 SRCS "${SRCS}"
 CUDA_ARCHS "${SCALED_MM_ARCHS}")
@@ -562,7 +562,27 @@
 "if you intend on running FP8 quantized MoE models on Hopper.")
 else()
 message(STATUS "Not building grouped_mm_c3x as no compatible archs found "
-"in CUDA target architectures")
+"in CUDA target architectures.")
 endif()
 endif()
+
+# moe_data.cu is used by all CUTLASS MoE kernels.
+cuda_archs_loose_intersection(CUTLASS_MOE_DATA_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}")
+if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND CUTLASS_MOE_DATA_ARCHS)
+set(SRCS "csrc/quantization/cutlass_w8a8/moe/moe_data.cu")
+set_gencode_flags_for_srcs(
+SRCS "${SRCS}"
+CUDA_ARCHS "${CUTLASS_MOE_DATA_ARCHS}")
+list(APPEND VLLM_EXT_SRC "${SRCS}")
+message(STATUS "Building moe_data for archs: ${CUTLASS_MOE_DATA_ARCHS}")
+else()
+if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND CUTLASS_MOE_DATA_ARCHS)
+message(STATUS "Not building moe_data as CUDA Compiler version is "
+"not >= 12.3, we recommend upgrading to CUDA 12.3 or later "
+"if you intend on running FP8 quantized MoE models on Hopper or Blackwell.")
+else()
+message(STATUS "Not building moe_data as no compatible archs found "
+"in CUDA target architectures.")
+endif()
+endif()
 
@@ -638,6 +658,14 @@
 # if CUDA endif
 endif()
 
+if (VLLM_GPU_LANG STREQUAL "HIP")
+# Add QuickReduce kernels
+list(APPEND VLLM_EXT_SRC
+"csrc/custom_quickreduce.cu"
+)
+# if ROCM endif
+endif()
+
 message(STATUS "Enabling C extension.")
 define_gpu_extension_target(
 _C

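For context on the moe_data changes above: the new block compiles moe_data.cu whenever the requested arch list "9.0a;10.0a" intersects the build's CUDA_ARCHS, so both the SM90 and SM100 CUTLASS MoE kernels can use it. A rough Python sketch of that selection idea follows; it is a simplified stand-in for the cuda_archs_loose_intersection CMake macro, which also handles the sm_*a suffix cases fixed in commit 6c9837a.

# Simplified sketch of the arch-selection idea (not the real CMake macro).
def loose_intersection(requested: str, cuda_archs: list[str]) -> list[str]:
    """Keep requested archs (semicolon-separated, e.g. "9.0a;10.0a") that the
    build also targets."""
    targets = set(cuda_archs)
    return [arch for arch in requested.split(";") if arch in targets]


# moe_data.cu would be built here because 9.0a (Hopper) is among the build targets.
print(loose_intersection("9.0a;10.0a", ["8.0", "8.9", "9.0a"]))  # ['9.0a']
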