Merged
80 commits
21d7d67
Functionalized patterns in prep for utility
ProExpertProg Sep 6, 2025
f3b4cf1
TEMP Mostly working
ProExpertProg Sep 9, 2025
cdad3c0
TEMP: fixed rmsnorm issue (TODO assert dtypes in fused norm_quant ker…
ProExpertProg Sep 12, 2025
8e4a56f
rms works fully now, had to remove more conversions (and add them in …
ProExpertProg Sep 16, 2025
e151e6d
quant works except (torch,torch)
ProExpertProg Sep 16, 2025
14fdc8b
quant with fix for pure torch, broke others
ProExpertProg Sep 18, 2025
05a65f3
ALL WORKS
ProExpertProg Sep 18, 2025
e6b394e
Add TODO
ProExpertProg Sep 20, 2025
d96913a
Cleanup test_fusion.py, added extra layer of rms/quant
ProExpertProg Sep 25, 2025
b172747
Functionalize attn+quant patterns
ProExpertProg Sep 25, 2025
1ae80c6
Move global vllm_config to pass manager
ProExpertProg Sep 25, 2025
77835fd
Attention fusion works with custom ops
ProExpertProg Sep 25, 2025
1277999
Remove V0 attn fusion test
ProExpertProg Sep 25, 2025
d843a67
Add triton attn test to attn+quant fusion
ProExpertProg Sep 26, 2025
cdd1529
Flat product for better test names/visibility
ProExpertProg Sep 26, 2025
141a37e
Fix rmsnorm
ProExpertProg Sep 26, 2025
c6d6c3b
Refactor E2E attn fusion test
ProExpertProg Sep 26, 2025
490ac86
Add TP=2 test (untested)
ProExpertProg Sep 26, 2025
d0b1b56
improve tests by adding more cases
ProExpertProg Sep 26, 2025
47b4688
TEMP working on caplog
ProExpertProg Sep 27, 2025
ae7f56f
Temp MP workaround P2
ProExpertProg Sep 30, 2025
eb899a4
Temp MP workaround P3
ProExpertProg Sep 30, 2025
a2aa978
Test for caplog utils
ProExpertProg Oct 1, 2025
21a9f9f
Fixed tests, passing with 2.8, 2.9 tbd
ProExpertProg Oct 2, 2025
66a35a9
Update tests/compile/backend.py
ProExpertProg Oct 2, 2025
7eb1364
Update csrc/layernorm_kernels.cu
ProExpertProg Oct 2, 2025
5fef180
clean up fullgraph tests
ProExpertProg Oct 2, 2025
db479ae
TEMP allreduce fusion
ProExpertProg Oct 2, 2025
54189a9
allreduce fusion working (custom ops on)
ProExpertProg Oct 3, 2025
b7f52bf
allreduce fusion working with/without custom ops (except fp4)
ProExpertProg Oct 3, 2025
d09a278
allreduce fusion working with/without custom ops (with fp4)
ProExpertProg Oct 3, 2025
c8675ff
log depyf folder, fix context for TestBackend, fix pattern dump
ProExpertProg Oct 3, 2025
d3f95fe
fullgraph allreduce test update requirements
ProExpertProg Oct 3, 2025
4dbfcf7
Move e2e tests to new file, add to test pipeline
ProExpertProg Oct 3, 2025
31d0127
Add e2e fusions to fullgraph test (should work with Triton backend), …
ProExpertProg Oct 3, 2025
c653d24
Fix spelling, precommit
ProExpertProg Oct 4, 2025
1756f67
add back fp4
ProExpertProg Oct 4, 2025
5619bc3
clean up e2e tests
ProExpertProg Oct 10, 2025
32989d8
add pattern for final allreduce in model
ProExpertProg Oct 10, 2025
46ee626
add more comprehensive testing for quantfp8 (-rmsnorm+-quant still fa…
ProExpertProg Oct 10, 2025
a1c7fdb
add more comprehensive testing for allreduce-rmsnorm, fix fp4 (-rmsno…
ProExpertProg Oct 10, 2025
c3264d8
Fix partial match rmsnorm+quant, fix allreduce+rmsnorm match
ProExpertProg Oct 10, 2025
095277c
Simplify matcher utils by using RMSNorm.forward_static
ProExpertProg Oct 10, 2025
52f78ce
Add allreduce test to 2-gpu test
ProExpertProg Oct 11, 2025
1b1a63e
Fix e2e allreduce fusion test
ProExpertProg Oct 11, 2025
0d6e550
fix func test
ProExpertProg Oct 12, 2025
26892df
fix pass manager test
ProExpertProg Oct 12, 2025
3547b87
fix sequence parallelism test
ProExpertProg Oct 12, 2025
af1ffa7
PR review
ProExpertProg Oct 15, 2025
97b3ff2
Merge remote-tracking branch 'upstream/main' into luka/custom-op-matc…
ProExpertProg Oct 15, 2025
b5f89e5
Cleanup test_full_graph.py
ProExpertProg Oct 15, 2025
f6429e4
Cleanup test_fusion_attn.py
ProExpertProg Oct 15, 2025
8a363d3
Slight improvement for E2E fusion
ProExpertProg Oct 15, 2025
12a7c6d
Tests & docs for flat_product
ProExpertProg Oct 15, 2025
db16ee1
Merge branch 'main' into luka/custom-op-matching-2
ProExpertProg Oct 15, 2025
8ffb474
Remove/fix TODOs
ProExpertProg Oct 15, 2025
2a6299c
Fix e2e test patterns
ProExpertProg Oct 15, 2025
465ce58
Update tests/compile/test_fusion.py
ProExpertProg Oct 15, 2025
bb0254a
Merge branch 'main' into luka/custom-op-matching-2
ProExpertProg Oct 15, 2025
bcd95b5
Fix func test
ProExpertProg Oct 15, 2025
db2b1c7
Smaller model for e2e fusion test
ProExpertProg Oct 15, 2025
a3ebf0a
fix fp8 quant tests
ProExpertProg Oct 15, 2025
3943257
Restore original torch.Parameter behavior in RMSNorm
ProExpertProg Oct 15, 2025
532cbcf
Add comment to test_logger
ProExpertProg Oct 15, 2025
7e6f5b3
add flat_product example
ProExpertProg Oct 15, 2025
24f1298
PR comments: cleanup fusion passes, & matching
ProExpertProg Oct 15, 2025
de7405b
PR comments: add _custom_op suffix
ProExpertProg Oct 15, 2025
6253d5b
Add e2e to L40 distributed, move tests to start of B200 distributed
ProExpertProg Oct 15, 2025
876ef22
Fix tests, PR feedback
ProExpertProg Oct 15, 2025
e99a759
Break up B200 tests, move allreduce to H200
ProExpertProg Oct 15, 2025
a226864
Merge branch 'main' into luka/custom-op-matching-2
ProExpertProg Oct 16, 2025
ae581e1
Fix attention fusion test numerics
ProExpertProg Oct 16, 2025
c03b29b
Remove inductor graph partition from unit test (included in e2e tests)
ProExpertProg Oct 16, 2025
d2e0489
Relax tolerance for L40 fusion test
ProExpertProg Oct 16, 2025
65ef5fd
Merge branch 'main' into luka/custom-op-matching-2
ProExpertProg Oct 16, 2025
d4fe977
Fix NamedTuple
ProExpertProg Oct 16, 2025
6319e39
Update test durations
ProExpertProg Oct 16, 2025
e34d36d
More tweaking of precision
ProExpertProg Oct 16, 2025
c27a182
Merge remote-tracking branch 'upstream/main' into luka/custom-op-matc…
ProExpertProg Oct 17, 2025
c4f913d
Removed TODO
ProExpertProg Oct 17, 2025
42 changes: 30 additions & 12 deletions .buildkite/test-pipeline.yaml
@@ -416,15 +416,16 @@ steps:
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s compile/piecewise/

- label: PyTorch Fullgraph Test # 20min
timeout_in_minutes: 30
- label: PyTorch Fullgraph Test # 22min
timeout_in_minutes: 35
mirror_hardwares: [amdexperimental]
torch_nightly: true
source_file_dependencies:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_full_graph.py
- pytest -v -s compile/test_fusions_e2e.py

- label: Kernels Core Operation Test # 48min
timeout_in_minutes: 75
@@ -807,8 +808,8 @@ steps:
# Whisper needs spawn method to avoid deadlock
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper

- label: Blackwell Test # 38 min
timeout_in_minutes: 60
- label: Blackwell Test # 21 min
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
gpu: b200
# optional: true
@@ -821,8 +822,6 @@
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/fusion.py
- vllm/compilation/fusion_attn.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
@@ -839,15 +838,32 @@
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
- pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
# Fusion
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- pytest -v -s tests/compile/test_fusion_attn.py::test_attention_quant_pattern
- pytest -v -s tests/kernels/moe/test_flashinfer.py

- label: Blackwell Fusion Tests # 30 min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
gpu: b200
source_file_dependencies:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
commands:
- nvidia-smi
- pytest -v -s tests/compile/test_fusion_attn.py
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
- pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
- pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
# this runner has 2 GPUs available even though num_gpus=2 is not set
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- pytest -v -s tests/compile/test_fusions_e2e.py

- label: Blackwell GPT-OSS Eval
timeout_in_minutes: 60
@@ -1100,14 +1116,16 @@ steps:
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4

##### H200 test #####
- label: Distrubted Tests (H200) # optional
- label: Distributed Tests (H200) # optional
gpu: h200
optional: true
working_dir: "/vllm-workspace/"
num_gpus: 2
commands:
- pytest -v -s tests/compile/test_async_tp.py
- pytest -v -s tests/compile/test_sequence_parallelism.py
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm
- pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048

2 changes: 2 additions & 0 deletions csrc/layernorm_kernels.cu
@@ -392,6 +392,8 @@ void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size]
torch::Tensor& residual, // [..., hidden_size]
torch::Tensor& weight, // [hidden_size]
double epsilon) {
TORCH_CHECK(weight.scalar_type() == input.scalar_type());
TORCH_CHECK(input.scalar_type() == residual.scalar_type());
TORCH_CHECK(residual.is_contiguous());
TORCH_CHECK(weight.is_contiguous());
int hidden_size = input.size(-1);
2 changes: 2 additions & 0 deletions csrc/layernorm_quant_kernels.cu
@@ -229,6 +229,8 @@ void fused_add_rms_norm_static_fp8_quant(
double epsilon) {
TORCH_CHECK(out.is_contiguous());
TORCH_CHECK(residual.is_contiguous());
TORCH_CHECK(residual.scalar_type() == input.scalar_type());
TORCH_CHECK(weight.scalar_type() == input.scalar_type());
int hidden_size = input.size(-1);
int input_stride = input.stride(-2);
int num_tokens = input.numel() / hidden_size;
@@ -145,7 +145,11 @@ void rms_norm_dynamic_per_token_quant(
if (scale_ub.has_value()) {
TORCH_CHECK(out.dtype() == kFp8Type);
}
TORCH_CHECK(weight.dtype() == input.dtype());
TORCH_CHECK(scales.dtype() == torch::kFloat32);
if (residual) {
TORCH_CHECK(residual->scalar_type() == input.scalar_type());
}

VLLM_DISPATCH_FLOATING_TYPES(
input.scalar_type(), "rms_norm_dynamic_per_token_quant_dispatch", [&] {
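The three kernel hunks above add the same guard: the fused RMSNorm (and RMSNorm+quant) kernels now assert via TORCH_CHECK that input, residual, and weight share a dtype instead of silently reading mismatched data. The snippet below is a rough Python equivalent of that contract, included only for illustration; the helper name and the TypeError are hypothetical and not part of vLLM, and the real enforcement happens in the C++ kernels.

# Hypothetical pre-flight check mirroring the intent of the new TORCH_CHECKs;
# only plain PyTorch is assumed.
import torch

def check_fused_rms_norm_dtypes(
    input: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor
) -> None:
    if weight.dtype != input.dtype:
        raise TypeError(f"weight dtype {weight.dtype} != input dtype {input.dtype}")
    if residual.dtype != input.dtype:
        raise TypeError(f"residual dtype {residual.dtype} != input dtype {input.dtype}")

x = torch.randn(4, 64, dtype=torch.bfloat16)
res = torch.randn_like(x)
w = torch.randn(64, dtype=torch.float16)  # deliberately mismatched
check_fused_rms_norm_dtypes(x, res, w)    # raises TypeError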
25 changes: 22 additions & 3 deletions tests/compile/backend.py
@@ -3,16 +3,22 @@

import weakref
from collections.abc import Callable, Sequence
from contextlib import nullcontext
from copy import deepcopy

import depyf
from torch import fx
from torch._ops import OpOverload
from torch.fx._utils import lazy_format_graph_code

from vllm.compilation.fx_utils import find_op_nodes
from vllm.compilation.inductor_pass import InductorPass
from vllm.compilation.pass_manager import with_pattern_match_debug
from vllm.compilation.vllm_inductor_pass import VllmInductorPass
from vllm.config import VllmConfig, get_current_vllm_config
from vllm.logger import init_logger

logger = init_logger("vllm.tests.compile.backend")


class LazyInitPass(InductorPass):
@@ -45,20 +51,32 @@ class TestBackend:

def __init__(self, *passes: InductorPass | Callable[[fx.Graph], None]):
self.custom_passes = list(passes)
compile_config = get_current_vllm_config().compilation_config
self.inductor_config = compile_config.inductor_compile_config
vllm_config = get_current_vllm_config()
compile_config = vllm_config.compilation_config
# Deepcopy to allow multiple TestBackend instances to use the same VllmConfig
self.inductor_config = deepcopy(compile_config.inductor_compile_config)
self.inductor_config["force_disable_caches"] = True
self.inductor_config["post_grad_custom_post_pass"] = self.post_pass

if debug_dump_path := vllm_config.compile_debug_dump_path():
logger.debug("Dumping depyf output to %s", debug_dump_path)
self.debug_ctx = depyf.prepare_debug(debug_dump_path.as_posix())
else:
self.debug_ctx = nullcontext()

def __call__(self, graph: fx.GraphModule, example_inputs):
self.graph_pre_compile = deepcopy(graph)
from torch._inductor.compile_fx import compile_fx

return compile_fx(graph, example_inputs, config_patches=self.inductor_config)
with self.debug_ctx:
return compile_fx(
graph, example_inputs, config_patches=self.inductor_config
)

@with_pattern_match_debug
def post_pass(self, graph: fx.Graph):
self.graph_pre_pass = deepcopy(graph)
lazy_format_graph_code("graph_pre_pass", graph.owning_module)

VllmInductorPass.dump_prefix = 0
for pass_ in self.custom_passes:
@@ -68,6 +86,7 @@ def post_pass(self, graph: fx.Graph):
VllmInductorPass.dump_prefix = None

self.graph_post_pass = deepcopy(graph)
lazy_format_graph_code("graph_post_pass", graph.owning_module)
# assign by reference, will reflect the final state of the graph
self.final_graph = graph

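For context on how the updated TestBackend is exercised, here is a minimal usage sketch; it is not taken from the PR. It assumes vLLM and depyf are installed, that the tests/ tree is importable, and that activating a default VllmConfig via set_current_vllm_config is sufficient for a toy function. The backend instance is handed to torch.compile, so the listed passes run as an Inductor post-grad pass and the pre/post graph snapshots are captured.

import torch
from vllm.config import VllmConfig, set_current_vllm_config
from tests.compile.backend import TestBackend

def noop_pass(graph: torch.fx.Graph) -> None:
    # a custom pass receives the post-grad FX graph and may rewrite it in place
    pass

def f(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) + 1.0

with set_current_vllm_config(VllmConfig()):
    backend = TestBackend(noop_pass)
    compiled = torch.compile(f, backend=backend)
    compiled(torch.randn(8))

# backend.graph_pre_pass / backend.graph_post_pass now hold copies of the FX
# graph before and after the custom passes ran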