[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds #24248

ilmarkov · 2025-09-04T12:23:45Z

First part of splitting #22086.

Purpose

Tune the input size thresholds for FlashInfer allreduce fusion.

Adds a benchmark for allreduce fusion that determines the input size thresholds for FlashInfer allreduce.
Updates the FlashInfer allreduce thresholds on Hopper and Blackwell devices, and uses the two-shot algorithm where it performs better.

Moves allreduce out of the moe_forward custom op so that it can be matched for fusion in MoE models.

Test Plan

Added tests for fusion without custom ops.

Based on #24604

Review link: https://github.com/vllm-project/vllm/pull/24248/files/6253d5bd143a1975213462e7d6c4f8d3a2e1fef7..7088940db26bdee8554418d92ea060279ea7f523

mergify · 2025-09-04T12:24:52Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request is a significant enhancement to the all-reduce fusion capabilities, adding support for matching native PyTorch operations in addition to custom ops. This greatly improves usability and performance flexibility. The introduction of a comprehensive benchmark for tuning fusion thresholds is also a valuable addition. The changes are extensive, particularly with the large number of new fusion patterns in vllm/compilation/collective_fusion.py. While the overall approach is sound, I've identified several critical issues in the implementation of these new patterns. Specifically, the return values from some pattern and replacement functions appear to be incorrect, which could lead to fusion failures or incorrect model outputs. I've provided detailed comments and suggestions for these issues. The configuration updates and the new benchmark script are well-implemented and welcome improvements.

gemini-code-assist · 2025-09-04T12:27:26Z

vllm/compilation/collective_fusion.py

The return values from the replacement function are incorrect. The pattern returns (rms_output, allreduce_output), which correspond to the normalized output and the all-reduced tensor. The replacement function should return the same structure.

auto_functionalized(flashinfer_trtllm_fused_allreduce_norm, ...) returns a tuple of 5 mutated arguments: (allreduce_in, residual, norm_out, quant_out, scale_out).

The rms_result corresponds to norm_out, which is allreduce[2].
The allreduce_in (which is input to the replacement function) corresponds to allreduce[0].

Therefore, the return statement should be return allreduce[2], allreduce[0].

The current code returns allreduce[3], allreduce[1], which corresponds to (quant_out, residual). This is incorrect and will lead to fusion failures or wrong results.

Suggested change

-    return allreduce[3], allreduce[1]
+    return allreduce[2], allreduce[0]
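
The index choice hinges on how the functionalized result tuple is laid out. The following is a pure-Python model, not the torch API: the argument order is taken from the comment above, while `index_of` and the `has_return_slot` flag are hypothetical names. It assumes the only open question is whether slot 0 holds the op's own return value ahead of the mutated arguments, which is worth confirming against the installed PyTorch before applying the suggestion.

```python
# Mutated arguments in the order listed in the review comment.
MUTATED_ARGS = ("allreduce_in", "residual", "norm_out", "quant_out", "scale_out")


def index_of(name: str, has_return_slot: bool) -> int:
    """Position of a mutated argument in the functionalized result tuple.

    has_return_slot=True models a layout where index 0 is reserved for the
    op's own return value and the mutated arguments start at index 1.
    """
    offset = 1 if has_return_slot else 0
    return MUTATED_ARGS.index(name) + offset


# Layout as described in the comment (mutated arguments start at index 0).
print(index_of("norm_out", False), index_of("allreduce_in", False))

# Layout with a leading return slot (mutated arguments start at index 1).
print(index_of("norm_out", True), index_of("allreduce_in", True))
```

If the tuple does start with a return slot, the existing `return allreduce[3], allreduce[1]` already selects (norm_out, allreduce_in).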

gemini-code-assist · 2025-09-04T12:27:26Z

vllm/compilation/collective_fusion.py

The return values from the replacement function are incorrect. The pattern returns (rms_output, rms_residual), which are the normalized output and the residual output. The replacement function should return the same structure.

When norm_out=None is passed to flashinfer_trtllm_fused_allreduce_norm, the allreduce_in tensor is used as the output buffer for the normalization result and is mutated. auto_functionalized will return a tuple where the first element (allreduce[0]) is the mutated allreduce_in (i.e., norm_out), and the second element (allreduce[1]) is the mutated residual.

Therefore, the correct return should be return allreduce[0], allreduce[1].

The current code returns allreduce[1], allreduce[2], which corresponds to (residual, norm_out). Since norm_out is None in the call, this is incorrect.

Suggested change

-    return allreduce[1], allreduce[2]
+    return allreduce[0], allreduce[1]

nvpohanh · 2025-09-05T01:18:28Z

vllm/compilation/collective_fusion.py

Just curious: why is the threshold still so low for TP8? I think AR+Norm should have pretty good perf up to some larger message sizes for TP8?

vllm/compilation/collective_fusion.py

nvpohanh · 2025-09-05T01:24:01Z

vllm/config/compilation.py

why is it 1MB for TP8?

@nvpohanh Here are the results for TP=8 on Blackwell with torch symm mem (VLLM_ALLREDUCE_USE_SYMM_MEM=1) enabled (first set of results below). I compared against the best-performing alternative to fused allreduce. We could probably make the threshold conditional on whether symm mem is available and enabled, but in my opinion that would overcomplicate the configuration. Compared to the default allreduce, the FlashInfer fused alternative is not significantly better in the 4-16 MB region (see the results at the end).

Symm mem enabled

World Size: 8
Hidden Dimension: 8192
Warmup Iterations: 5
Benchmark Trials: 20
Quantization Mode: none
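
The input sizes reported in these results follow from the tensor shape: seq_len x hidden dimension x bytes per element. A quick check, assuming bfloat16 takes 2 bytes per element and that "MB" here means 2^20 bytes (the helper name is illustrative and not taken from the benchmark script):

```python
def input_size_mb(seq_len: int, hidden_dim: int, bytes_per_elem: int = 2) -> float:
    """Size of a [seq_len, hidden_dim] tensor in units of 2**20 bytes."""
    return seq_len * hidden_dim * bytes_per_elem / (1024 * 1024)


# Hidden dimension 8192, as in the benchmark header.
for seq_len in (32, 64, 128, 256, 512, 1024, 2048):
    print(f"seq_len={seq_len}: {input_size_mb(seq_len, 8192):.2f} MB")
```

With these assumptions seq_len=32 maps to 0.50 MB and seq_len=2048 to 32.00 MB, matching the sizes listed per configuration.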

Configuration: seq_len=32, dtype=bfloat16, no residual
Input Size: 0.50 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.029	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.030	0.99x
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.012	2.39x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.086	0.34x

Configuration: seq_len=64, dtype=bfloat16, no residual
Input Size: 1.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.030	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.030	0.99x
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.018	1.62x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.056	0.54x

Configuration: seq_len=128, dtype=bfloat16, no residual
Input Size: 2.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.023	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.024	0.99x
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.033	0.71x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.052	0.45x

Configuration: seq_len=256, dtype=bfloat16, no residual
Input Size: 4.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.031	0.97x
Standard Allreduce Rmsnorm Native Compiled	0.030	baseline
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.064	0.47x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.050	0.60x

Configuration: seq_len=256, dtype=bfloat16, no residual
Input Size: 4.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.031	0.97x
Standard Allreduce Rmsnorm Native Compiled	0.030	baseline
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.049	0.61x

Configuration: seq_len=512, dtype=bfloat16, no residual
Input Size: 8.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.044	0.98x
Standard Allreduce Rmsnorm Native Compiled	0.043	baseline
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.297	0.15x

Configuration: seq_len=1024, dtype=bfloat16, no residual
Input Size: 16.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.071	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.077	0.93x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.109	0.66x

Configuration: seq_len=2048, dtype=bfloat16, no residual
Input Size: 32.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.135	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.143	0.94x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.205	0.66x

Default allreduce

Configuration: seq_len=32, dtype=bfloat16, no residual
Input Size: 0.50 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.029	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.030	0.99x
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.012	2.44x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.087	0.34x

Configuration: seq_len=64, dtype=bfloat16, no residual
Input Size: 1.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.030	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.030	1.00x
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.019	1.63x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.056	0.54x

Configuration: seq_len=128, dtype=bfloat16, no residual
Input Size: 2.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.032	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.032	1.00x
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.033	0.97x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.052	0.62x

Configuration: seq_len=256, dtype=bfloat16, no residual
Input Size: 4.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.051	0.98x
Standard Allreduce Rmsnorm Native Compiled	0.050	baseline
Flashinfer Fused Allreduce Rmsnorm Oneshot	0.064	0.77x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.050	1.00x

Configuration: seq_len=512, dtype=bfloat16, no residual
Input Size: 8.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.079	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.081	0.97x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.068	1.17x

Configuration: seq_len=1024, dtype=bfloat16, no residual
Input Size: 16.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.119	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.125	0.95x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.109	1.09x

Configuration: seq_len=2048, dtype=bfloat16, no residual
Input Size: 32.00 MB

Operation	Time (ms)	Speedup
Standard Allreduce Rmsnorm	0.195	1.00x
Standard Allreduce Rmsnorm Native Compiled	0.211	0.93x
Flashinfer Fused Allreduce Rmsnorm Twoshot	0.204	0.96x
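
To make the 1 MB choice for TP8 concrete, here is a small sketch that derives a threshold from the symm mem run above. For each input size it takes the fastest non-fused time and the fastest FlashInfer fused time from the tables, then keeps the largest size up to which fusion still wins. The selection rule and the function name are assumptions for illustration; the PR's benchmark script may pick thresholds differently.

```python
# (input size in MB, best non-fused time in ms, best FlashInfer fused time in ms)
# Values copied from the "Symm mem enabled" tables for TP=8.
SYMM_MEM_ROWS = [
    (0.5, 0.029, 0.012),
    (1.0, 0.030, 0.018),
    (2.0, 0.023, 0.033),
    (4.0, 0.030, 0.050),
    (8.0, 0.043, 0.297),
    (16.0, 0.071, 0.109),
    (32.0, 0.135, 0.205),
]


def largest_winning_size(rows: list) -> float:
    """Largest input size (MB) below which fused allreduce is always faster."""
    threshold = 0.0
    for size_mb, baseline_ms, fused_ms in sorted(rows):
        if fused_ms < baseline_ms:
            threshold = size_mb
        else:
            break  # stop at the first size where fusion no longer wins
    return threshold


print(largest_winning_size(SYMM_MEM_ROWS))
```

On this data the rule returns 1.0, which is consistent with the 1 MB default discussed for TP8 when symm mem is the alternative.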

@ilmarkov Is VLLM_ALLREDUCE_USE_SYMM_MEM=1 something that normal vLLM users would set by default? If it's good for performance, why can't we enable it by default? Does it require special environment or special builds? cc @ProExpertProg

@nvjullin Could you check if @ilmarkov 's measurements above match our understanding? Also, could you try if VLLM_ALLREDUCE_USE_SYMM_MEM=1 works in our case? Thanks!

Yes, it can be enabled by default. There is a PR for it. It works on Hopper and Blackwell.

Got it! we will try both your PRs and run some experiments on our side.

@ilmarkov Just to clarify: the PyTorch SYMM_MEM implementation does not support AR+Norm fusion, right? So only the AR part uses SYMM_MEM while Norm part is based on native PyT?

Yes, symm mem is only used for the allreduce part; the norm and quantization parts run in native PyTorch.

nvpohanh · 2025-09-05T01:24:35Z

cc @nvjullin @elvischenv for vis

mergify · 2025-09-10T03:32:30Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

vllm/compilation/collective_fusion.py

nvpohanh · 2025-10-09T01:19:08Z

Hi @ilmarkov , is there any progress and ETA for this change? Thanks!

ilmarkov · 2025-10-09T15:03:53Z

Hi @nvpohanh. @ProExpertProg is working on general custom op matching in #24604, so we will apply the allreduce-related pattern matching after his PR lands. I'm marking this PR as a draft for now.

Signed-off-by: Luka Govedič <[email protected]>

Signed-off-by: ilmarkov <[email protected]>

mergify · 2025-10-16T20:14:24Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


chatgpt-codex-connector · 2025-10-16T20:23:10Z

vllm/compilation/fusion.py

 def empty_bf16(*args, **kwargs):
-    return torch.empty(*args, **kwargs, dtype=torch.bfloat16, device="cuda")
+    return torch.empty(*args, **kwargs, dtype=torch.float16, device="cuda")


Restore bfloat16 in pattern placeholders

The helper empty_bf16 now creates tensors with torch.float16 instead of torch.bfloat16. This helper is used throughout the fusion passes (e.g. attention and activation fusion) to trace the FX patterns that should match bfloat16 graphs. Tracing the pattern in float16 means the captured graph contains dtype-specific ops (such as implicit casts) that no longer match the bfloat16 graphs emitted by models, so bfloat16 models will stop triggering these fusion passes. The helper should keep returning torch.bfloat16 to ensure the traced pattern matches bfloat16 execution.


ProExpertProg

Can we also add a test for the default setting of the config param?

ProExpertProg · 2025-10-17T16:13:43Z

tests/compile/test_fusion_all_reduce.py

        )
        backend.check_before_ops(model.ops_in_model_before(), fully_replaced=False)
        backend.check_after_ops(model.ops_in_model_after())
-        del all_reduce_fusion_pass


Unnecessary change?

ProExpertProg · 2025-10-17T16:15:33Z

vllm/compilation/collective_fusion.py


-import vllm.envs as envs
-from vllm.config import VllmConfig
+from vllm.config import VllmConfig, set_current_vllm_config


Is this used?

ProExpertProg · 2025-10-17T16:16:22Z

vllm/compilation/collective_fusion.py

        self.max_token_num = max_token_num
        self.fuse_rms_quant = fuse_rms_quant
-
+    


ProExpertProg · 2025-10-17T16:17:20Z

vllm/compilation/collective_fusion.py

+                fuse_rms_quant):
                # Do fused rms norm static fp8 quant fused op
                if norm_out is None:
                    torch.ops._C.fused_add_rms_norm_static_fp8_quant(


I think we should just always use the fused op - it should be faster

ProExpertProg · 2025-10-17T16:25:37Z

vllm/config/compilation.py

+    fi_allreduce_fusion_max_size_mb: dict[int,
+                                          float] = field(default_factory=dict)


Suggested change

-    fi_allreduce_fusion_max_size_mb: dict[int,
-                                          float] = field(default_factory=dict)
+    fi_allreduce_fusion_max_size_mb: dict[int, float] = (
+        field(default_factory=lambda: deepcopy(resolve_obj_by_qualname("vllm.compilation.fusion_all_reduce._FI_ALLREDUCE_MAX_INPUT_SIZES")))
+    )

Okay, I see below that it's more complex than that. What about:

Suggested change

-    fi_allreduce_fusion_max_size_mb: dict[int,
-                                          float] = field(default_factory=dict)
+    fi_allreduce_fusion_max_size_mb: dict[int, float] = (
+        field(default_factory=PassConfig.fi_allreduce_fusion_max_size_mb)
+    )

And then below we can define:

    @staticmethod
    def default_fi_allreduce_fusion_max_size_mb():
        if not current_platform.is_cuda():
            return None
        from vllm.compilation.fusion_all_reduce import FI_ALLREDUCE_FUSION_MAX_SIZE_MB
        return deepcopy(FI_ALLREDUCE_FUSION_MAX_SIZE_MB)

ProExpertProg · 2025-10-17T16:26:44Z

vllm/config/compilation.py

+                4: 32 * MiB,  # 32MB
+                8: 1 * MiB,  # 1MB
+            },
+        }, where key is the device capability"""


Let's set the default dict to FI_ALLREDUCE_FUSION_MAX_SIZE_MB and then in __post_init__ we can do:

self.fi_allreduce_fusion_max_size_mb = {**FI_ALLREDUCE_FUSION_MAX_SIZE_MB, **self.fi_allreduce_fusion_max_size_mb}

cc @hmellor would this work? Or should we just generate this docstring from _FI_ALLREDUCE_MAX_INPUT_SIZES?

As far as I know, docstrings cannot be generated like that
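
The `__post_init__` merge suggested above can be sketched as follows. This is a minimal stand-alone model, not the actual PassConfig: the class name `PassConfigSketch` and the constant `DEFAULT_MAX_SIZE_MB` are placeholders, the keys are assumed to be world sizes, and the two default values (in MB) only mirror the entries visible in the diff.

```python
from dataclasses import dataclass, field

# Placeholder defaults in MB, keyed by world size (assumed); 4 and 8 mirror the diff.
DEFAULT_MAX_SIZE_MB = {4: 32.0, 8: 1.0}


@dataclass
class PassConfigSketch:
    # Users may pass a partial dict; missing keys are filled in __post_init__.
    fi_allreduce_fusion_max_size_mb: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Later entries win, so user-supplied values override the defaults.
        self.fi_allreduce_fusion_max_size_mb = {
            **DEFAULT_MAX_SIZE_MB,
            **self.fi_allreduce_fusion_max_size_mb,
        }


default_cfg = PassConfigSketch()
custom_cfg = PassConfigSketch(fi_allreduce_fusion_max_size_mb={8: 4.0})
print(default_cfg.fi_allreduce_fusion_max_size_mb)
print(custom_cfg.fi_allreduce_fusion_max_size_mb)
```

With this shape a user only needs to list the entries they want to change, and the shared defaults dict is never mutated because the merge builds a new dict.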

ProExpertProg · 2025-10-17T16:34:37Z

vllm/config/compilation.py

+        device_capability = current_platform.get_device_capability(
+        ).as_version_str()
+        fi_allreduce_fusion_max_size_mb = \
+            self.fi_allreduce_fusion_max_size_mb.get(device_capability, {})


I thought the dict was already platform specific?

ProExpertProg · 2025-10-17T16:36:25Z

vllm/model_executor/layers/fused_moe/layer.py

+                assert not isinstance(fused_output, tuple)
            else:
-                shared_output, fused_output = torch.ops.vllm.moe_forward_shared(
+                fused_output = torch.ops.vllm.moe_forward(


Is there a reason we're changing moe_forward_shared to moe_forward?

It's in the branch where self.shared_experts is None.

ProExpertProg · 2025-10-17T16:37:45Z

vllm/model_executor/layers/fused_moe/layer.py

+                states = self.maybe_all_reduce_tensor_model_parallel(states)
+            return states
+
+        if self.shared_experts is not None:


Why invert the logic? It makes the diff harder to parse (or is this because it got inverted in main?).

If so, could you restore it so it's easier to read?

We use the same order of logic as in the forward_impl custom op from which we move the reduction.

ProExpertProg · 2025-10-17T16:40:47Z

vllm/model_executor/layers/fused_moe/layer.py

                )
-            return fused_output[..., :og_hidden_states]
+            return (
+                reduce_output(shared_output[..., :og_hidden_states], do_combine=False),


Where does this slice come from?

Apparently, moe_forward can return a larger tensor than expected, probably because of padding.

I think this is where the padding is added:

vllm/vllm/model_executor/layers/fused_moe/layer.py

Lines 2119 to 2131 in 6c728f7

    def forward_native(
        self,
        hidden_states: torch.Tensor,
        router_logits: torch.Tensor,
    ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
        og_hidden_states = hidden_states.shape[-1]
        if self.hidden_size != og_hidden_states:
            hidden_states = F.pad(
                hidden_states,
                (0, self.hidden_size - og_hidden_states),
                mode="constant",
                value=0.0,
            )
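
To illustrate why the `[..., :og_hidden_states]` slice is needed after this padding, here is a pure-Python sketch that uses lists instead of tensors. The helper names and the doubling step that stands in for the expert computation are hypothetical; only the pad-then-slice shape logic mirrors the snippet above.

```python
def pad_last_dim(row: list, target: int, value: float = 0.0) -> list:
    """Right-pad a 1-D row with `value` up to `target` elements."""
    return row + [value] * (target - len(row))


def forward_sketch(row: list, hidden_size: int) -> tuple:
    """Return the padded-width output and the output sliced to the original width."""
    og_hidden_states = len(row)
    padded = pad_last_dim(row, hidden_size)
    # Stand-in for the MoE computation: the output keeps the padded width.
    out = [2.0 * x for x in padded]
    return out, out[:og_hidden_states]


full, trimmed = forward_sketch([1.0, 2.0, 3.0], hidden_size=4)
print(len(full), trimmed)
```

The unsliced output has the padded width, so the caller has to slice back to the original width before reducing or returning it.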

Signed-off-by: ilmarkov <[email protected]>

hmellor · 2025-10-21T14:18:48Z

vllm/config/compilation.py

+
+    @staticmethod
+    def default_fi_allreduce_fusion_max_size_mb() -> dict[int, float]:
+        from vllm.compilation.collective_fusion import FI_ALLREDUCE_FUSION_MAX_SIZE_MB


The docs build is failing because this import now happens when running --help, and vllm.compilation.collective_fusion pulls in a lot more heavy imports.

bnellnm · 2025-10-21T15:16:50Z

vllm/model_executor/layers/fused_moe/layer.py

+                and (self.tp_size > 1 or self.ep_size > 1)
+            ):
+                states = self.maybe_all_reduce_tensor_model_parallel(states)
+            return states


Maybe we should move the naive dispatch call out to this level also.

Also, the original callsites for naive dispatch/combine are inside a sequence parallel context. I'm not sure if that is going to cause problems.

ilmarkov requested review from ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, simon-mo, tlrmchlsmth, yewentao256, youkaichao and zou3519 as code owners September 4, 2025 12:23

mergify bot added the performance Performance-related issues label Sep 4, 2025

mergify bot added the needs-rebase label Sep 4, 2025

ilmarkov mentioned this pull request Sep 4, 2025

[PERF] Allreduce Fusion tuning and compile_ranges introduction #22086

Closed

gemini-code-assist bot reviewed Sep 4, 2025

nvpohanh reviewed Sep 5, 2025

nvpohanh reviewed Sep 5, 2025

ilmarkov force-pushed the imarkov/fused_allreduce_torch_native branch from e808818 to 61ebc95 Compare September 8, 2025 12:02

mergify bot removed the needs-rebase label Sep 8, 2025

mergify bot added the needs-rebase label Sep 10, 2025

ProExpertProg reviewed Sep 12, 2025

ProExpertProg mentioned this pull request Sep 25, 2025

[Feature]: Enabling performance optimizations by default #25689

Open

11 tasks

ProExpertProg added this to the vllm==v0.12.0/torch==2.9.0 compilation improvements milestone Sep 25, 2025

ilmarkov marked this pull request as draft October 9, 2025 15:03

mergify bot removed the needs-rebase label Oct 15, 2025

ilmarkov force-pushed the imarkov/fused_allreduce_torch_native branch from 845b50b to 7088940 Compare October 15, 2025 19:52

ProExpertProg and others added 16 commits October 15, 2025 18:43

Fix tests, PR feedback

876ef22

Signed-off-by: Luka Govedič <[email protected]>

Break up B200 tests, move allreduce to H200

e99a759

Signed-off-by: Luka Govedič <[email protected]>

Merge branch 'main' into luka/custom-op-matching-2

a226864

Signed-off-by: Luka Govedič <[email protected]>

Fix attention fusion test numerics

ae581e1

Signed-off-by: Luka Govedič <[email protected]>

Remove inductor graph partition from unit test (included in e2e tests)

c03b29b

Signed-off-by: Luka Govedič <[email protected]>

Relax tolerance for L40 fusion test

d2e0489

Signed-off-by: Luka Govedič <[email protected]>

Merge branch 'main' into luka/custom-op-matching-2

65ef5fd

Fix NamedTuple

d4fe977

Signed-off-by: Luka Govedič <[email protected]>

Update test durations

6319e39

Signed-off-by: Luka Govedič <[email protected]>

More tweaking of precision

e34d36d

Signed-off-by: Luka Govedič <[email protected]>

Split original pr

f72ee43

Signed-off-by: ilmarkov <[email protected]>

Update bench

c4c0215

Signed-off-by: ilmarkov <[email protected]>

Update threshold configuration

309d79e

Signed-off-by: ilmarkov <[email protected]>

Move all_reduce from custom op in fused_moe

afcfd73

Signed-off-by: ilmarkov <[email protected]>

Linter fixes

0248dcd

Signed-off-by: ilmarkov <[email protected]>

Upd

18e4771

Signed-off-by: ilmarkov <[email protected]>

mergify bot added the needs-rebase label Oct 16, 2025

ilmarkov marked this pull request as ready for review October 16, 2025 20:15

chatgpt-codex-connector bot reviewed Oct 16, 2025

ProExpertProg reviewed Oct 17, 2025

ilmarkov added 2 commits October 21, 2025 12:56

Merge branch 'main' into imarkov/fused_allreduce_torch_native

1debd8e

Signed-off-by: ilmarkov <[email protected]>

Upd after review

9516d2b

Signed-off-by: ilmarkov <[email protected]>

ilmarkov force-pushed the imarkov/fused_allreduce_torch_native branch from 7088940 to 9516d2b Compare October 21, 2025 13:41

ilmarkov requested a review from jeejeelee as a code owner October 21, 2025 13:41

mergify bot removed the needs-rebase label Oct 21, 2025

hmellor reviewed Oct 21, 2025

bnellnm reviewed Oct 21, 2025

