[Model] Add MoE support for NemotronH #25863
Conversation
…edReLu activation - adapt the FusedMoE object to support is_act_and_mul=False Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
…s an attribute in FusedMoE Signed-off-by: Tomer Asida <[email protected]>
Code Review
This pull request adds support for a non-gated Squared ReLU MoE module in the NemotronH architecture, which is a valuable enhancement. The changes are mostly well-implemented across the fused MoE layers and model definition. However, I've identified a critical bug in the forward pass of the new NemotronHMoE module related to incorrect floating-point computation and a potential UnboundLocalError. I've provided a detailed comment with a suggested fix for this issue. Addressing this is crucial for the correctness of the model's output.
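For context, the kind of issue being flagged here usually comes from a conditionally assigned variable plus a dtype/precision mix-up in the MoE forward pass. Below is a minimal illustrative sketch of a defensive pattern, with assumed attribute names (gate, experts, shared_experts); it is not the actual NemotronHMoE code from this PR:

```python
import torch

def moe_forward(layer, hidden_states: torch.Tensor) -> torch.Tensor:
    # Illustrative only: guard the optional shared-expert output so it is never
    # referenced while unbound, and cast the result back to the input dtype.
    orig_dtype = hidden_states.dtype
    shared_output = None
    if getattr(layer, "shared_experts", None) is not None:
        shared_output = layer.shared_experts(hidden_states)
    router_logits, _ = layer.gate(hidden_states)
    out = layer.experts(hidden_states, router_logits)
    if shared_output is not None:
        out = out + shared_output
    return out.to(orig_dtype)
```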
To the reviewer(s)
NemotronHForCausalLM now optionally has an MoE block. I was wondering if it should implement the MixtureOfExperts interface or not. Do you have any guidance?
We might need to do something similar to this PR #25311 (comment), where `is_mixture_of_experts` depends on an attribute of the model. I don't know all the cases where this is used, though.
Thanks for the input. Done
Re your comment in the other PR: I think checking whether `getattr(model, "num_moe_layers", 0) > 0` in `is_mixture_of_experts` makes sense, since all models implementing `MixtureOfExperts` are expected to initialize `num_moe_layers` as part of the interface. So I don't think it is too fragile.
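A minimal sketch of the check described above, assuming `MixtureOfExperts` is the interface in `vllm.model_executor.models.interfaces` referenced in this thread (the exact wiring in the PR may differ):

```python
from vllm.model_executor.models.interfaces import MixtureOfExperts

def is_mixture_of_experts(model) -> bool:
    # Models implementing the MixtureOfExperts interface are expected to set
    # num_moe_layers, so only report True when the model actually contains
    # MoE layers (e.g. a dense NemotronH config should return False).
    return isinstance(model, MixtureOfExperts) and getattr(model, "num_moe_layers", 0) > 0
```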
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
…xperts Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Force-pushed from bf2285e to 7fff9a8
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
```python
if not self.moe_config.is_act_and_mul:
    # Avoid circular import
    from vllm.model_executor.layers.quantization.modelopt import (
        ModelOptFp8MoEMethod,
    )

    if not isinstance(
        quant_method, (UnquantizedFusedMoEMethod, ModelOptFp8MoEMethod)
    ):
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for unquantized "
            "and ModelOpt FP8 moe for now"
        )
    if not current_platform.is_cuda():
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for CUDA for now"
        )
```
What are the blockers for supporting is_act_and_mul = False more generally?
Creating the relevant kernels :) We plan to follow up with that
```python
if (
    envs.VLLM_USE_FLASHINFER_MOE_FP8
    and has_flashinfer_moe()
    and self.moe.is_act_and_mul
):
```
For NemotronH, self.flashinfer_moe_backend will end up being None. What implementation ends up getting used in this case?
Triton kernels. This is currently the only code path available with `is_act_and_mul=False`.
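For illustration, constructing a non-gated MoE layer through `FusedMoE` might look roughly like the sketch below. The config attribute names and the activation string are assumptions, and the exact constructor arguments may differ from what this PR actually adds:

```python
from vllm.model_executor.layers.fused_moe import FusedMoE

def build_nongated_moe(config, quant_config, prefix: str) -> FusedMoE:
    # Sketch only: attribute names on `config` are assumed for illustration.
    return FusedMoE(
        num_experts=config.n_routed_experts,
        top_k=config.moe_top_k,
        hidden_size=config.hidden_size,
        intermediate_size=config.moe_intermediate_size,
        reduce_results=True,
        renormalize=True,
        quant_config=quant_config,
        activation="relu2",      # assumed name for squared ReLU
        is_act_and_mul=False,    # single up-projection, no act-and-mul gating
        prefix=f"{prefix}.experts",
    )
```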
I suspect this is going to be very complicated to add to all the quant and kernel backends
Agreed. We can follow up on this discussion internally.
```python
            num_redundant_experts=self.n_redundant_experts,
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
```
For DP+TP cases, we should use the sequence parallel trick like in #24982 to avoid duplicate work in the expert layers
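Roughly, the idea is to shard the token dimension across the TP group before the expert layer and all-gather the results afterwards, so the TP-duplicated copies of the batch are not all pushed through the experts. A generic illustrative sketch in plain torch.distributed (not the vLLM helpers used in #24982):

```python
import torch
import torch.distributed as dist

def sequence_parallel_moe(hidden_states: torch.Tensor, moe_layer, tp_group) -> torch.Tensor:
    # Illustrative only: each rank runs the MoE on its shard of the tokens.
    tp_size = dist.get_world_size(tp_group)
    rank = dist.get_rank(tp_group)
    # Assume the token count divides evenly; real code would pad instead.
    assert hidden_states.shape[0] % tp_size == 0
    shard = hidden_states.chunk(tp_size, dim=0)[rank]
    out = moe_layer(shard)
    gathered = [torch.empty_like(out) for _ in range(tp_size)]
    dist.all_gather(gathered, out, group=tp_group)
    return torch.cat(gathered, dim=0)
```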
Done. Thanks for the pointer :)
…nd returns True on is_mixture_of_experts(model) only if it actually has moe layers. Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]> Signed-off-by: Alberto Perdomo <[email protected]>
…o step_forward

* 'step_forward' of https://github.com/raindaywhu/vllm: (148 commits)
  - [Model] Add MoE support for NemotronH (vllm-project#25863)
  - [Metrics] [KVConnector] Add connector prefix cache hit rate stats (vllm-project#26245)
  - [CI] Reorganize entrypoints tests (vllm-project#27403)
  - add SLA information into comparison graph for vLLM Benchmark Suite (vllm-project#25525)
  - [CI/Build] Fix AMD CI: test_cpu_gpu.py (vllm-project#27388)
  - [Bugfix] Fix args settings for guided decoding args (vllm-project#27375)
  - [CI/Build] Fix Prithvi plugin test (vllm-project#27393)
  - [Chore] Remove duplicate `has_` functions in vllm.utils (vllm-project#27372)
  - [Model] Add num_cached_tokens for PoolingRequestOutput (vllm-project#27378)
  - [V1][spec decode] return logprobs for spec decoding (vllm-project#26060)
  - [CORE] Support Prefix Caching with Prompt Embeds (vllm-project#27219)
  - [Bugfix][Core] running queue index leakage exception (vllm-project#26754)
  - [Bugfix] Fix incorrect kv cache metrics in grafana.json (vllm-project#27133)
  - [Bugfix] Fix SLA tuner initialization (vllm-project#27355)
  - [Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_field_config=True` with tensor schema support (vllm-project#27361)
  - [MLA] Bump FlashMLA (vllm-project#27354)
  - [Chore] Separate out system utilities from vllm.utils (vllm-project#27201)
  - [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (vllm-project#27128)
  - [Feature] publisher default set zmq in kv_event config (vllm-project#26915)
  - [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (vllm-project#27211)
  - ...
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Signed-off-by: Tomer Asida <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Purpose
Add support for an MoE module in the NemotronH architecture.
This MoE module is relatively unique (to the best of my knowledge, comparable only to nomic-ai/nomic-embed-text-v2-moe), as it uses a non-gated Squared ReLU activation function.
In this PR:
- Add a `NemotronHMoE` module to the NemotronH modeling file
- Add support for non-gated MoE (`is_act_and_mul=False`) through the `FusedMoE` class (in addition to by calling the `fused_moe` function directly)
- Add `is_act_and_mul=False` support for the `ModelOptFp8MoEMethod` quant_method, currently only in the triton path
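For reference, the non-gated Squared ReLU expert MLP differs from the usual gated (act-and-mul) pattern roughly as follows (illustrative PyTorch, not code from this PR):

```python
import torch
import torch.nn.functional as F

def squared_relu_mlp(x, w_up, w_down):
    # Non-gated: one up-projection, relu(x)**2 activation, one down-projection.
    return F.relu(x @ w_up).square() @ w_down

def gated_silu_mlp(x, w_gate, w_up, w_down):
    # Gated "act-and-mul" (SwiGLU-style): two up-projections and an
    # elementwise product before the down-projection.
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down
```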