Enable RMSNorm substitution for Transformers backend
#26353
Conversation
This change should enable quant fusions which depend on the `RMSNorm` op being present.
Signed-off-by: Harry Mellor <[email protected]>
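For context, the substitution broadly means swapping the Hugging Face model's `*RMSNorm` submodules for vLLM's `RMSNorm` layer so that compile/fusion passes can match on the custom op. The sketch below is purely illustrative of that idea; the traversal helper and its heuristics are hypothetical, not the PR's actual code, and it assumes vLLM's `RMSNorm` from `vllm.model_executor.layers.layernorm`:

```python
from torch import nn

from vllm.model_executor.layers.layernorm import RMSNorm  # vLLM's custom-op RMSNorm


def substitute_rmsnorm(model: nn.Module) -> None:
    """Hypothetical sketch: replace HF-style RMSNorm submodules with vLLM's RMSNorm."""
    for name, module in model.named_children():
        # Heuristic: HF RMSNorm variants typically end with "RMSNorm" and carry a weight.
        if type(module).__name__.endswith("RMSNorm") and hasattr(module, "weight"):
            eps = getattr(module, "variance_epsilon", 1e-6)
            new_norm = RMSNorm(module.weight.numel(), eps=eps)
            # Copy the learned scale so numerics are preserved.
            new_norm.weight.data.copy_(module.weight.data)
            setattr(model, name, new_norm)
        else:
            substitute_rmsnorm(module)  # recurse into child modules
```

The real backend also has to decide whether the replacement should be `RMSNorm` or `GemmaRMSNorm`, which is what the weight check discussed below is for.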
Code Review
I've reviewed your changes to enable `RMSNorm` substitution. The approach to differentiate between `RMSNorm` and `GemmaRMSNorm` is clever, but I've identified a critical issue with handling weightless norms and a potential robustness improvement. Please see my detailed comments below.
💡 Codex Review
Here are some automated review suggestions for this pull request.
    if weight_test is not None and torch.all(weight_test == 0):
        return GemmaRMSNorm(**kwargs)
Can we simply check for the existence of the `_norm` method to identify `GemmaRMSNorm`?
https://github.com/huggingface/transformers/blob/50090c3fc82e1e0a06b4da366ea2fb6055d529e9/src/transformers/models/gemma3n/modeling_gemma3n.py#L123-L124
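In code, the reviewer's suggestion would amount to something like the following hypothetical check (not code from the PR):

```python
def looks_like_gemma_rmsnorm(module) -> bool:
    # Gemma-family RMSNorm implementations in Transformers define a private
    # `_norm` helper; the suggestion is to key the substitution off its presence.
    return hasattr(module, "_norm")
```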
That would work for Gemma models as they are currently implemented in Transformers. However:
- Custom models may not use this pattern
- `_norm` is a private method and so may change under us
- A counter-example would be Moshi, which implements `_norm` but does `x * w` instead of `x * (1 + w)`
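To make the distinction concrete: a standard RMSNorm initializes its weight to ones and scales the normalized input by `w`, while Gemma-style RMSNorm initializes its weight to zeros and scales by `(1 + w)`, which is why an all-zeros weight is used as the signal in the diff above. A small illustrative sketch of the two variants (not code from either library):

```python
import torch


def standard_rmsnorm(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Standard RMSNorm: weight initialized to ones, output scaled by w.
    x_normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x_normed * w


def gemma_rmsnorm(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Gemma-style RMSNorm: weight initialized to zeros, output scaled by (1 + w).
    x_normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x_normed * (1.0 + w)


# At initialization, a Gemma-style weight is all zeros, so the check
# `weight_test is not None and torch.all(weight_test == 0)` picks GemmaRMSNorm,
# while any other weight falls through to the standard RMSNorm replacement.
```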
Isotr0py left a comment
LGTM
…to loader * 'loader' of https://github.com/dsxsteven/vllm_splitPR: (778 commits)
- [torchao] Add support for ModuleFqnToConfig using regex (vllm-project#26001)
- Add: Support for multiple hidden layers in Eagle3 (vllm-project#26164)
- Enable `RMSNorm` substitution for Transformers backend (vllm-project#26353)
- [Model] Gemma3: Fix GGUF loading and quantization (vllm-project#26189)
- Bump Flashinfer to v0.4.0 (vllm-project#26326)
- Update Dockerfile and install runai-model-streamer[gcs] package (vllm-project#26464)
- [Core] Relax the LoRA max rank (vllm-project#26461)
- [CI/Build] Fix model nightly tests (vllm-project#26466)
- [Hybrid]: Decouple Kernel Block Size from KV Page Size (vllm-project#24486)
- [Core][KVConnector] Propagate all tokens on resumed preemptions (vllm-project#24926)
- [MM][Doc] Add documentation for configurable mm profiling (vllm-project#26200)
- [Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439)
- [Bugfix] Incorrect another MM data format in vllm bench throughput (vllm-project#26462)
- [Bugfix] Catch and log invalid token ids in detokenizer #2 (vllm-project#26445)
- [Minor] Change warning->warning_once in preprocess (vllm-project#26455)
- [Bugfix] Set the minimum python version for gpt-oss (vllm-project#26392)
- [Misc] Redact ray runtime env before logging (vllm-project#26302)
- Separate MLAAttention class from Attention (vllm-project#25103)
- [Attention] Register FLASHMLA_SPARSE (vllm-project#26441)
- [Kernels] Modular kernel refactor (vllm-project#24812)
- ...