[Model] Revert PR #26715: Restore custom PaliGemma and Gemma3-MM impl… #27309
Conversation
…ementations

This reverts commit 8c017b3.

Reason for revert: PR #26715 breaks vLLM TPU support for PaliGemma and Gemma3-MM models. The Transformers backend approach is incompatible with TPU deployments because vLLM's TPU backend requires torchax integration, which cannot interoperate with HuggingFace Transformers' model implementations.

Technical details:
- vLLM TPU support relies on torchax (PyTorch/XLA integration)
- torchax requires direct control over model operations and kernels
- The HF Transformers backend abstracts away this control layer
- This abstraction prevents torchax from properly optimizing TPU execution

Impact: TPU inference is completely broken (fails to initialize).

This revert restores the custom vLLM implementations that are compatible with torchax, enabling TPU deployments while we work on a proper solution for the attention mask issue.

Signed-off-by: Luciano Martins <[email protected]>
Documentation preview: https://vllm--27309.org.readthedocs.build/en/27309/
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if config.text_config.model_type == "gemma":
    config.text_config.architectures = ["GemmaForCausalLM"]
else:
    config.text_config.architectures = ["Gemma2ForCausalLM"]
self.language_model = init_vllm_registered_model(
    vllm_config=vllm_config,
    hf_config=config.text_config,
    prefix=maybe_prefix(prefix, "language_model"),
)
logit_scale = getattr(config, "logit_scale", 1.0)
self.language_model.logits_processor.scale *= logit_scale

self.make_empty_intermediate_tensors = (
```
Guard Paligemma logits processor access
Paligemma initialisation unconditionally multiplies self.language_model.logits_processor.scale by the model's logit_scale. In scenarios where init_vllm_registered_model returns a pooling variant (e.g. when auto_convert_to_pooling is enabled), the returned language model does not expose a logits_processor attribute. The comparable Gemma3 implementation guards this attribute before touching it, but Paligemma will raise an AttributeError at load time and the model cannot be constructed. Please add a hasattr(...) check (or skip the scaling when the logits processor is missing) so pooling conversions still work.
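A minimal sketch of the suggested guard, placed inside the PaliGemma `__init__` shown above (the "pooling variant" behavior is as described by the review, not verified here):

```python
logit_scale = getattr(config, "logit_scale", 1.0)
# Pooling conversions may return a language model without a logits_processor,
# so only apply the scale when it exists (mirrors the Gemma3 guard).
if hasattr(self.language_model, "logits_processor"):
    self.language_model.logits_processor.scale *= logit_scale
```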
This is not really how the Transformers backend works.
I'd like to block this until we actually understand what the problem is.
What it actually looks like is something like this:

```python
# in vllm
class TransformersForMultimodalLM(nn.Module):
    def __init__(self, ...):
        self.model = AutoModel.from_config(...)
        # some modules in self.model are then substituted for vLLM modules

    def forward(self, ...):
        x = self.model(input_ids)
        return self.lm_head(x)
```

```python
# in transformers
class Gemma3Model(nn.Module):
    def forward(self, ...):
        x = self.embed(input_ids)
        x = self.layers(x)
        return x
```

There is no abstraction layer during the forward pass, so I'm curious to know why…
Let's get some traces from the failed compilation, but my two cents is that the model definition uses conditionals and/or ops that hinder compilation.
As mentioned offline, the models actually didn't work correctly even prior to #26715 because of the incorrect attention mask. If you think it's still important to support these models (despite the correctness issues) for the upcoming release, I'm ok with reverting it.
@hmellor and @NickLucche: Doing more tests, I think the issue is related to tensor shape incompatibility between HuggingFace Transformers and vLLM-TPU's einsum operations, not (necessarily) a compilation problem.

Actual Error Trace (TPU Environment)

When running Gemma3-MM with the PR #26715 changes on TPU v6e-4:

Error location: tpu_inference/layers/vllm/quantization/unquantized.py:129 (the jnp.einsum call)
Expected vs Actual: the einsum expects a 2D (m, n) input but receives a 3D (batch, seq_len, hidden) tensor.

Tensor Shape Incompatibility

Why the Transformers Backend Outputs 3D Tensors

In HuggingFace Transformers (which PR #26715 uses):

```python
# transformers/models/gemma3/modeling_gemma3.py:308
class Gemma3Attention(nn.Module):
    def forward(self, hidden_states, ...):
        # hidden_states: (batch_size, seq_len, hidden_dim) - 3D
        query_states = self.q_proj(hidden_states)  # still 3D
        # PyTorch's nn.Linear handles 3D inputs automatically
```

In the vLLM-TPU einsum layer:

```python
# tpu_inference/layers/vllm/quantization/unquantized.py:129
def apply(self, x, weight):
    outs = jnp.einsum("mn,pn->mp", x_jax, weight_jax)
    # Expects x_jax: 2D (m, n)
    # Receives: 3D (batch, seq_len, hidden)
    # ValueError: wrong number of indices
```
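For reference, the mismatch can be reproduced in isolation. This is a minimal sketch with illustrative shapes; it only requires jax and is not the actual tpu_inference layer:

```python
import jax.numpy as jnp

w = jnp.zeros((3, 8))        # (p, n) weight, as in the unquantized linear layer

x_2d = jnp.zeros((4, 8))     # (m, n): the 2D input the einsum expects
print(jnp.einsum("mn,pn->mp", x_2d, w).shape)   # (4, 3) -- works

x_3d = jnp.zeros((2, 4, 8))  # (batch, seq_len, hidden): what HF attention passes
try:
    jnp.einsum("mn,pn->mp", x_3d, w)
except ValueError as e:
    print(e)  # subscript 'mn' does not cover the 3 dimensions of operand 0
```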
Why the Custom vLLM Implementation Worked

In the vLLM-TPU branch:

```python
class PaliGemmaForConditionalGeneration(nn.Module):
    def __init__(self, ...):
        # Uses custom vLLM models, not HF Transformers
        self.vision_tower = SiglipVisionModel(...)              # vLLM custom
        self.language_model = init_vllm_registered_model(...)   # vLLM custom
```

These custom vLLM implementations contain tensor reshaping logic compatible with TPU einsum operations.

In PR #26715:

```python
class MultiModalMixin:
    def get_multimodal_embeddings(self, **kwargs):
        vision_embeddings = self.model.get_image_features(pixel_values, **kwargs)
        # Calls HuggingFace Transformers directly
        # Returns standard PyTorch 3D tensors
        # vLLM-TPU einsum cannot handle these
```

Code Path Comparison

Working Path (Custom vLLM Implementation)

Input → Custom vLLM Model → Tensor Reshaping → 2D Tensor → vLLM-TPU einsum → Success

Example from the TPU branch:

```python
# Custom vLLM implementation handles reshaping
hidden_states = language_model.forward(...)
# Shape management for TPU compatibility happens here
```

Broken Path (PR #26715 Transformers Backend)

Input → HF Transformers Model → 3D Tensor → vLLM-TPU einsum → ValueError

From the error trace:

```python
# HF Transformers outputs 3D
# transformers/.../modeling_gemma3.py:308
query_states = self.q_proj(hidden_states)  # 3D output

# vLLM-TPU expects 2D
# tpu_inference/.../unquantized.py:129
outs = jnp.einsum("mn,pn->mp", x_jax, weight_jax)  # expects 2D
# ValueError: Einstein sum subscript 'mn' does not contain
# the correct number of indices for operand 0
```

Testing Methodology

I manually applied the PR #26715 changes to a clean vLLM-TPU installation:

```bash
# Setup
pip install vllm-tpu==0.11.1       # clean TPU installation
cd /tmp/vllm/vllm-pr26715          # PR #26715 checkout at commit 8c017b349

# Manual patch application for all changed files, e.g.:
mkdir -p /path/to/vllm-tpu/vllm/model_executor/models/transformers
cp vllm/model_executor/models/transformers/* \
   /path/to/vllm-tpu/vllm/model_executor/models/transformers/

# Test
python test_gemma3mm.py
# Result: ValueError at tpu_inference/.../unquantized.py:129
```

I suggest reverting PR #26715, and I volunteer to help rebuild it in a way that avoids these issues for vLLM-TPU.

Evidence Summary
@DarkLight1337 - Actually they do work (or did work before PR #26715). I double checked on my v6e-4 + vLLM 0.11.1 environment. Both text and multimodal inference worked fine (tested gemma3-4b-it, gemma3-12b-it and gemma3-27b-it).
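A text-only smoke test of that kind might look like the following sketch (hypothetical, not the author's actual script; it assumes a TPU host with vllm-tpu installed and access to the gated Gemma 3 weights):

```python
from vllm import LLM, SamplingParams

# Load the model; max_model_len kept small for a quick sanity check.
llm = LLM(model="google/gemma-3-4b-it", max_model_len=4096)
params = SamplingParams(max_tokens=64)

# A text-only request exercises the language-model path end to end;
# a multimodal check would additionally pass image inputs.
outputs = llm.generate(["Explain what a TPU is in one sentence."], params)
print(outputs[0].outputs[0].text)
```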
This additional information is helpful, but I don't think it gets at the actual issue. Main issues with the analysis:
By "working fine", do you only mean they can run successfully without failures? Or have you run a benchmark like lm-eval to check the accuracy? |
It appears that the issue is actually a hard-coded 2D operation in tpu_inference (the jnp.einsum at unquantized.py:129). Could we instead make a fix there so that the batch dimension doesn't cause errors?
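One way such a fix could look, as a sketch only (the real tpu_inference apply() takes more arguments and handles quantization, so the name and signature here are illustrative):

```python
import jax.numpy as jnp

def apply_linear(x, weight):
    # Collapse any leading dims (batch, seq_len, ...) into one token dim,
    # run the existing 2D einsum, then restore the original leading shape.
    leading_shape = x.shape[:-1]
    x_2d = x.reshape(-1, x.shape[-1])               # (m, n)
    out_2d = jnp.einsum("mn,pn->mp", x_2d, weight)  # (m, p)
    return out_2d.reshape(*leading_shape, -1)
```

Flattening the leading dimensions keeps the einsum itself 2D, so both the custom vLLM path and the 3D activations coming from the Transformers backend would be accepted.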
Purpose
This PR reverts #26715 to restore vLLM TPU support for PaliGemma and Gemma3-MM models.
Background
PR #26715 removed the custom vLLM implementations for PaliGemma and Gemma3-MM, forcing these models onto the HuggingFace Transformers backend because of their special attention mask requirements. This completely broke TPU support.
Technical Root Cause
The Transformers backend is fundamentally incompatible with vLLM's TPU execution path:
vLLM TPU Architecture Requirements
Why Transformers Backend Breaks TPU
Concrete Impact
Motivation for Revert
Test Plan
- ruff check .
- mypy vllm

Test Results
Before Revert (with PR #26715)
After Revert (this PR)
Technical Verification
Root cause in code:
Why custom vLLM implementation works:
Checklist
[Model] Revert PR #26715: ...

Related Issues
cc @DarkLight1337 @hmellor @NickLucche