-
-
Notifications
You must be signed in to change notification settings - Fork 11.4k
[Bugfix][Multi Modal] Fix incorrect output in Molmo #26518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: yujiaqi <[email protected]>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly fixes a spatial positioning bug in the Molmo model by reintroducing image_input_idx and the patch reordering logic. The changes are well-structured and include a fallback for backward compatibility. I have one suggestion to improve the efficiency and readability of the new patch reordering implementation.
| # Create output tensor with spatial ordering | ||
| max_idx = valid_indices.max().item() | ||
| output_features = torch.zeros( | ||
| max_idx + 1, | ||
| image_features.shape[-1], | ||
| dtype=image_features.dtype, | ||
| device=image_features.device, | ||
| ) | ||
|
|
||
| # Place features at their spatial positions | ||
| output_features[valid_indices] = valid_features | ||
|
|
||
| # Return only the positions that have features | ||
| occupied_positions = torch.zeros( | ||
| max_idx + 1, dtype=torch.bool, device=image_features.device | ||
| ) | ||
| occupied_positions[valid_indices] = True | ||
|
|
||
| return output_features[occupied_positions] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation for reordering patches is a bit complex and potentially inefficient. It creates two large intermediate tensors (output_features and occupied_positions) with a size determined by max_idx, which could be large and consume unnecessary memory. A more direct and memory-efficient approach is to sort the valid features directly using their spatial indices. This improves both readability and performance.
# Spatially reorder the valid features based on their indices.
# Sorting is more direct and memory-efficient than creating large
# intermediate tensors for scattering and then gathering.
_, sort_indices = torch.sort(valid_indices)
return valid_features[sort_indices]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixing! Left several comments.
|
|
||
| # NOTE: In v1, inputs_embeds is always generated at model runner, this | ||
| # condition is for v0 compatibility. | ||
| elif inputs_embeds is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we don't need to maintain the code for v0 now.
| height=target_height, | ||
| num_images=num_images, | ||
| overrides=image_overrides, | ||
| width=target_width, height=target_height, num_images=num_images |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to change this part? I think it should be irrelevant.
|
Closing as superseded by #26563 |
Purpose
Fix incorrect spatial positioning output in Molmo multimodal model as reported in issue #26451.
The root cause is that PR #12966 removed the
image_input_idxfield during the TensorSchema migration, which eliminated the critical patch reordering logic. This caused image patches to remain in crop processing order instead of proper spatial order, leading to incorrect coordinate predictions for visual grounding tasks.Background
Molmo processes high-resolution images by splitting them into multiple crops (tiles) that are processed separately. The original implementation used
image_input_idxto maintain spatial relationships between patches:image_input_idxmaps patches back to their correct spatial positions in the final sequenceWithout this reordering, patches remain in crop processing order, breaking spatial continuity and causing coordinate prediction errors.
Changes Made
image_input_idxfield toMolmoImageInputsTensorSchema as an optional fieldimage_input_idxin the processing pipeline_process_image_input()methodimage_input_idxis unavailable to maintain backward compatibilityTest Plan
Test Script:
Test Results
Before fix (incorrect coordinates):
After fix (correct coordinates):
The fixed coordinates
x="69.0" y="48.3"now align with the expected output from the reference Transformers implementation, representing the correct spatial location of the car in the image.Verification
image_input_idxfieldimage_input_idxis available)Related Issues
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.