
Conversation

@piood
Contributor

@piood piood commented Oct 9, 2025

Purpose

Fix incorrect spatial positioning output in the Molmo multimodal model, as reported in issue #26451.

The root cause is that PR #12966 removed the image_input_idx field during the TensorSchema migration, which eliminated the critical patch reordering logic. This caused image patches to remain in crop processing order instead of proper spatial order, leading to incorrect coordinate predictions for visual grounding tasks.

Background

Molmo processes high-resolution images by splitting them into multiple crops (tiles) that are processed separately. The original implementation used image_input_idx to maintain spatial relationships between patches:

  1. Image tiling: High-res images are divided into overlapping crops (e.g., 4×3 grid = 12 crops + 1 global thumbnail)
  2. Patch processing: Each crop generates patches processed by the vision backbone
  3. Spatial reordering: image_input_idx maps patches back to their correct spatial positions in the final sequence

Without this reordering, patches remain in crop processing order, breaking spatial continuity and causing coordinate prediction errors.
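The reordering step can be sketched in isolation (a minimal illustration with made-up tensors, not the actual vLLM code; the tensor names and the `-1` padding convention are assumptions for this sketch):

```python
import torch

# Suppose the vision backbone emits patch features in crop processing
# order, and image_input_idx gives each patch's target position in the
# final sequence (-1 marks padding patches with no image content).
patch_features = torch.tensor([[10.0], [20.0], [30.0], [40.0]])
image_input_idx = torch.tensor([2, 0, -1, 1])

# Drop padding patches, then sort the survivors by target position.
valid = image_input_idx >= 0
valid_features = patch_features[valid]
valid_indices = image_input_idx[valid]
order = torch.argsort(valid_indices)
spatially_ordered = valid_features[order]

print(spatially_ordered.squeeze(-1))  # positions 0, 1, 2 -> 20., 40., 10.
```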

Changes Made

  1. Restored image_input_idx field to MolmoImageInputs TensorSchema as an optional field
  2. Updated multimodal field configuration to include image_input_idx in the processing pipeline
  3. Reimplemented patch reordering logic in _process_image_input() method
  4. Added fallback behavior when image_input_idx is unavailable to maintain backward compatibility

Test Plan

Test Script:

from vllm import LLM
from vllm.sampling_params import SamplingParams
import requests
from PIL import Image

model = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype='bfloat16',
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(max_tokens=64, temperature=0)

image_url = "https://www.visitscotland.com/binaries/content/gallery/visitscotland/cms-images/2022/06/24/clashnessie-bay-car-road"
image = Image.open(requests.get(image_url, stream=True).raw)

inputs = [{
    "prompt": "Point to the car.",
    "multi_modal_data": {"image": image},
}]

outputs = model.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Test Results

Before fix (incorrect coordinates):

<point x="84.0" y="51.5" alt="car">car</point>

After fix (correct coordinates):

<point x="69.0" y="48.3" alt="car">car</point>

The fixed coordinates x="69.0" y="48.3" now align with the expected output from the reference Transformers implementation, representing the correct spatial location of the car in the image.

Verification

  • ✅ Coordinates now match reference implementation (HuggingFace Transformers)
  • ✅ Backward compatibility maintained via optional image_input_idx field
  • ✅ No breaking changes to existing API
  • ✅ Performance impact minimal (only applies when image_input_idx is available)

Related Issues


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions

github-actions bot commented Oct 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a spatial positioning bug in the Molmo model by reintroducing image_input_idx and the patch reordering logic. The changes are well-structured and include a fallback for backward compatibility. I have one suggestion to improve the efficiency and readability of the new patch reordering implementation.

Comment on lines +1566 to +1584
# Create output tensor with spatial ordering
max_idx = valid_indices.max().item()
output_features = torch.zeros(
    max_idx + 1,
    image_features.shape[-1],
    dtype=image_features.dtype,
    device=image_features.device,
)

# Place features at their spatial positions
output_features[valid_indices] = valid_features

# Return only the positions that have features
occupied_positions = torch.zeros(
    max_idx + 1, dtype=torch.bool, device=image_features.device
)
occupied_positions[valid_indices] = True

return output_features[occupied_positions]

Severity: high

The current implementation for reordering patches is a bit complex and potentially inefficient. It creates two large intermediate tensors (output_features and occupied_positions) with a size determined by max_idx, which could be large and consume unnecessary memory. A more direct and memory-efficient approach is to sort the valid features directly using their spatial indices. This improves both readability and performance.

        # Spatially reorder the valid features based on their indices.
        # Sorting is more direct and memory-efficient than creating large
        # intermediate tensors for scattering and then gathering.
        _, sort_indices = torch.sort(valid_indices)
        return valid_features[sort_indices]
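For non-negative, unique indices the two approaches produce identical output; a quick standalone check (the tensors here are made up for illustration):

```python
import torch

valid_indices = torch.tensor([5, 0, 3])
valid_features = torch.tensor([[1.0], [2.0], [3.0]])

# Scatter-then-gather, as in the original diff
max_idx = valid_indices.max().item()
out = torch.zeros(max_idx + 1, 1)
out[valid_indices] = valid_features
occupied = torch.zeros(max_idx + 1, dtype=torch.bool)
occupied[valid_indices] = True
scatter_result = out[occupied]

# Sort-based, as in the reviewer's suggestion
_, sort_indices = torch.sort(valid_indices)
sort_result = valid_features[sort_indices]

assert torch.equal(scatter_result, sort_result)
```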

Contributor

@wwl2755 wwl2755 left a comment


Thanks for the fix! I've left several comments.


# NOTE: In v1, inputs_embeds is always generated at model runner, this
# condition is for v0 compatibility.
elif inputs_embeds is None:

I believe we don't need to maintain the code for v0 now.

height=target_height,
num_images=num_images,
overrides=image_overrides,
width=target_width, height=target_height, num_images=num_images

Why do you want to change this part? I think it should be irrelevant.

@DarkLight1337
Member

Closing as superseded by #26563

Development

Successfully merging this pull request may close these issues.

[Bug]: Molmo produces incorrect outputs
