
Conversation

@piood
Contributor

@piood piood commented Oct 9, 2025

Purpose

Fix incorrect spatial positioning output in the Molmo multimodal model, as reported in issue #26451.

The root cause is that PR #12966 removed the image_input_idx field during the TensorSchema migration, which eliminated the critical patch reordering logic. This caused image patches to remain in crop processing order instead of proper spatial order, leading to incorrect coordinate predictions for visual grounding tasks.

Background

Molmo processes high-resolution images by splitting them into multiple crops (tiles) that are processed separately. The original implementation used image_input_idx to maintain spatial relationships between patches:

  1. Image tiling: High-res images are divided into overlapping crops (e.g., 4×3 grid = 12 crops + 1 global thumbnail)
  2. Patch processing: Each crop generates patches processed by the vision backbone
  3. Spatial reordering: image_input_idx maps patches back to their correct spatial positions in the final sequence

Without this reordering, patches remain in crop processing order, breaking spatial continuity and causing coordinate prediction errors.
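The reordering step can be sketched in isolation (a minimal illustration with made-up tensors, not the actual vLLM code; the tensor names and the `-1` padding convention are assumptions for this sketch):

```python
import torch

# Suppose the vision backbone emits patch features in crop processing
# order, and image_input_idx gives each patch's target position in the
# final sequence (-1 marks padding patches with no image content).
patch_features = torch.tensor([[10.0], [20.0], [30.0], [40.0]])
image_input_idx = torch.tensor([2, 0, -1, 1])

# Drop padding patches, then sort the survivors by target position.
valid = image_input_idx >= 0
valid_features = patch_features[valid]
valid_indices = image_input_idx[valid]
order = torch.argsort(valid_indices)
spatially_ordered = valid_features[order]

print(spatially_ordered.squeeze(-1))  # positions 0, 1, 2 -> 20., 40., 10.
```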

Changes Made

  1. Restored image_input_idx field to MolmoImageInputs TensorSchema as an optional field
  2. Updated multimodal field configuration to include image_input_idx in the processing pipeline
  3. Reimplemented patch reordering logic in _process_image_input() method
  4. Added fallback behavior when image_input_idx is unavailable to maintain backward compatibility

Test Plan

Test Script:

from vllm import LLM
from vllm.sampling_params import SamplingParams
import requests
from PIL import Image

model = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype='bfloat16',
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(max_tokens=64, temperature=0)

image_url = "https://www.visitscotland.com/binaries/content/gallery/visitscotland/cms-images/2022/06/24/clashnessie-bay-car-road"
image = Image.open(requests.get(image_url, stream=True).raw)

inputs = [{
    "prompt": "Point to the car.",
    "multi_modal_data": {"image": image},
}]

outputs = model.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Test Results

Before fix (incorrect coordinates):

<point x="84.0" y="51.5" alt="car">car</point>

After fix (correct coordinates):

<point x="69.0" y="48.3" alt="car">car</point>

The fixed coordinates x="69.0" y="48.3" now align with the expected output from the reference Transformers implementation, representing the correct spatial location of the car in the image.

Verification

  • ✅ Coordinates now match reference implementation (HuggingFace Transformers)
  • ✅ Backward compatibility maintained via optional image_input_idx field
  • ✅ No breaking changes to existing API
  • ✅ Performance impact minimal (only applies when image_input_idx is available)

Related Issues


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions

github-actions bot commented Oct 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a spatial positioning bug in the Molmo model by reintroducing image_input_idx and the patch reordering logic. The changes are well-structured and include a fallback for backward compatibility. I have one suggestion to improve the efficiency and readability of the new patch reordering implementation.

Comment on lines +1566 to +1584
# Create output tensor with spatial ordering
max_idx = valid_indices.max().item()
output_features = torch.zeros(
    max_idx + 1,
    image_features.shape[-1],
    dtype=image_features.dtype,
    device=image_features.device,
)

# Place features at their spatial positions
output_features[valid_indices] = valid_features

# Return only the positions that have features
occupied_positions = torch.zeros(
    max_idx + 1, dtype=torch.bool, device=image_features.device
)
occupied_positions[valid_indices] = True

return output_features[occupied_positions]

Severity: high

The current implementation for reordering patches is a bit complex and potentially inefficient. It creates two large intermediate tensors (output_features and occupied_positions) with a size determined by max_idx, which could be large and consume unnecessary memory. A more direct and memory-efficient approach is to sort the valid features directly using their spatial indices. This improves both readability and performance.

        # Spatially reorder the valid features based on their indices.
        # Sorting is more direct and memory-efficient than creating large
        # intermediate tensors for scattering and then gathering.
        _, sort_indices = torch.sort(valid_indices)
        return valid_features[sort_indices]
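For non-negative, unique indices the two approaches produce identical output; a quick standalone check (the tensors here are made up for illustration):

```python
import torch

valid_indices = torch.tensor([5, 0, 3])
valid_features = torch.tensor([[1.0], [2.0], [3.0]])

# Scatter-then-gather, as in the original diff
max_idx = valid_indices.max().item()
out = torch.zeros(max_idx + 1, 1)
out[valid_indices] = valid_features
occupied = torch.zeros(max_idx + 1, dtype=torch.bool)
occupied[valid_indices] = True
scatter_result = out[occupied]

# Sort-based, as in the reviewer's suggestion
_, sort_indices = torch.sort(valid_indices)
sort_result = valid_features[sort_indices]

assert torch.equal(scatter_result, sort_result)
```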

Contributor

@wwl2755 wwl2755 left a comment


Thanks for the fix! I've left several comments.


# NOTE: In v1, inputs_embeds is always generated at model runner, this
# condition is for v0 compatibility.
elif inputs_embeds is None:

I believe we don't need to maintain the code for v0 now.

height=target_height,
num_images=num_images,
overrides=image_overrides,
width=target_width, height=target_height, num_images=num_images

Why do you want to change this part? I think it should be irrelevant.

@DarkLight1337
Member

Closing as superseded by #26563

Development

Successfully merging this pull request may close these issues.

[Bug]: Molmo produces incorrect outputs
