
Conversation

@DarkLight1337 (Member) commented on Jul 3, 2024:

This PR further extends the multi-modal registry so that each model can specify its own maximum number of multi-modal tokens during memory profiling.

This replaces the functionality of the user-provided image_feature_size argument in the vision-language config, which was recently removed by #6089.

@DarkLight1337 requested a review from @ywang96 on July 3, 2024 at 23:21.
```diff
 from vllm.multimodal import MULTIMODAL_REGISTRY

 @MULTIMODAL_REGISTRY.register_image_input_mapper()
+@MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
```
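For context, a minimal sketch of how a model file might stack these decorators (the model class name, the fixed token count, and the calculation function are hypothetical placeholders, not vLLM's actual model code):

```python
import torch.nn as nn

from vllm.multimodal import MULTIMODAL_REGISTRY


def get_max_my_model_image_tokens(ctx):
    # Hypothetical calculation: the largest number of feature tokens a
    # single image can occupy for this model, used during memory profiling.
    return 576


@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_my_model_image_tokens)
class MyModelForCausalLM(nn.Module):
    ...
```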
Contributor commented:
IMO the relationship between register_max_tokens and register_dummy_data is a bit intricate. There needs to be a certain level of consistency between them, which is hard to get right. Should we mention something here?

@DarkLight1337 (Member, Author) replied on Jul 4, 2024:

I currently have a note in register_dummy_data that mentions it should use the max number of tokens from each modality. Is that sufficient?
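The consistency in question, sketched with purely hypothetical names and values (not vLLM's API): if both the max-token calculation and the dummy-data factory derive from the same constant, they cannot drift apart.

```python
# Illustration only: identifiers and values are made up for this sketch.
IMAGE_TOKEN_ID = 32000
MAX_IMAGE_TOKENS = 576


def get_max_image_tokens() -> int:
    # Value reported to the profiler as the per-image worst case.
    return MAX_IMAGE_TOKENS


def dummy_token_ids(seq_len: int) -> list[int]:
    # The dummy sequence embeds exactly the registered maximum number of
    # image placeholder tokens, so profiling exercises the true worst case.
    image_part = [IMAGE_TOKEN_ID] * MAX_IMAGE_TOKENS
    text_part = [0] * max(0, seq_len - MAX_IMAGE_TOKENS)
    return image_part + text_part
```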

@ywang96 (Member) replied:

IMO the two should be tied together for consistency - see my comment below in phi3v.py.

@ywang96 (Member) left a review:
LGTM!

In phi3v.py:

```diff
+        (new_height // 336 + 1) * 12


 def get_max_phi3v_image_tokens(ctx: InputContext):
```
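Reading the hunk in isolation, a quick sanity check of the arithmetic (assuming 336 is the vision encoder's tile size, which is how the constant is usually read for Phi-3-Vision):

```python
new_height = 1344  # an image four 336-pixel tiles tall
tokens = (new_height // 336 + 1) * 12
assert tokens == 60  # (4 + 1) * 12
```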
@ywang96 (Member) commented:
get_max_phi3v_image_tokens and dummy_data_for_phi3v are both based on dummy_height, dummy_width = 8000, 50, so we should make these module-level constants in this file for consistency. I think this will suffice for now, and in the future we can establish a more structured protocol between the multimodal feature size and the dummy data.
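A sketch of the suggested hoist, with assumed constant and helper names (the actual refactor is the one linked below as #6146):

```python
# Module-level constants shared by both functions (names illustrative),
# so the profiling token budget and the dummy data can never disagree.
DUMMY_HEIGHT = 8000
DUMMY_WIDTH = 50


def _phi3v_feature_size(height: int, width: int) -> int:
    # Stand-in for the real calculation; only the tail term from the hunk
    # above is reproduced, purely for illustration.
    return (height // 336 + 1) * 12


def get_max_phi3v_image_tokens_sketch() -> int:
    return _phi3v_feature_size(DUMMY_HEIGHT, DUMMY_WIDTH)


def dummy_image_size_sketch() -> tuple[int, int]:
    # The dummy image for profiling uses exactly the same extreme shape.
    return DUMMY_WIDTH, DUMMY_HEIGHT
```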

@ywang96 (Member) replied on Jul 4, 2024:
I've made #6146 to address this.

@ywang96 merged commit ae96ef8 into vllm-project:main on Jul 4, 2024.
@DarkLight1337 deleted the max-tokens branch on July 5, 2024 at 01:53.
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025