Gemma3 #36658
Conversation
… not sure if RoPE is right.
Co-authored-by: Joshua Lochner <[email protected]>
…arbitrary number of images in prompts
…e starting with BOS
Raushan address PR comments
Co-authored-by: Pedro Cuenca <[email protected]>
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the "Ready for review" button (at the bottom of the PR page).
As reviewed before, LGTM!
| ("fnet", "FNetForPreTraining"), | ||
| ("fsmt", "FSMTForConditionalGeneration"), | ||
| ("funnel", "FunnelForPreTraining"), | ||
| ("gemma3", "Gemma3ForConditionalGeneration"), |
I think this should be included in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES as well? I can't load the multimodal variant using AutoModelForVision2Seq, which works for most multimodal models.
VLMs should be loaded with AutoModelForImageTextToText, which is a new mapping we added for multimodal models. The old AutoModelForVision2Seq is meant only for models like BLIP, which take bare images without instructions and caption them.
Since we didn't previously have a dedicated mapping for VLMs, everything got dumped into Vision2Seq, sorry if that was confusing. All new releases will come under ImageTextToText, and all older models support this mapping as well.
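For reference, a minimal sketch of the loading path described above; the checkpoint id below is an assumption for illustration and may differ from the released repo name:

```python
# Minimal sketch, assuming a hypothetical Gemma 3 checkpoint id for illustration.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3-4b-it"  # assumed repo id, may differ from the actual release

processor = AutoProcessor.from_pretrained(model_id)
# Resolves to Gemma3ForConditionalGeneration through the ImageTextToText mapping;
# AutoModelForVision2Seq stays reserved for bare captioning models such as BLIP.
model = AutoModelForImageTextToText.from_pretrained(model_id)
```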
I see, thanks for the explanation!
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.