Add Molmo (7B-D, 7B-O) #33962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Status: Open
molbap wants to merge 209 commits into `huggingface:main` from `molbap:add_molmo`
Commits (209)
dc6fcac
add base convert keys + chat template
molbap 574e01f
Merge branch 'main' into add_molmo
molbap 0bd413b
draft: add up modular files for molmo
molbap 9e454e4
Squashed commit of the following:
molbap d82c471
sync changes
molbap 339a8d3
push a simple fix
ArthurZucker c0c25d6
finish fixing
ArthurZucker 5ee6a44
Merge branch 'main' into add_molmo
molbap 33e43ec
suppress diff
molbap d23e1c1
Merge branch 'main' into add_molmo
molbap c8c12fe
fix
ArthurZucker 0909c02
style
ArthurZucker 1799d20
add config + 2d pooling
molbap fb133d4
suppress changes
molbap 5ba4105
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap a2a6a9b
fix
ArthurZucker 8fe7a9f
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
ArthurZucker 20681f5
conversion works :raised_hands:
molbap c85af98
fixup
molbap 35ea3cc
handle missing MOLMO_VISION_ATTENTION_CLASSES
molbap ab79d0e
fix
molbap b9bdf99
fix fused keys mismatch
molbap 98d5ccd
fix
molbap 3bca742
[Modular-breaking] add manually vision attention classes list
molbap a13fe05
finish weight conversion script
molbap fac8dfd
add more keys
molbap c1e5f19
flipped the linear layers
molbap a68e5f5
add pooling forward + draft general forward
molbap 8298b80
modeling file with swiglu, forward(input_ids) passing
molbap 9f69c6b
BIG push of image processor
molbap 0711e08
add missing objects to init
molbap 7efe22e
Merge branch 'main' into add_molmo
molbap f5bd3b0
fix up wrong channel dimension
molbap 3ae884f
fix typo
molbap 3ef60c0
add missing image token indices used in forward
molbap cf9d4ab
pad patch orderings
molbap 91a2d3c
clean up conversion script
molbap 0f7904f
remind that tests are TODO
molbap 577e347
merge main
zucchini-nlp b514041
at least it runs like this
zucchini-nlp cf6cb5d
add bos token
molbap 26c517d
add bos token in prompt
molbap 35c168d
fix processor, missing batching img_mask
molbap e7275c7
fix image masks + batching
molbap 3e7530d
working version
zucchini-nlp 4bbc89b
+1 only on non masked indices
zucchini-nlp 54e072b
attemp 1 to make modular work
zucchini-nlp 1e99752
update conversion to fit all ckpt + chat template + clean up a bit
zucchini-nlp 92a1f31
fix processing tests
zucchini-nlp 42330e0
add more tests (failing for now)
zucchini-nlp 932f6d1
fix the conversion
zucchini-nlp aafb827
done!
zucchini-nlp 36cc6dd
nit
zucchini-nlp f399c3a
some tests are failing, coming back tomorrow
zucchini-nlp 7322227
adapt to any image format
molbap e4db50a
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap 205a755
try to get batched generation working
molbap eb61617
fix other tests, should work now
zucchini-nlp b77d947
adjust test for batching
zucchini-nlp ba4dd50
little bit of style
zucchini-nlp 0e2d184
docs + imports + automapping
zucchini-nlp 9a83706
remove images kwargs
zucchini-nlp 171eb8e
some unused config attributes
zucchini-nlp 35b517a
remove additional vocab size and pad lm head
zucchini-nlp 6a0cbc5
remove einops dependency
molbap 5c7b141
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap 434d4b1
dont skip these tests
zucchini-nlp 4645f97
format + add integration testing
molbap 48f2e21
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap 4bb4e48
fix tests + fix 72B conversion
molbap e676782
fix format
molbap a74bda2
modualr kinda works but adds extra classes like `VisionVisionModel` :(
zucchini-nlp 2c428ae
accomodate 7B-O version as well (broken)
molbap d338153
merge, fix conflicts and clean up modular extra code
molbap 00376c4
fix 7B-O
zucchini-nlp 48354fe
remove unused code path
zucchini-nlp d738493
nit
zucchini-nlp d0e90d4
make modular work mostly
zucchini-nlp f06b6d9
fix imports
zucchini-nlp 9fc25c0
update modulat last time
zucchini-nlp 38dc9e8
fix copies
zucchini-nlp eb77f3c
fix copies
zucchini-nlp 190cc35
fix tests
zucchini-nlp 84ed244
initial push of fast processor
molbap b4d48d5
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap 1298d08
Merge branch 'main' into add_molmo
molbap 6687d43
fix various issues + tests
molbap 5f79577
add Molmo submodules as private
molbap 9e72758
do not test submodules
molbap 439aed6
[run-slow] molmo
molbap 5a6a965
underscore prefixed method is not public
molbap b9746a8
fix tests
molbap 2090ed6
fix docs
molbap 8ad3a25
[run-slow] molmo
molbap 0d10ee4
Merge branch 'main' into add_molmo
molbap 9bd96f5
fix cache shape
molbap af5468b
[run-slow] molmo
molbap c02c6de
trigger CI
molbap 5f35055
mark flaky test
molbap 2b7af87
add missing objects
molbap 9f0f09d
add config to init
molbap 74ebb24
more init fixes
molbap 8b00c44
fix style
molbap d6403ad
fix?
molbap eb43cb9
fix
molbap 33f0624
what is this again
molbap cc59007
Merge branch 'main' into add_molmo
molbap 23ae692
is this real life
molbap 4c456e7
it was real life, fix broken eager
molbap 91f2820
fix attribtues
molbap e2df6bc
this attention should be fixed
molbap ae77cc6
set 7b test to bf16
molbap 166b28a
[run-slow] molmo
molbap 50bcb7c
Merge branch 'main' into add_molmo
molbap bf012d8
[run-slow] molmo
molbap 6e0634b
fix text (variability T4/A100)
molbap 8569fd0
push clean Fast (x3!) image processor
molbap fd401bc
Merge branch 'main' into add_molmo
molbap 86acf22
fix modular changes from main
molbap 1ebea3c
Merge branch 'main' into add_molmo
molbap 5ebc6f0
push fast image proc with device check
molbap 19d2689
push fast image proc with device check
molbap c652bb9
format
molbap 50c21e5
images kwargs were missing
molbap 092da76
merge and fix conflicts
molbap 1254eac
style
molbap bd39143
update with modular conversion
molbap 3efcb13
add torch import
molbap 56ae76f
style
molbap 9417ff7
protect import
molbap 51f9336
fix modular
molbap 3719481
Merge branch 'main' into add_molmo
molbap f394b02
cherry-pick: cohere (from 67c3fcd4f32c64e07f302f00243be7d54914d78b)
molbap e418aa3
fix modular with cohere interface
molbap 5af0b57
fixup cohere all imports
molbap a574b93
fix bf16 test output
molbap 9f3018d
fix
molbap e2d1ba8
style
molbap c872095
Merge branch 'main' into add_molmo
molbap 41ab3a7
uniformize fast image processor
molbap dd74b78
Merge branch 'main' into add_molmo
molbap d052666
fix merge
molbap 0a822f4
unbloat modular a tad
molbap 8ebf44f
fix import
molbap 4e6070f
fix modular
molbap a8758bf
remove print :eyes:
molbap 64c2ae8
Merge branch 'main' into add_molmo
molbap 0e69cda
call correct qk norm
molbap 279729d
Merge branch 'main' into add_molmo
molbap 3afdd77
remove forward last hook debug
molbap 4df5c1a
fix qk norms, order of operations, etc
molbap f16e404
format
molbap ed891f7
fix modular
molbap b939817
fixup modular (some rebasing needed)
molbap 6f480be
downstream debugger changes
molbap 4eaff6a
likely rebase errors
molbap 73699ea
format
molbap 638a568
fixup modeling test
molbap 8ff9df1
make sure to process images only when images are present
molbap 0e97e08
fix fused qk norms
molbap be9b810
broken modular, qknorm was unfused in cohere
molbap b0213e4
Merge branch 'main' into add_molmo
molbap 1f0fc3e
typo
molbap 562a889
small cleanup
molbap 61d6a4a
Merge branch 'main' into add_molmo
molbap fe970df
simplify molmo vision with clip refactor
molbap bf578e3
style
molbap 6332eae
carried over typo after init merging
molbap 73c8233
Merge branch 'main' into add_molmo
molbap 15f7c05
better kv groups
molbap c0de7ba
refix
molbap 87f069e
Merge branch 'main' into add_molmo
molbap 69929a3
wrong ruff version :no_mouth:
molbap 574e304
ruff again
molbap 324f1be
Update docs/source/en/model_doc/molmo.md
molbap 7d55b7a
Update docs/source/en/model_doc/molmo.md
molbap 5ab00b3
Merge branch 'main' into add_molmo
molbap 51b36c7
Merge branch 'add_molmo' of github.com:molbap/transformers into add_m…
molbap fe1e2e8
update
molbap c8f9553
merge issue
molbap ff1862e
rebase
molbap 4770401
wrong stash pop
molbap caf6257
left padding, chat template, and wrong pad token
molbap 53a5801
add docs
molbap fd417ae
remove debug, fix left-padded batched generation :warning_sign:
molbap cc91650
fixes
molbap fdcfadd
style
molbap d159c2a
fixup config
molbap 39a78d7
woops
molbap 025c075
clean up a bit
molbap 5641b62
clean up
molbap 886778b
Merge branch 'main' into add_molmo
molbap 3d2f6d9
separate head from model
molbap d7f89a2
happify CI
molbap 7766001
more prettifying +docs
molbap a89dbac
fixups
molbap fc9ea4f
update doc
molbap f642ade
remove vision2seq
molbap 4b62b00
minor changes doc + format
molbap 3ff333e
Merge branch 'main' into add_molmo
molbap dbd47b4
fixup
molbap 6d536f6
fixes after main merge
molbap 8b54db9
apply remainder of code review
molbap 9f84789
Merge branch 'main' into add_molmo
molbap a04f709
Merge branch 'main' into add_molmo
molbap 1da9ab5
fixup
molbap f697e67
blindly upstream
molbap 6f7b6e4
update fast proc
molbap e1326a1
kickstart
molbap
`docs/source/en/model_doc/molmo.md` (new file, +138 lines):

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Molmo

## Overview

The Molmo model was proposed in [Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models](https://arxiv.org/abs/2409.17146) by Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi.
Molmo, developed by the AllenAI team, is an open-source multimodal AI model capable of processing text and images within a unified framework. It outperforms larger models in efficiency and accuracy, leveraging high-quality datasets such as PixMo for tasks like captioning, question answering, and visual pointing.

The abstract from the paper is the following:

*Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/molmo_arch.png"
alt="drawing" width="600"/>

<small> Molmo incorporates images by encoding various patches of the input image. Taken from the <a href="https://arxiv.org/abs/2409.17146">original paper.</a> </small>
Tips:

- We recommend setting `processor.tokenizer.padding_side = "left"` for batched generation because it leads to more accurate results.

This model was contributed by [Molbap](https://huggingface.co/Molbap).
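The left-padding tip above can be illustrated with a toy sketch. With left padding, the last position of every row in the batch holds that sequence's true final token, which is the position the model extends during generation. Plain Python lists stand in for token id tensors here; `PAD` and `pad_batch` are hypothetical names for illustration, not part of the Molmo API:

```python
# Toy illustration of left vs. right padding for batched generation.
# In practice this is handled by processor.tokenizer.padding_side = "left".
PAD = 0  # hypothetical pad token id

def pad_batch(seqs, side):
    """Pad all sequences in the batch to the same length on the given side."""
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        pads = [PAD] * (width - len(s))
        out.append(pads + s if side == "left" else s + pads)
    return out

batch = [[5, 6], [7, 8, 9]]
left = pad_batch(batch, "left")    # [[0, 5, 6], [7, 8, 9]]
right = pad_batch(batch, "right")  # [[5, 6, 0], [7, 8, 9]]

# With left padding, position -1 is always a real token, so the model
# continues every sequence from its actual last token.
print([row[-1] for row in left])   # [6, 9]
# With right padding, the shorter sequence ends in PAD at position -1,
# which is why right-padded batched generation degrades results.
print([row[-1] for row in right])  # [0, 9]
```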
## Usage example

### Single image inference

Here's how to load the model and perform inference in half precision (`torch.float16`):

```python
import torch
from transformers import MolmoForConditionalGeneration, AutoProcessor

model = MolmoForConditionalGeneration.from_pretrained(
    "allenai/Molmo-7B-D-hf", torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("allenai/Molmo-7B-D-hf")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://picsum.photos/id/237/536/354"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
```
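Note that for decoder-only generation, `model.generate` returns the prompt tokens followed by the newly generated tokens, so decoding `output[0]` as above includes the prompt text. A toy sketch of slicing out only the continuation (plain lists stand in for tensors; the id values are hypothetical):

```python
# generate() returns prompt + continuation; decode only the new tokens.
prompt_ids = [101, 42, 7]             # stand-in for inputs["input_ids"][0]
output_ids = [101, 42, 7, 88, 99]     # stand-in for output[0] from generate()

# Drop the prompt prefix before decoding.
new_tokens = output_ids[len(prompt_ids):]
print(new_tokens)  # [88, 99]
```

With real tensors the equivalent slice is `output[0][inputs["input_ids"].shape[-1]:]`.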
## MolmoConfig

[[autodoc]] MolmoConfig

## MolmoTextConfig

[[autodoc]] MolmoTextConfig

## MolmoVisionConfig

[[autodoc]] MolmoVisionConfig

## MolmoPoolingConfig

[[autodoc]] MolmoPoolingConfig

## MolmoImageProcessor

[[autodoc]] MolmoImageProcessor

## MolmoImageProcessorFast

[[autodoc]] MolmoImageProcessorFast

## MolmoProcessor

[[autodoc]] MolmoProcessor

## MolmoAdapterModel

[[autodoc]] MolmoAdapterModel
    - forward

## MolmoModel

[[autodoc]] MolmoModel
    - forward

## MolmoTextModel

[[autodoc]] MolmoTextModel
    - forward

## MolmoVisionModel

[[autodoc]] MolmoVisionModel
    - forward

## MolmoForCausalLM

[[autodoc]] MolmoForCausalLM
    - forward

## MolmoForConditionalGeneration

[[autodoc]] MolmoForConditionalGeneration
    - forward
`__init__.py` (new file, +30 lines):

```python
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_molmo import *
    from .image_processing_molmo import *
    from .image_processing_molmo_fast import *
    from .modeling_molmo import *
    from .processing_molmo import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
```
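The `_LazyModule` indirection above defers the heavy model imports until an attribute is first accessed. A minimal standalone sketch of that general pattern (using `importlib` directly; this is an illustration of the idea, not the transformers implementation — `LazyModule` and its mapping are hypothetical names):

```python
import importlib
import types

class LazyModule(types.ModuleType):
    """Toy lazy module: resolves attributes to their real modules on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Maps attribute name -> name of the module that actually defines it.
        self._import_structure = import_structure

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails, i.e. on first access.
        module_name = self._import_structure.get(attr)
        if module_name is None:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(module_name), attr)
        setattr(self, attr, value)  # cache so __getattr__ is skipped next time
        return value

# "math" is only imported when .sqrt is first touched.
demo = LazyModule("demo", {"sqrt": "math"})
print(demo.sqrt(9.0))
```

In the real file, `define_import_structure(_file)` builds that attribute-to-submodule mapping from the sibling files, and replacing `sys.modules[__name__]` makes the package itself behave lazily.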