- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 10.7k
          [Bugfix] Fix missing post_layernorm in CLIP
          #8155
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: 
 🚀 | 
post_layernorm in CLIPpost_layernorm in CLIP and sharded weight loading in SigLIP
      | The modification of siglip loading logic looks good to me. BTW, are the disabled  vllm/tests/models/test_llava.py Lines 23 to 27 in d331156 
 | 
| 
 Yes, that's how I discovered it. However even after fixing the weight loading, the model cannot be run because of the special preprocessing they use. It will be addressed in another PR. | 
post_layernorm in CLIP and sharded weight loading in SigLIPpost_layernorm in CLIP
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM but I left some comments! PTAL
| @DarkLight1337 "len(self.encoder.layers) < config.num_hidden_layers" may not always determine skip the "post_layer_norm" . If model uses the first-last hidden states and skip the "post_layer_norm", int this case "len(self.encoder.layers) == config.num_hidden_layers". | 
| 
 I see what you mean. Is there anything in the HF config that lets you toggle this behavior on or off though? | 
| 
 https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf/blob/main/config.json | 
| 
 | 
| Nevermind, I see that the problem actually comes from how the features are extracted in the LLaVA-OneVision model. I think you may have to modify the existing encoders' forward method to decide whether to apply post layernorm when you develop Llava-OneVision in your PR. | 
| This seems to get unnecessarily complicated than what it should be. Perhaps we can keep an internal mapping to see which VLM's vision encoders actually use the post layernorm instead of relying on the number of hidden layers as a proxy? | 
| 
 We have several cases: 
 Just from the config, we can only distinguish between case (1) and case (2, 3). We could add another parameter to the vision encoder model to specify whether to load  | 
| 
 Yea, this is also what I meant by an internal mapping (to keep track which models need post layernorm, and which not), or as you suggested this can be just an extra parameter in each model's implementation (but the user won't be able to change it from the config). Either way, this is something vLLM "maintains" instead of relying on the config file. | 
| 
 Could you please update me on the progress? This will trigger bugs in LLaVA models with multi videos/images input. | 
| 
 
 | 
| Can you open a new issue so it can be easily tracked? Thanks and sorry for the delay, we have been quite swamped lately. | 
| 
 Sure. Thanks. | 
| 
 FYI. We can use #9186 to track it. | 
Signed-off-by: Alvant <[email protected]>
Signed-off-by: Amit Garg <[email protected]>
Signed-off-by: LeiWang1999 <[email protected]>
CLIP encoder currently assumes that
len(self.encoder.layers) < config.num_hidden_layerswhich may not always be the case.This is a follow-up to #8106.
It additionally fixes a potential issue in SigLIP where the weights are not sharded like in CLIP. cc @Isotr0pyFixed in #8269