[Bugfix] remove post_layernorm in siglip #8106
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
Can you add a test or modify the existing tests to be stricter so that this behaviour is checked? Thanks!
Sorry, I have no clue about adding relevant test cases. I think this is a small modification, and the tensor shape is the same before and after it. I don't know how to add test cases for this change; can you give me some specific tips?
How did you discover this bug? You can add a test for the model that triggered it by updating the corresponding file under
Thanks for your reply. I found this problem on a model I trained myself; there is no model on Hugging Face that uses SigLIP as the vision encoder, so I couldn't test directly with LLaVA/LLaVA-NeXT. Another option is to test the SigLIP model directly, but testing the vision model alone is not currently supported. Maybe I should write a new test case that supports vision encoder testing?
Ah I see, I thought it was a problem with one of our existing models. In that case there is no need to add the test, thanks for fixing!
Thank you very much! 🙏
Hmm actually, I tried this locally and it broke the tests. The existing vision models expect `post_layernorm` to be used. I think you'll have to add an optional argument to the `forward` method so that `post_layernorm` can be skipped specifically in your case.
Edit: See below message.
This is because PaliGemma uses the last hidden state, whereas LLaVA models use the second-to-last hidden state. Only the last hidden state has `post_layernorm` applied in the HF implementation. So, the real fix would be to apply `post_layernorm` only when the last hidden state is actually consumed.
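A minimal sketch of that direction, using made-up module names and toy layers rather than the actual vLLM classes: the vision tower takes an extra flag on `forward` and only applies `post_layernorm` when the caller really needs the final (post-norm) hidden state.

```python
import torch
import torch.nn as nn


class TinyVisionTowerSketch(nn.Module):
    """Illustrative only: post_layernorm can be skipped via a forward() flag."""

    def __init__(self, hidden_size: int = 64, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)
        )
        self.post_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor,
                apply_post_layernorm: bool = True) -> torch.Tensor:
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        # PaliGemma consumes the last hidden state, so post_layernorm applies;
        # LLaVA-style models consume the second-to-last hidden state, so it is skipped.
        if apply_post_layernorm:
            hidden_states = self.post_layernorm(hidden_states)
        return hidden_states
```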
In the current LLaVA-based SigLIP model, `post_layernorm` is not used. If necessary, I can add a parameter for selection.
We have to fix both CLIP and SigLIP encoders.
Thank you for your patient response. I have revised my submission. It now determines whether the last layer is used based on `num_hidden_layers_override`. If the last layer is used, `post_layernorm` is applied; otherwise it is skipped.
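Roughly how that can look (a simplified sketch with stand-in layers and a stand-in config, not the actual vLLM implementation): the constructor checks whether the possibly truncated encoder still ends at the last layer and only instantiates `post_layernorm` in that case.

```python
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn


@dataclass
class VisionConfigSketch:  # stand-in for the HF SiglipVisionConfig
    hidden_size: int = 64
    num_hidden_layers: int = 4
    layer_norm_eps: float = 1e-6


class SiglipVisionTransformerSketch(nn.Module):
    """Illustrative: build post_layernorm only when the last layer is kept."""

    def __init__(self, config: VisionConfigSketch,
                 num_hidden_layers_override: Optional[int] = None):
        super().__init__()
        num_layers = (config.num_hidden_layers if num_hidden_layers_override is None
                      else num_hidden_layers_override)
        self.layers = nn.ModuleList(
            nn.Linear(config.hidden_size, config.hidden_size)
            for _ in range(num_layers)
        )
        # The last hidden state is only produced when the full stack is kept,
        # so post_layernorm is only needed in that case.
        if num_layers == config.num_hidden_layers:
            self.post_layernorm = nn.LayerNorm(config.hidden_size,
                                               eps=config.layer_norm_eps)
        else:
            self.post_layernorm = None

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        if self.post_layernorm is not None:
            hidden_states = self.post_layernorm(hidden_states)
        return hidden_states
```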
Tests are failing. You need to update the weight loading logic as well.
thx! I have fixed the problem. I updated the weight loading logic so that the `post_layernorm` weights are skipped when the layer is not created.
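Conceptually, the weight loader has to tolerate checkpoint tensors for a module that no longer exists. A rough sketch of that idea (a hypothetical helper, not vLLM's actual `load_weights` signature):

```python
import torch
import torch.nn as nn


def load_weights_sketch(model: nn.Module,
                        weights: dict[str, torch.Tensor]) -> None:
    """Illustrative only: drop post_layernorm tensors when the module is absent."""
    params = dict(model.named_parameters())
    for name, loaded in weights.items():
        # Checkpoints still ship post_layernorm.{weight,bias}; skip them if the
        # (now optional) module was never instantiated on the serving side.
        if "post_layernorm" in name and name not in params:
            continue
        params[name].data.copy_(loaded)
```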
Tests pass so I'm approving this, thanks for the fix!
Thank you for your patient guidance!
@wnma3mz @DarkLight1337 `post_layernorm` is not used in the LLaVA model, but the changes in this PR may not solve that problem, because `num_hidden_layers_override` is used to determine the number of encoder layers, not whether `post_layernorm` is applied.
Please see #8155 |
No issue has been filed for this yet, but it is a known bug.
When SigLIP acts as a multimodal vision encoder, there is no need for `post_layernorm`, which would otherwise produce unexpected output.
In the LLaVA-NeXT code implementation, `post_layernorm` is not used; only `encoder_outputs` is used. See https://github.com/LLaVA-VL/LLaVA-NeXT/blob/48890b0cb5da882ab584689244e74802ddbd4f75/llava/model/multimodal_encoder/siglip_encoder.py#L576-L587
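To make the failure mode concrete, here is a tiny self-contained snippet (toy modules and made-up sizes, not the real models) showing that an extra `post_layernorm` changes the features the multimodal projector receives, even though the tensor shape is unchanged:

```python
import torch
import torch.nn as nn

# The LLaVA-NeXT projector is trained on the raw encoder_outputs, so applying
# post_layernorm again at serving time shifts the feature distribution.
hidden_size = 64
encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU())
post_layernorm = nn.LayerNorm(hidden_size)

patches = torch.randn(1, 729, hidden_size)      # fake SigLIP patch embeddings
encoder_outputs = encoder(patches)              # what LLaVA-NeXT actually consumes
normed = post_layernorm(encoder_outputs)        # what an extra post_layernorm yields

print(normed.shape == encoder_outputs.shape)    # True: same shape
print(torch.allclose(encoder_outputs, normed))  # False: different values
```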