Conversation

wnma3mz (Contributor) commented Sep 3, 2024

No issue has been filed for this yet, but it is a known bug.

  • What is the bug?

When SigLIP acts as a multimodal vision encoder, post_layernorm should not be applied; applying it produces unexpected output.

  • What is the fix?

Comment out post_layernorm.

  • Why was the original implementation wrong?

In the LLaVA-NeXT implementation, post_layernorm is not used; the model consumes encoder_outputs directly:

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/48890b0cb5da882ab584689244e74802ddbd4f75/llava/model/multimodal_encoder/siglip_encoder.py#L576-L587
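
For context, a minimal sketch (not the vLLM or LLaVA-NeXT code; all names here are illustrative) of the consumption pattern the linked code shows: the encoder's intermediate hidden states are used directly, and post_layernorm is never applied to the selected features.

```python
import torch
import torch.nn as nn

class ToySigLipTower(nn.Module):
    """Toy stand-in for a SigLIP vision tower; sizes are illustrative."""

    def __init__(self, hidden_size=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)])
        self.post_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Collect every intermediate hidden state, as the HF encoder does
        # with output_hidden_states=True.
        hidden_states = []
        for layer in self.layers:
            x = layer(x)
            hidden_states.append(x)
        return hidden_states  # post_layernorm is never applied here

tower = ToySigLipTower()
# LLaVA-style feature selection: take the second-to-last hidden state.
features = tower(torch.randn(1, 16, 32))[-2]
```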

github-actions bot commented Sep 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which consists of a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge.

🚀

DarkLight1337 (Member) commented Sep 3, 2024

Can you add a test or modify the existing tests to be stricter so that this behaviour is checked? Thanks!

wnma3mz (Contributor, Author) commented Sep 3, 2024

> Can you add a test or modify the existing tests to be stricter so that this behaviour is checked? Thanks!

Sorry, I'm not sure how to add relevant test cases.

This is a small modification, and the tensor shape is the same before and after the change.

I don't know how to add a test for this change; can you give me some specific tips?

DarkLight1337 (Member) commented Sep 3, 2024

How did you discover this bug? You can add a test for the model that triggered it by updating the corresponding file under tests/models.

wnma3mz (Contributor, Author) commented Sep 3, 2024

> How did you discover this bug? You can add a test for the model that triggered it by updating the corresponding file under tests/models.

Thanks for your reply. I found this problem on a model I trained myself; there is no Hugging Face model that uses SigLIP as the vision encoder, so I couldn't test directly with LLaVA / LLaVA-NeXT.

Another option is to test the SigLIP model directly, but testing a vision model on its own is not currently supported. Maybe I should write a new test case that supports vision encoder testing?

DarkLight1337 (Member):

Ah I see, I thought it was a problem with one of our existing models. In that case there is no need to add the test; thanks for fixing!

DarkLight1337 enabled auto-merge (squash) September 3, 2024 11:07
github-actions bot added the ready label (ONLY add when PR is ready to merge / full CI is needed) Sep 3, 2024
wnma3mz (Contributor, Author) commented Sep 3, 2024

> Ah I see, I thought it was a problem with one of our existing models. In that case there is no need to add the test; thanks for fixing!

Thank you very much!🙏

DarkLight1337 (Member) left a review comment:


Hmm actually, I tried this locally and it broke the tests. The existing vision models expect post_layernorm to be used. I think you'll have to add an optional argument to the forward method so that post_layernorm can be skipped specifically in your case.

Edit: see the message below.

DarkLight1337 (Member) commented Sep 3, 2024

This is because PaliGemma uses the last hidden state, whereas LLaVA models use the second-to-last hidden state. Only the last hidden state in the transformers library is subject to post_layernorm. Therefore, the current implementation of CLIP in vLLM will also break if the last feature layer is selected (since post_layernorm isn't implemented there).

So, the real fix would be to apply post_layernorm for each visual encoder in vLLM only if all of the encoder layers are used.
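
A hedged sketch of that rule (num_hidden_layers_override mirrors the vLLM constructor argument; the other names, and the stand-in encoder, are illustrative rather than the real vision transformer):

```python
import torch
import torch.nn as nn

class VisionTransformerSketch(nn.Module):
    def __init__(self, hidden_size=32, num_hidden_layers_override=None):
        super().__init__()
        self.num_hidden_layers_override = num_hidden_layers_override
        self.encoder = nn.Linear(hidden_size, hidden_size)  # stand-in encoder
        self.post_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        hidden_state = self.encoder(x)
        # Apply post_layernorm only when the full encoder is used, matching
        # the semantics of last_hidden_state in transformers.
        if self.num_hidden_layers_override is None:
            hidden_state = self.post_layernorm(hidden_state)
        return hidden_state

# With an override (e.g. LLaVA's second-to-last layer), the norm is skipped.
out = VisionTransformerSketch(num_hidden_layers_override=23)(torch.randn(1, 16, 32))
```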

wnma3mz (Contributor, Author) commented Sep 3, 2024

> Hmm actually, I tried this locally and it broke the tests. The existing vision models expect post_layernorm to be used. I think you'll have to add an optional argument to the forward method so that post_layernorm can be skipped specifically in your case.

In the current LLaVA-based SigLIP model, post_layernorm is not used, just as the head is not used.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/48890b0cb5da882ab584689244e74802ddbd4f75/llava/model/multimodal_encoder/siglip_encoder.py#L576-L587

If necessary, I can add a parameter to make this selectable.

DarkLight1337 (Member) commented Sep 3, 2024

We have to fix both CLIP and SigLIP encoders:

  • CLIP should load post_layernorm if the last feature layer is selected. Currently, it always omits the layer.
  • SigLIP should omit post_layernorm if the last feature layer is selected. Currently, it always loads the layer.

wnma3mz (Contributor, Author) commented Sep 3, 2024

But post_layernorm does not appear in CLIPVisionTransformer at all. If needed, I could make similar modifications to CLIP.

wnma3mz (Contributor, Author) commented Sep 3, 2024

> This is because PaliGemma uses the last hidden state, whereas LLaVA models use the second-to-last hidden state. Only the last hidden state in the transformers library is subject to post_layernorm. Therefore, the current implementation of CLIP in vLLM will also break if the last feature layer is selected (since post_layernorm isn't implemented there).
>
> So, the real fix would be to apply post_layernorm for each visual encoder in vLLM only if all of the encoder layers are used.

Thank you for your patient response. I have revised my submission.

It now determines whether the last layer is used via self.num_hidden_layers_override: if the last layer is used (i.e. self.num_hidden_layers_override is None), post_layernorm is applied; otherwise, post_layernorm is skipped.

DarkLight1337 (Member):

Tests are failing. You need to update the weight loading logic as well.

wnma3mz (Contributor, Author) commented Sep 3, 2024

> Tests are failing. You need to update the weight loading logic as well.

Thanks! I have fixed the problem.

I added need_post_layernorm to control whether the post_layernorm weights are initialized and loaded.
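
A minimal sketch of what such a guard could look like (need_post_layernorm and num_hidden_layers_override follow the discussion above; everything else, including the simplified load_weights loop, is illustrative rather than the actual vLLM implementation):

```python
import torch
import torch.nn as nn

class SiglipVisionSketch(nn.Module):
    def __init__(self, hidden_size=32, num_hidden_layers_override=None):
        super().__init__()
        self.num_hidden_layers_override = num_hidden_layers_override
        self.proj = nn.Linear(hidden_size, hidden_size)
        # Only instantiate post_layernorm when it will actually be used.
        if self.need_post_layernorm:
            self.post_layernorm = nn.LayerNorm(hidden_size)

    @property
    def need_post_layernorm(self) -> bool:
        # The last hidden state is only requested when no override is given.
        return self.num_hidden_layers_override is None

    def load_weights(self, weights):
        params = dict(self.named_parameters())
        for name, loaded in weights:
            # Skip checkpoint tensors for the layer we never created.
            if "post_layernorm" in name and not self.need_post_layernorm:
                continue
            params[name].data.copy_(loaded)
```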

DarkLight1337 (Member) left a review comment:


Tests pass so I'm approving this, thanks for the fix!

DarkLight1337 merged commit d331156 into vllm-project:main Sep 4, 2024
wnma3mz (Contributor, Author) commented Sep 4, 2024

Thank you for your patient guidance!

litianjian (Contributor) commented:

> > This is because PaliGemma uses the last hidden state, whereas LLaVA models use the second-to-last hidden state. Only the last hidden state in the transformers library is subject to post_layernorm. Therefore, the current implementation of CLIP in vLLM will also break if the last feature layer is selected (since post_layernorm isn't implemented there).
> >
> > So, the real fix would be to apply post_layernorm for each visual encoder in vLLM only if all of the encoder layers are used.
>
> Thank you for your patient response. I have revised my submission.
>
> It now determines whether the last layer is used via self.num_hidden_layers_override: if the last layer is used (i.e. self.num_hidden_layers_override is None), post_layernorm is applied; otherwise, post_layernorm is skipped.

@wnma3mz @DarkLight1337 post_layer_norm is not used in the LLaVA model, but the changes in this PR may not solve that problem, because self.num_hidden_layers_override determines the number of encoder layers, not whether post_layer_norm is applied.

DarkLight1337 (Member):

> @wnma3mz @DarkLight1337 post_layer_norm is not used in the LLaVA model, but the changes in this PR may not solve that problem, because self.num_hidden_layers_override determines the number of encoder layers, not whether post_layer_norm is applied.

Please see #8155

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025