[Bugfix] Fix missing post_layernorm in CLIP
#8155
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Changed the title: [Bugfix] Fix missing post_layernorm in CLIP → [Bugfix] Fix missing post_layernorm in CLIP and sharded weight loading in SigLIP
The modification of the SigLIP loading logic looks good to me. BTW, are the disabled tests (vllm/tests/models/test_llava.py, lines 23 to 27 at d331156) related to this?
Yes, that's how I discovered it. However, even after fixing the weight loading, the model cannot be run because of the special preprocessing it uses. That will be addressed in another PR.
Overall LGTM but I left some comments! PTAL
@DarkLight1337 `len(self.encoder.layers) < config.num_hidden_layers` may not always determine whether to skip `post_layernorm`. If a model uses the first/last hidden states and skips `post_layernorm`, then `len(self.encoder.layers) == config.num_hidden_layers` in that case.
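To illustrate the failure mode, here is a minimal sketch; `VisionEncoder` and `skips_post_layernorm` are illustrative stand-ins, not vLLM's actual classes:

```python
from types import SimpleNamespace

config = SimpleNamespace(num_hidden_layers=24)

class VisionEncoder:
    """Stand-in for the CLIP/SigLIP encoder stack (illustrative only)."""

    def __init__(self, config, num_hidden_layers_override=None):
        num_layers = (num_hidden_layers_override
                      if num_hidden_layers_override is not None
                      else config.num_hidden_layers)
        self.layers = [f"layer_{i}" for i in range(num_layers)]

def skips_post_layernorm(encoder, config):
    # The proxy under discussion: fewer layers than the config implies
    # that post_layernorm was dropped along with the trailing layers.
    return len(encoder.layers) < config.num_hidden_layers

# Case the proxy handles: an intermediate feature layer is used, so the
# trailing layers (and post_layernorm) are dropped.
truncated = VisionEncoder(config, num_hidden_layers_override=23)
assert skips_post_layernorm(truncated, config)

# Case it misses: all layers are kept, yet post_layernorm should still
# be skipped (e.g. a model consuming first/last hidden states directly).
full = VisionEncoder(config)
assert not skips_post_layernorm(full, config)  # proxy says "apply it"
```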
I see what you mean. Is there anything in the HF config that lets you toggle this behavior on or off, though?
https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf/blob/main/config.json |
Never mind, I see that the problem actually comes from how the features are extracted in the LLaVA-OneVision model. I think you may have to modify the existing encoders' forward method to decide whether to apply post layernorm when you develop LLaVA-OneVision in your PR.
This seems more complicated than it should be. Perhaps we can keep an internal mapping of which VLMs' vision encoders actually use the post layernorm, instead of relying on the number of hidden layers as a proxy?
We have several cases:
1. The vision tower only uses the output of an intermediate layer, so the trailing layers and `post_layernorm` are not loaded (`len(self.encoder.layers) < config.num_hidden_layers`).
2. The vision tower uses the output of the last layer and applies `post_layernorm`.
3. The vision tower uses the output of the last layer but skips `post_layernorm`.

Just from the config, we can only distinguish between case (1) and cases (2, 3). We could add another parameter to the vision encoder model to specify whether to load `post_layernorm`.
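A rough sketch of the extra-parameter idea; the `require_post_norm` name and the fallback heuristic are assumptions for illustration, not the final API:

```python
import torch.nn as nn
from types import SimpleNamespace

class VisionTransformerSketch(nn.Module):
    """Illustrative vision tower where an explicit flag, rather than the
    layer count, decides whether post_layernorm exists (and is loaded)."""

    def __init__(self, config, num_hidden_layers_override=None,
                 require_post_norm=None):
        super().__init__()
        num_layers = (num_hidden_layers_override
                      if num_hidden_layers_override is not None
                      else config.num_hidden_layers)
        # Stand-in blocks; the real encoder holds attention/MLP layers.
        self.layers = nn.ModuleList(nn.Identity() for _ in range(num_layers))
        if require_post_norm is None:
            # Fall back to the old layer-count heuristic when unspecified.
            require_post_norm = num_layers == config.num_hidden_layers
        self.post_layernorm = (nn.LayerNorm(config.hidden_size)
                               if require_post_norm else None)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x if self.post_layernorm is None else self.post_layernorm(x)

# Case (3): all layers are kept, but post_layernorm is explicitly disabled.
cfg = SimpleNamespace(num_hidden_layers=2, hidden_size=8)
tower = VisionTransformerSketch(cfg, require_post_norm=False)
assert tower.post_layernorm is None
```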
Yeah, this is also what I meant by an internal mapping (to keep track of which models need post layernorm and which don't), or as you suggested, this can just be an extra parameter in each model's implementation (but then the user won't be able to change it from the config). Either way, this is something vLLM "maintains" instead of relying on the config file.
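A sketch of what such a vLLM-maintained mapping could look like (all keys and flags below are illustrative, not an actual vLLM table):

```python
# Hypothetical table maintained inside vLLM, keyed by vision tower type.
VISION_TOWER_USES_POST_NORM = {
    "clip_vision_model": True,
    "siglip_vision_model": True,
    "llava_onevision_vision_tower": False,  # consumes hidden states directly
}

def uses_post_norm(vision_arch: str) -> bool:
    # Conservative default: apply post_layernorm for unknown towers.
    return VISION_TOWER_USES_POST_NORM.get(vision_arch, True)
```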
Could you please update me on the progress? This will trigger bugs in LLaVA models with multi-video/multi-image inputs.
Can you open a new issue so it can be easily tracked? Thanks, and sorry for the delay, we have been quite swamped lately.
Sure. Thanks.
FYI, we can use #9186 to track it.
Signed-off-by: Alvant <[email protected]>
Signed-off-by: Amit Garg <[email protected]>
Signed-off-by: LeiWang1999 <[email protected]>
The CLIP encoder currently assumes that `len(self.encoder.layers) < config.num_hidden_layers`, which may not always be the case. This is a follow-up to #8106.

It additionally fixes a potential issue in SigLIP where the weights are not sharded like in CLIP. cc @Isotr0py

Fixed in #8269
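For context, a condensed sketch of the stacked-parameter loading pattern that CLIP already uses and that this fix brings to SigLIP (simplified for illustration; not the exact code in this PR):

```python
# vLLM's parallel linear layers expose a `weight_loader` that writes a
# checkpoint tensor into the correct shard of a fused parameter.
stacked_params_mapping = [
    # (name of fused param, name in the HF checkpoint, shard id)
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
]

def load_weights(model, weights):
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights:
        for param_name, ckpt_name, shard_id in stacked_params_mapping:
            if ckpt_name not in name:
                continue
            # q/k/v checkpoint weights all land in the fused qkv_proj,
            # each in its own shard.
            param = params_dict[name.replace(ckpt_name, param_name)]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Non-stacked parameters use their own loader if present.
            param = params_dict[name]
            loader = getattr(param, "weight_loader", None)
            if loader is not None:
                loader(param, loaded_weight)
            else:
                param.data.copy_(loaded_weight)
```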