Conversation

DarkLight1337
Member

@DarkLight1337 DarkLight1337 commented Sep 4, 2024

The CLIP encoder currently assumes that len(self.encoder.layers) < config.num_hidden_layers, which may not always be the case.

This is a follow-up to #8106.

It additionally fixes a potential issue in SigLIP where the weights are not sharded as they are in CLIP (cc @Isotr0py). Fixed in #8269.
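
For illustration, here is a minimal sketch of the pattern being fixed (the class, config attributes, and parameter names are assumptions, not vLLM's actual code): the vision tower may be built with fewer layers than config.num_hidden_layers when only an intermediate layer's output is needed, and post_layernorm should only be skipped in that truncated case.

import torch.nn as nn

class VisionEncoderSketch(nn.Module):
    """Illustrative only: not vLLM's actual CLIP/SigLIP code."""

    def __init__(self, config, num_hidden_layers_override=None):
        super().__init__()
        # A VLM that only consumes an intermediate layer's output may ask
        # for fewer layers than config.num_hidden_layers.
        num_layers = num_hidden_layers_override or config.num_hidden_layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(config.hidden_size,
                                       config.num_attention_heads)
            for _ in range(num_layers)
        ])
        # The fix: post_layernorm must exist whenever the full stack is built;
        # it can only be skipped when the stack is truncated.
        if num_layers == config.num_hidden_layers:
            self.post_layernorm = nn.LayerNorm(config.hidden_size)
        else:
            self.post_layernorm = None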

@github-actions

github-actions bot commented Sep 4, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 DarkLight1337 changed the title [Bugfix] Fix missing post_layernorm in CLIP [Bugfix] Fix missing post_layernorm in CLIP and sharded weight loading in SigLIP Sep 4, 2024
@Isotr0py
Member

Isotr0py commented Sep 4, 2024

The modification of siglip loading logic looks good to me.

BTW, is the disabled "TIGER-Lab/Mantis-8B-siglip-llama3" model test also related to this post_layernorm issue? I saw it was disabled before.

models = [
"llava-hf/llava-1.5-7b-hf",
# TODO: Get this model to produce meaningful output in vLLM
# "TIGER-Lab/Mantis-8B-siglip-llama3",
]

@DarkLight1337
Member Author

DarkLight1337 commented Sep 4, 2024

The modification of siglip loading logic looks good to me.

BTW, is the disabled "TIGER-Lab/Mantis-8B-siglip-llama3" model test also related to this post_layernorm issue? I saw it was disabled before.

models = [
"llava-hf/llava-1.5-7b-hf",
# TODO: Get this model to produce meaningful output in vLLM
# "TIGER-Lab/Mantis-8B-siglip-llama3",
]

Yes, that's how I discovered it. However, even after fixing the weight loading, the model cannot be run because of the special preprocessing it uses. That will be addressed in another PR.

@DarkLight1337 DarkLight1337 changed the title [Bugfix] Fix missing post_layernorm in CLIP and sharded weight loading in SigLIP [Bugfix] Fix missing post_layernorm in CLIP Sep 8, 2024
Member

@ywang96 ywang96 left a comment

Overall LGTM but I left some comments! PTAL

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 10, 2024
@ywang96 ywang96 enabled auto-merge (squash) September 10, 2024 06:27
@ywang96 ywang96 merged commit da1a844 into vllm-project:main Sep 10, 2024
50 checks passed
@DarkLight1337 DarkLight1337 deleted the fix-clip-layernorm branch September 10, 2024 08:23
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 12, 2024
@litianjian
Contributor

@DarkLight1337 "len(self.encoder.layers) < config.num_hidden_layers" may not always determine skip the "post_layer_norm" . If model uses the first-last hidden states and skip the "post_layer_norm", int this case "len(self.encoder.layers) == config.num_hidden_layers".

@DarkLight1337
Member Author

@DarkLight1337 "len(self.encoder.layers) < config.num_hidden_layers" may not always determine skip the "post_layer_norm" . If model uses the first-last hidden states and skip the "post_layer_norm", int this case "len(self.encoder.layers) == config.num_hidden_layers".

I see what you mean. Is there anything in the HF config that lets you toggle this behavior on or off though?

@litianjian
Contributor

@DarkLight1337 "len(self.encoder.layers) < config.num_hidden_layers" may not always determine skip the "post_layer_norm" . If model uses the first-last hidden states and skip the "post_layer_norm", int this case "len(self.encoder.layers) == config.num_hidden_layers".

I see what you mean. Is there anything in the HF config that lets you toggle this behavior on or off though?

https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf/blob/main/config.json
I am developing the LLaVA-OneVision (llava-ov) models in vLLM. This config triggers the issue.

@DarkLight1337
Member Author

DarkLight1337 commented Sep 14, 2024

Looks like SigLIP has a vision_config.use_vision_head setting. Let me open a PR for this.

@DarkLight1337
Member Author

DarkLight1337 commented Sep 14, 2024

Never mind, I see that the problem actually comes from how the features are extracted in the LLaVA-OneVision model. I think you may have to modify the existing encoders' forward method to decide whether to apply post_layernorm when you develop LLaVA-OneVision in your PR.
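
For illustration only, a rough sketch of that idea (the flag name and surrounding structure are assumptions, not a final API): the VLM decides whether the shared encoder applies post_layernorm to the final hidden states.

import torch
import torch.nn as nn

class EncoderForwardSketch(nn.Module):
    # Assumes self.embeddings, self.layers and self.post_layernorm are set up
    # as in the earlier sketch; only the forward logic is shown here.
    def forward(self, pixel_values: torch.Tensor,
                apply_post_layernorm: bool = True) -> torch.Tensor:
        hidden_states = self.embeddings(pixel_values)
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        # The calling VLM (e.g. a LLaVA-OneVision implementation) chooses
        # whether to normalize the final hidden states, rather than the
        # encoder inferring it from its number of layers.
        if apply_post_layernorm and self.post_layernorm is not None:
            hidden_states = self.post_layernorm(hidden_states)
        return hidden_states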

@ywang96
Member

ywang96 commented Sep 14, 2024

This seems to be getting more complicated than it should be. Perhaps we can keep an internal mapping of which VLMs' vision encoders actually use the post layernorm, instead of relying on the number of hidden layers as a proxy?

@DarkLight1337
Member Author

DarkLight1337 commented Sep 14, 2024

This seems to be getting more complicated than it should be. Perhaps we can keep an internal mapping of which VLMs' vision encoders actually use the post layernorm, instead of relying on the number of hidden layers as a proxy?

We have several cases:

  1. The output state of an intermediate layer is used without layer normalization
  2. The output state of the last layer is used without layer normalization
  3. The output state of the last layer is used with layer normalization

Just from the config, we can only distinguish case (1) from cases (2) and (3). We could add another parameter to the vision encoder model to specify whether to load post_layernorm, which further distinguishes case (2) from case (3). Since the vision encoder is initialized by the VLM, this lets each VLM control whether to use post_layernorm.
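
As a concrete but hypothetical sketch of that extra parameter (here called require_post_norm; the name and config attributes are assumptions), defaulting to the existing layer-count heuristic so case (1) keeps working unchanged:

import torch.nn as nn

class VisionTowerSketch(nn.Module):
    """Illustrative only; the require_post_norm name is an assumption."""

    def __init__(self, config, num_hidden_layers_override=None,
                 require_post_norm=None):
        super().__init__()
        num_layers = num_hidden_layers_override or config.num_hidden_layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(config.hidden_size,
                                       config.num_attention_heads)
            for _ in range(num_layers)
        ])
        if require_post_norm is None:
            # Default heuristic: a truncated stack (case 1) never needs it.
            require_post_norm = num_layers == config.num_hidden_layers
        # The owning VLM passes require_post_norm explicitly to pick
        # between case (2) (False) and case (3) (True).
        self.post_layernorm = (nn.LayerNorm(config.hidden_size)
                               if require_post_norm else nn.Identity())

A LLaVA-OneVision implementation could then pass require_post_norm=False even though it builds the full stack (case 2), while VLMs that need the normalized output pass True (case 3).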

@ywang96
Member

ywang96 commented Sep 14, 2024

This seems to be getting more complicated than it should be. Perhaps we can keep an internal mapping of which VLMs' vision encoders actually use the post layernorm, instead of relying on the number of hidden layers as a proxy?

We have several cases:

  1. The output state of an intermediate layer is used without layer normalization
  2. The output state of the last layer is used without layer normalization
  3. The output state of the last layer is used with layer normalization

Just from the config, we can only distinguish case (1) from cases (2) and (3). We could add another parameter to the vision encoder model to specify whether to load post_layernorm, which further distinguishes case (2) from case (3). Since the vision encoder is initialized by the VLM, this lets each VLM control whether to use post_layernorm.

Yea, this is also what I meant by an internal mapping (to keep track of which models need post layernorm and which don't), or, as you suggested, it can just be an extra parameter in each model's implementation (though the user won't be able to change it from the config). Either way, this is something vLLM "maintains" instead of relying on the config file.
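
And, for completeness, a minimal sketch of the mapping variant; the entries below are illustrative placeholders rather than an audited list.

# Hypothetical table maintained inside vLLM, not read from the HF config:
# True means the VLM consumes the post-layernormed output of its vision tower.
_USES_POST_LAYERNORM = {
    "LlavaForConditionalGeneration": False,            # placeholder value
    "LlavaOnevisionForConditionalGeneration": False,   # placeholder value
}

def vision_tower_needs_post_layernorm(architecture: str) -> bool:
    # Unknown architectures could fall back to the layer-count heuristic;
    # this sketch simply defaults to False.
    return _USES_POST_LAYERNORM.get(architecture, False)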

@litianjian
Contributor

This seems to be getting more complicated than it should be. Perhaps we can keep an internal mapping of which VLMs' vision encoders actually use the post layernorm, instead of relying on the number of hidden layers as a proxy?

We have several cases:

  1. The output state of an intermediate layer is used without layer normalization
  2. The output state of the last layer is used without layer normalization
  3. The output state of the last layer is used with layer normalization

Just from the config, we can only distinguish case (1) from cases (2) and (3). We could add another parameter to the vision encoder model to specify whether to load post_layernorm, which further distinguishes case (2) from case (3). Since the vision encoder is initialized by the VLM, this lets each VLM control whether to use post_layernorm.

Yea, this is also what I meant by an internal mapping (to keep track of which models need post layernorm and which don't), or, as you suggested, it can just be an extra parameter in each model's implementation (though the user won't be able to change it from the config). Either way, this is something vLLM "maintains" instead of relying on the config file.

Could you please update me on the progress? This will trigger bugs in LLaVA models with multi-video/multi-image inputs.

@DarkLight1337
Member Author

DarkLight1337 commented Oct 9, 2024

This seems to be getting more complicated than it should be. Perhaps we can keep an internal mapping of which VLMs' vision encoders actually use the post layernorm, instead of relying on the number of hidden layers as a proxy?

We have several cases:

  1. The output state of an intermediate layer is used without layer normalization
  2. The output state of the last layer is used without layer normalization
  3. The output state of the last layer is used with layer normalization

Just from the config, we can only distinguish case (1) from cases (2) and (3). We could add another parameter to the vision encoder model to specify whether to load post_layernorm, which further distinguishes case (2) from case (3). Since the vision encoder is initialized by the VLM, this lets each VLM control whether to use post_layernorm.

Yea, this is also what I meant by an internal mapping (to keep track of which models need post layernorm and which don't), or, as you suggested, it can just be an extra parameter in each model's implementation (though the user won't be able to change it from the config). Either way, this is something vLLM "maintains" instead of relying on the config file.

Could you please update me on the progress? This will trigger bugs in LLaVA models with multi-video/multi-image inputs.

I thought this should be fixed already, but sorry, I just checked and it's not solved yet.

@DarkLight1337
Member Author

Can you open a new issue so it can be easily tracked? Thanks, and sorry for the delay; we have been quite swamped lately.

@litianjian
Contributor

Can you open a new issue so it can be easily tracked? Thanks, and sorry for the delay; we have been quite swamped lately.

Sure. Thanks.

@litianjian
Contributor

Can you open a new issue so it can be easily tracked? Thanks, and sorry for the delay; we have been quite swamped lately.

FYI. We can use #9186 to track it.

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025