Conversation

wnma3mz (Contributor) commented Sep 3, 2024

No issue has been filed for this yet, but it is a known bug.

  • What is the bug?

When SigLIP acts as a multimodal vision encoder, post_layernorm should not be applied; applying it produces unexpected output.

  • What is the fix?

Comment out post_layernorm.

  • Why was the original implementation wrong?

In the LLaVA-NeXT implementation, post_layernorm is not used; the model consumes encoder_outputs directly:

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/48890b0cb5da882ab584689244e74802ddbd4f75/llava/model/multimodal_encoder/siglip_encoder.py#L576-L587
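
For context, a minimal sketch (not the vLLM or LLaVA-NeXT code; all names here are illustrative) of the consumption pattern the linked code shows: the encoder's intermediate hidden states are used directly, and post_layernorm is never applied to the selected features.

```python
import torch
import torch.nn as nn

class ToySigLipTower(nn.Module):
    """Toy stand-in for a SigLIP vision tower; sizes are illustrative."""

    def __init__(self, hidden_size=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)])
        self.post_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Collect every intermediate hidden state, as the HF encoder does
        # with output_hidden_states=True.
        hidden_states = []
        for layer in self.layers:
            x = layer(x)
            hidden_states.append(x)
        return hidden_states  # post_layernorm is never applied here

tower = ToySigLipTower()
# LLaVA-style feature selection: take the second-to-last hidden state.
features = tower(torch.randn(1, 16, 32))[-2]
```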

github-actions bot commented Sep 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which consists of a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge.

🚀

DarkLight1337 (Member) commented Sep 3, 2024

Can you add a test or modify the existing tests to be stricter so that this behaviour is checked? Thanks!

wnma3mz (Contributor, Author) commented Sep 3, 2024

> Can you add a test or modify the existing tests to be stricter so that this behaviour is checked? Thanks!

Sorry, I'm not sure how to add relevant test cases.

This is a small modification, and the tensor shape is the same before and after the change.

I don't know how to add a test for this change; can you give me some specific tips?

DarkLight1337 (Member) commented Sep 3, 2024

How did you discover this bug? You can add a test for the model that triggered it by updating the corresponding file under tests/models.

wnma3mz (Contributor, Author) commented Sep 3, 2024

> How did you discover this bug? You can add a test for the model that triggered it by updating the corresponding file under tests/models.

Thanks for your reply. I found this problem on a model I trained myself; there is no Hugging Face model that uses SigLIP as the vision encoder, so I couldn't test directly with LLaVA / LLaVA-NeXT.

Another option is to test the SigLIP model directly, but testing a vision model on its own is not currently supported. Maybe I should write a new test case that supports vision encoder testing?

DarkLight1337 (Member):

Ah I see, I thought it was a problem with one of our existing models. In that case there is no need to add the test; thanks for fixing!

DarkLight1337 enabled auto-merge (squash) September 3, 2024 11:07
github-actions bot added the ready label (ONLY add when PR is ready to merge / full CI is needed) Sep 3, 2024
wnma3mz (Contributor, Author) commented Sep 3, 2024

> Ah I see, I thought it was a problem with one of our existing models. In that case there is no need to add the test; thanks for fixing!

Thank you very much!🙏

DarkLight1337 (Member) left a review comment:


Hmm actually, I tried this locally and it broke the tests. The existing vision models expect post_layernorm to be used. I think you'll have to add an optional argument to the forward method so that post_layernorm can be skipped specifically in your case.

Edit: see the message below.

DarkLight1337 (Member) commented Sep 3, 2024

This is because PaliGemma uses the last hidden state, whereas LLaVA models use the second-to-last hidden state. Only the last hidden state in the transformers library is subject to post_layernorm. Therefore, the current implementation of CLIP in vLLM will also break if the last feature layer is selected (since post_layernorm isn't implemented there).

So, the real fix would be to apply post_layernorm for each visual encoder in vLLM only if all of the encoder layers are used.
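
A hedged sketch of that rule (num_hidden_layers_override mirrors the vLLM constructor argument; the other names, and the stand-in encoder, are illustrative rather than the real vision transformer):

```python
import torch
import torch.nn as nn

class VisionTransformerSketch(nn.Module):
    def __init__(self, hidden_size=32, num_hidden_layers_override=None):
        super().__init__()
        self.num_hidden_layers_override = num_hidden_layers_override
        self.encoder = nn.Linear(hidden_size, hidden_size)  # stand-in encoder
        self.post_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        hidden_state = self.encoder(x)
        # Apply post_layernorm only when the full encoder is used, matching
        # the semantics of last_hidden_state in transformers.
        if self.num_hidden_layers_override is None:
            hidden_state = self.post_layernorm(hidden_state)
        return hidden_state

# With an override (e.g. LLaVA's second-to-last layer), the norm is skipped.
out = VisionTransformerSketch(num_hidden_layers_override=23)(torch.randn(1, 16, 32))
```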

wnma3mz (Contributor, Author) commented Sep 3, 2024

> Hmm actually, I tried this locally and it broke the tests. The existing vision models expect post_layernorm to be used. I think you'll have to add an optional argument to the forward method so that post_layernorm can be skipped specifically in your case.

In the current LLaVA-based SigLIP model, post_layernorm is not used, just as the head is not used.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/48890b0cb5da882ab584689244e74802ddbd4f75/llava/model/multimodal_encoder/siglip_encoder.py#L576-L587

If necessary, I can add a parameter to make this selectable.

DarkLight1337 (Member) commented Sep 3, 2024

We have to fix both CLIP and SigLIP encoders:

  • CLIP should load post_layernorm if the last feature layer is selected. Currently, it always omits the layer.
  • SigLIP should omit post_layernorm if the last feature layer is selected. Currently, it always loads the layer.

wnma3mz (Contributor, Author) commented Sep 3, 2024

But post_layernorm does not appear in CLIPVisionTransformer at all. If needed, I could make similar modifications to CLIP.

wnma3mz (Contributor, Author) commented Sep 3, 2024

> This is because PaliGemma uses the last hidden state, whereas LLaVA models use the second-to-last hidden state. Only the last hidden state in the transformers library is subject to post_layernorm. Therefore, the current implementation of CLIP in vLLM will also break if the last feature layer is selected (since post_layernorm isn't implemented there).
>
> So, the real fix would be to apply post_layernorm for each visual encoder in vLLM only if all of the encoder layers are used.

Thank you for your patient response. I have revised my submission.

It now determines whether the last layer is used via self.num_hidden_layers_override: if the last layer is used (i.e. self.num_hidden_layers_override is None), post_layernorm is applied; otherwise, post_layernorm is skipped.

DarkLight1337 (Member):

Tests are failing. You need to update the weight loading logic as well.

wnma3mz (Contributor, Author) commented Sep 3, 2024

> Tests are failing. You need to update the weight loading logic as well.

Thanks! I have fixed the problem.

I added need_post_layernorm to control whether the post_layernorm weights are initialized and loaded.
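
A minimal sketch of what such a guard could look like (need_post_layernorm and num_hidden_layers_override follow the discussion above; everything else, including the simplified load_weights loop, is illustrative rather than the actual vLLM implementation):

```python
import torch
import torch.nn as nn

class SiglipVisionSketch(nn.Module):
    def __init__(self, hidden_size=32, num_hidden_layers_override=None):
        super().__init__()
        self.num_hidden_layers_override = num_hidden_layers_override
        self.proj = nn.Linear(hidden_size, hidden_size)
        # Only instantiate post_layernorm when it will actually be used.
        if self.need_post_layernorm:
            self.post_layernorm = nn.LayerNorm(hidden_size)

    @property
    def need_post_layernorm(self) -> bool:
        # The last hidden state is only requested when no override is given.
        return self.num_hidden_layers_override is None

    def load_weights(self, weights):
        params = dict(self.named_parameters())
        for name, loaded in weights:
            # Skip checkpoint tensors for the layer we never created.
            if "post_layernorm" in name and not self.need_post_layernorm:
                continue
            params[name].data.copy_(loaded)
```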

DarkLight1337 (Member) left a review comment:


Tests pass so I'm approving this, thanks for the fix!

DarkLight1337 merged commit d331156 into vllm-project:main Sep 4, 2024
wnma3mz (Contributor, Author) commented Sep 4, 2024

Thank you for your patient guidance!

litianjian (Contributor) commented:

> > This is because PaliGemma uses the last hidden state, whereas LLaVA models use the second-to-last hidden state. Only the last hidden state in the transformers library is subject to post_layernorm. Therefore, the current implementation of CLIP in vLLM will also break if the last feature layer is selected (since post_layernorm isn't implemented there).
> >
> > So, the real fix would be to apply post_layernorm for each visual encoder in vLLM only if all of the encoder layers are used.
>
> Thank you for your patient response. I have revised my submission.
>
> It now determines whether the last layer is used via self.num_hidden_layers_override: if the last layer is used (i.e. self.num_hidden_layers_override is None), post_layernorm is applied; otherwise, post_layernorm is skipped.

@wnma3mz @DarkLight1337 post_layer_norm is not used in the LLaVA model, but the changes in this PR may not solve that problem, because self.num_hidden_layers_override determines the number of encoder layers, not whether post_layer_norm is applied.

DarkLight1337 (Member):

> @wnma3mz @DarkLight1337 post_layer_norm is not used in the LLaVA model, but the changes in this PR may not solve that problem, because self.num_hidden_layers_override determines the number of encoder layers, not whether post_layer_norm is applied.

Please see #8155

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025