[Feature]: Output state configuration of vision encoder In VLM

### Anything you want to discuss about vllm.

When SigLIP or CLIP is used as the vision encoder of a multimodal model, the output state that gets passed on falls into one of several cases:
- The output state of an intermediate layer is used without layer normalization
- The output state of the last layer is used without layer normalization
- The output state of the last layer is used with layer normalization

For example, in the `LLaVA-Next` code implementation, `post_layernorm` is not used.
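To make the request concrete, here is a minimal sketch of how the three cases above could be covered by two settings. It is a toy in pure Python, not vLLM or Hugging Face code: `VisionOutputConfig`, `feature_layer`, `use_post_layernorm`, and `encode` are hypothetical names chosen for illustration, and the "layers" are simple stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class VisionOutputConfig:
    # Hypothetical option names, used only for this illustration.
    feature_layer: int = -1          # index into the hidden states; -1 is the last layer
    use_post_layernorm: bool = True  # whether to normalize the selected state


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Plain layer normalization without learned scale or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def encode(x: Vector,
           layers: Sequence[Callable[[Vector], Vector]],
           config: VisionOutputConfig) -> Vector:
    """Run all layers, then pick the configured output state."""
    hidden_states = [x]  # index 0 holds the input embeddings
    for layer in layers:
        hidden_states.append(layer(hidden_states[-1]))
    selected = hidden_states[config.feature_layer]
    if config.use_post_layernorm:
        selected = layer_norm(selected)
    return selected


# Two stand-in "encoder layers" so the three cases can be compared.
toy_layers = [
    lambda h: [2.0 * v for v in h],
    lambda h: [v + 1.0 for v in h],
]
x = [1.0, 2.0, 3.0]

# Case 1: intermediate layer, no layer normalization.
intermediate = encode(x, toy_layers, VisionOutputConfig(-2, False))
# Case 2: last layer, no layer normalization.
last_raw = encode(x, toy_layers, VisionOutputConfig(-1, False))
# Case 3: last layer, with layer normalization.
last_normed = encode(x, toy_layers, VisionOutputConfig(-1, True))
```

With something along these lines, each model definition would only need to state which layer it reads from and whether `post_layernorm` applies, instead of hard-coding one behavior in the shared encoder.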

Related: #8106, #8155

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.


[Feature]: Output state configuration of vision encoder In VLM #9186
