[Core] Refactor `QKVCrossParallelLinear` implementation to support BNB 4-bit quantization #14545

Isotr0py · 2025-03-10T08:55:18Z

Revert [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work #14498 and add BNB support to QKVCrossParallelLinear

cc @NickLucche

Signed-off-by: Isotr0py <[email protected]>

github-actions · 2025-03-10T08:55:29Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

NickLucche

Thanks a lot for this!

I do agree the solution is a bit involved rn, but perhaps we can find a way to simplify it a bit..
Apart from the comments I left, I would also:

add a test for unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit
make sure VLLM_ATTENTION_BACKEND=XFORMERS python -m pytest -v tests/models/encoder_decoder/language/test_bart.py is also still working (not the case for me locally rn)

vllm/model_executor/layers/linear.py

Signed-off-by: Isotr0py <[email protected]>

Isotr0py · 2025-03-10T09:43:39Z

add a test for unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit

Mllama is quite large for testing, and we won't test it by running inference on CI. I think testing it on whisper or bart might be a better selection.

Signed-off-by: Isotr0py <[email protected]>

NickLucche · 2025-03-10T09:46:53Z

Sure but we already have tests for it, though they're guarded by the 48gb requirement so they won't run on L4, still it is useful to run them locally with a single command.

Signed-off-by: Isotr0py <[email protected]>

vllm/model_executor/layers/linear.py

Signed-off-by: Isotr0py <[email protected]>

jeejeelee

LGTM, also cc @mgoin

…B 4-bit quantization (vllm-project#14545) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Richard Liu <[email protected]>

gshtras · 2025-03-21T22:53:21Z

This PR has a regression, breaking support for amd/Llama-3.2-11B-Vision-Instruct-FP8-KV quantized models. @Isotr0py

Isotr0py · 2025-03-22T04:44:28Z

This PR has a regression, breaking support for amd/Llama-3.2-11B-Vision-Instruct-FP8-KV quantized models.

Oh, fp8 need to call process_weights_after_loading, let me think a method to handle this for QKVCrossParallelLinear.

…B 4-bit quantization (vllm-project#14545) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>

…B 4-bit quantization (vllm-project#14545) Signed-off-by: Isotr0py <[email protected]>

…B 4-bit quantization (vllm-project#14545) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Mu Huai <[email protected]>

Isotr0py added 6 commits March 9, 2025 23:46

revert mllama and init x-qkv refactor

73ec79d

Signed-off-by: Isotr0py <[email protected]>

fix

288b3a9

Signed-off-by: Isotr0py <[email protected]>

it just work for bnb

8ef9fc1

Signed-off-by: Isotr0py <[email protected]>

refactor

4faab84

Signed-off-by: Isotr0py <[email protected]>

add doc string

2073544

Signed-off-by: Isotr0py <[email protected]>

Merge branch 'vllm-project:main' into refactor-x-qkv

9e8ae31

Isotr0py requested review from jeejeelee and mgoin March 10, 2025 08:56

NickLucche requested changes Mar 10, 2025

View reviewed changes

vllm/model_executor/layers/linear.py Outdated Show resolved Hide resolved

vllm/model_executor/layers/linear.py Outdated Show resolved Hide resolved

vllm/model_executor/layers/linear.py Outdated Show resolved Hide resolved

vllm/model_executor/layers/linear.py Outdated Show resolved Hide resolved

make mypy happy

5f8b1ce

Signed-off-by: Isotr0py <[email protected]>

fix typo

9772f06

Signed-off-by: Isotr0py <[email protected]>

lints

3b75f93

Signed-off-by: Isotr0py <[email protected]>

jeejeelee reviewed Mar 10, 2025

View reviewed changes

vllm/model_executor/layers/linear.py Show resolved Hide resolved

Isotr0py added 2 commits March 11, 2025 00:12

add extra_repr

a479343

Signed-off-by: Isotr0py <[email protected]>

add mllama bnb test

e5feabe

Signed-off-by: Isotr0py <[email protected]>

Isotr0py requested review from DarkLight1337 and ywang96 as code owners March 11, 2025 09:17

fix bias attrs

0622685

Signed-off-by: Isotr0py <[email protected]>

jeejeelee approved these changes Mar 11, 2025

View reviewed changes

DarkLight1337 enabled auto-merge (squash) March 11, 2025 15:02

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 11, 2025

vllm-bot merged commit e392d85 into vllm-project:main Mar 12, 2025
46 of 48 checks passed

Isotr0py deleted the refactor-x-qkv branch March 12, 2025 03:40

Isotr0py mentioned this pull request Mar 22, 2025

[Bugfix] Handle process_weights_after_loading for QKVCrossParallelLinear #15328

Merged

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

[Core] Refactor QKVCrossParallelLinear implementation to support BN…

465339c

…B 4-bit quantization (vllm-project#14545) Signed-off-by: Isotr0py <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Core] Refactor `QKVCrossParallelLinear` implementation to support BNB 4-bit quantization #14545

[Core] Refactor `QKVCrossParallelLinear` implementation to support BNB 4-bit quantization #14545

Uh oh!

Isotr0py commented Mar 10, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Mar 10, 2025

Uh oh!

NickLucche left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Isotr0py commented Mar 10, 2025 •

edited

Loading

Uh oh!

NickLucche commented Mar 10, 2025

Uh oh!

Uh oh!

jeejeelee left a comment

Uh oh!

Uh oh!

gshtras commented Mar 21, 2025

Uh oh!

Isotr0py commented Mar 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

[Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization #14545

[Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization #14545

Uh oh!

Conversation

Isotr0py commented Mar 10, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 10, 2025

Uh oh!

NickLucche left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Isotr0py commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NickLucche commented Mar 10, 2025

Uh oh!

Uh oh!

jeejeelee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gshtras commented Mar 21, 2025

Uh oh!

Isotr0py commented Mar 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[Core] Refactor `QKVCrossParallelLinear` implementation to support BNB 4-bit quantization #14545

[Core] Refactor `QKVCrossParallelLinear` implementation to support BNB 4-bit quantization #14545

Isotr0py commented Mar 10, 2025 •

edited by github-actions bot

Loading

NickLucche left a comment •

edited

Loading

Isotr0py commented Mar 10, 2025 •

edited

Loading