[bitsandbytes]: support read bnb pre-quantized model #5753
Conversation
(force-pushed 69decab to 95dabaa)
The changes from @thesues improve my previous work on QLoRA & BnB (#4776). It now supports models whose weights are published as bnb-quantized. It also cleans up my previous code and fixes a bug (the previous version would run into an error in scenarios such as GQA). The changes look good to me. Could you also take a look?
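For anyone trying this out, here is a minimal sketch of loading a bnb pre-quantized checkpoint through vLLM's Python API (the model id is illustrative, and the `quantization="bitsandbytes"` / `load_format="bitsandbytes"` arguments are my assumption of the options this PR wires up):

```python
from vllm import LLM, SamplingParams

# Hypothetical example: a checkpoint whose weights are already stored
# in bitsandbytes 4-bit (NF4) format on the Hugging Face Hub.
llm = LLM(
    model="unsloth/llama-3-8b-bnb-4bit",  # illustrative pre-quantized model id
    quantization="bitsandbytes",          # assumed quantization method name
    load_format="bitsandbytes",           # assumed load format used by this PR
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```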
I'm not fond of the term "prequant" here, could it be something along the lines of "quantized_checkpoint"?
sure
What isn't supported about it? It seems like no exception is being thrown
A typo here: it should be "only quant_state.bitsandbytes__nf4 is supported". Other libraries such as HF Transformers also support quant_state.bitsandbytes__fp4.
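For context, bitsandbytes records the 4-bit data type in the suffix of the serialized quant-state key, so a checkpoint can be inspected to tell NF4 from FP4 before loading. A rough sketch (the key naming follows the usual bitsandbytes Params4bit serialization convention; the helper itself is hypothetical):

```python
from safetensors import safe_open

# Hypothetical helper: scan one checkpoint shard and report whether its
# 4-bit quant state was saved as NF4 or FP4, based on the key suffix
# bitsandbytes uses when serializing Params4bit quant_state.
def detect_bnb_4bit_type(shard_path: str):
    with safe_open(shard_path, framework="pt") as f:
        for key in f.keys():
            if key.endswith("quant_state.bitsandbytes__nf4"):
                return "nf4"
            if key.endswith("quant_state.bitsandbytes__fp4"):
                return "fp4"
    return None

print(detect_bnb_4bit_type("model-00001-of-00002.safetensors"))
```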
Ditto on pre_quant, it should be talking about the checkpoint being quantized
Appreciate the improvements and the ability to natively load! I think it would be great to follow up with a documentation page in the quantization section showing how to deploy bnb models directly in vLLM, perhaps straight from a quick finetune in unsloth.
(force-pushed fb0a6d2 to b7c3aae)
Sure, I added a very simple bnb.rst in docs.
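For readers following along, here is a minimal sketch of the kind of example such a docs page might show, in this case in-flight quantization of an unquantized checkpoint (the model id and argument names are illustrative assumptions, not quoted from the added bnb.rst):

```python
from vllm import LLM

# Hypothetical docs example: quantize an unquantized FP16/BF16 checkpoint
# with bitsandbytes on the fly while loading it into vLLM.
llm = LLM(
    model="huggyllama/llama-7b",   # illustrative unquantized model
    dtype="bfloat16",
    quantization="bitsandbytes",   # assumed option name
    load_format="bitsandbytes",    # assumed option name
)
```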
@mgoin can you review this version?
docs/source/quantization/bnb.rst (outdated)
Redundant comma.
Is this PR good to merge now? It would be really great if we could get it in before the next scheduled release (#6433)!
(force-pushed 8e891bb to bb014ef)
SGTM
Is there anything I could do to improve this patch?
Any ETA for this feature to be merged? Really keen to use it!
Thanks for pinging, LGTM with a small docs fix
As I mentioned earlier, looks good to me!
@thesues do you think you could resolve the new conflicts?
I've just tried installing from this PR's branch and testing it out, and this solution worked for me! It used to produce issues like: ... but now model loading seems to work great:
It would be great if someone could validate that the new Llama 3.1 8B BNB checkpoint loads: https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4
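For anyone else verifying, a quick smoke test along these lines should be enough to confirm the checkpoint loads and generates (argument names assumed as in the earlier sketch, not confirmed against this branch):

```python
from vllm import LLM, SamplingParams

# Smoke test: load the pre-quantized NF4 checkpoint linked above and
# generate a short completion.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4",
    quantization="bitsandbytes",   # assumed option name
    load_format="bitsandbytes",    # assumed option name
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```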
Meta-Llama-3.1-8B-Instruct-BNB-NF4 works as expected. Logs:
…el/omost-llama-3-8b-4bits
Co-authored-by: Michael Goin <[email protected]>
Done.
Thanks a lot for testing further and sticking with this, LGTM!
Co-authored-by: Michael Goin <[email protected]> Signed-off-by: Alvant <[email protected]>
Co-authored-by: Michael Goin <[email protected]> Signed-off-by: LeiWang1999 <[email protected]>
There are bitsandbytes pre-quantized models on Hugging Face, such as:
...
This PR adds support for loading these pre-quantized models in vLLM.
@chenqianfzh @Yard1 @jeejeelee