
Conversation

@fahadh4ilyas
Contributor

Because quant_config is gone when you load a model using from_quantized, I tried to re-add the quant_config here so that when we call prepare_for_inference on a loaded quantized model, it will not crash because quant_config is not found.

@mobicham
Collaborator

Thanks a lot for the effort @fahadh4ilyas !

That is correct. As a temporary solution, there's this patching function that adds a quant_config: https://github.com/mobiusml/hqq/blob/master/hqq/utils/patching.py#L29

There's an easy way to do this, without needing a separate json:

  • Add self.quant_config to the dict returned by HQQLinear.state_dict()
  • In load_state_dict, simply do self.quant_config = state_dict['quant_config'] (see the sketch below)
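
A minimal sketch of that idea (the class name and the exact overrides here are illustrative, not the actual hqq implementation):

import torch

class QuantConfigCarrier(torch.nn.Module):
    #Illustrative only: carries a plain-dict quant_config through state_dict round-trips
    def __init__(self, quant_config=None):
        super().__init__()
        self.quant_config = quant_config

    def state_dict(self, *args, **kwargs):
        out = super().state_dict(*args, **kwargs)
        out['quant_config'] = self.quant_config  #non-tensor entry
        return out

    def load_state_dict(self, state_dict, strict=True):
        state_dict = dict(state_dict)  #shallow copy so the caller's dict is untouched
        self.quant_config = state_dict.pop('quant_config', None)
        return super().load_state_dict(state_dict, strict=strict)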

However, this is going in a different direction; I will explain below.

Current Direction

I am currently refactoring the whole serialization logic to make it compatible with safetensors. The goal is to be able to directly save/load HQQ-quantized models with HF transformers.
Safetensors has many limitations: we can only put torch.Tensor as values, and nested dictionaries are not allowed. So we can't just put quant_config directly in the state_dict.

For the moment, I added support for quant_config loading from a safetensors-compatible state_dict, but it doesn't support quantized scale/zero just yet: https://github.com/mobiusml/hqq/blob/master/hqq/core/quantize.py#L569-L601

The way it works right now is that state_dict supports both the old format and a new encoded format that encodes anything that is not a torch.Tensor into a torch.Tensor, controlled by the flag self.encoded_state_dict, which now defaults to True.
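
For illustration, here is one way a non-tensor value can be round-tripped through a uint8 tensor. This is only a sketch of the general idea, not necessarily the exact encoding hqq uses, and the helper names are made up:

import json
import torch

def encode_value(value):
    #Pack any JSON-serializable value into a uint8 tensor so safetensors accepts it
    raw = json.dumps(value).encode('utf-8')
    return torch.tensor(list(raw), dtype=torch.uint8)

def decode_value(tensor):
    #Reverse of encode_value: bytes -> JSON -> original Python object
    return json.loads(bytes(tensor.tolist()).decode('utf-8'))

print(decode_value(encode_value({'nbits': 4, 'group_size': 64})))  #{'nbits': 4, 'group_size': 64}

The full example below quantizes a model, round-trips a single layer through safetensors, and then saves/loads the whole model with the hqq lib: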

import torch
compute_dtype = torch.float16
device = 'cuda:0'

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B', torch_dtype=compute_dtype, cache_dir='/nas/hicham/tmp/')

#Quantize
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
from hqq.models.hf.base import AutoHQQHFModel
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, quant_scale=False, quant_zero=False, axis=1, offload_meta=False) 
model = AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)

##########################################
#Safetensors save/load layer check
from safetensors import safe_open
from safetensors.torch import save_file

_state_dict = model.model.layers[0].self_attn.q_proj.state_dict()
save_file(_state_dict, "layer.safetensors")

_state_dict_loaded = {}
with safe_open("layer.safetensors", framework="pt") as f:
    for key in f.keys():
        _state_dict_loaded[key] = f.get_tensor(key)
#######################################
#Model save/load check (with hqq lib)

AutoHQQHFModel.save_quantized(model, 'llama3-hqq')
model_loaded = AutoHQQHFModel.from_quantized("llama3-hqq")

#quant_config loaded
print(model_loaded.model.layers[0].self_attn.q_proj.quant_config)

The next step is to use this logic to save/load HQQ-quantized models with HF transformers. Then we can get back to supporting quantized scale/zero.

Happy to hear suggestions from you regarding this!

@fahadh4ilyas
Contributor Author

Doesn't safetensors support metadata? How about putting the meta and quant_config inside the metadata?

@mobicham
Collaborator

Yeah I thought about it, but it will make things even more complicated, since it will require more work on the transformers lib side. Putting everything in state_dict simplifies the process since iterating is much quicker and we have more freedom.

@fahadh4ilyas
Contributor Author

> Yeah I thought about it, but it will make things even more complicated, since it will require more work on the transformers lib side. Putting everything in state_dict simplifies the process since iterating is much quicker and we have more freedom.

What do you mean by "will require more work on the transformers"? The current save_quantized approach doesn't require changing transformers.

By using metadata, we would only split the current state_dict into two parts: "tensors" and "non-tensors". Later, when we want to load the weights, we just combine the split dictionaries, right?
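
For illustration, a minimal sketch of that split, assuming quant_config is JSON-serializable (the tensor key and config values are placeholders, not hqq's actual layout):

import json
import torch
from safetensors import safe_open
from safetensors.torch import save_file

#Save: tensors go into the file body, quant_config goes into the (str -> str) metadata
tensors = {'W_q': torch.zeros(8, 8, dtype=torch.uint8)}
quant_config = {'nbits': 4, 'group_size': 64}
save_file(tensors, 'layer.safetensors', metadata={'quant_config': json.dumps(quant_config)})

#Load: read tensors and metadata back and combine them into one dict
with safe_open('layer.safetensors', framework='pt') as f:
    state_dict = {key: f.get_tensor(key) for key in f.keys()}
    state_dict['quant_config'] = json.loads(f.metadata()['quant_config'])
print(state_dict['quant_config'])  #{'nbits': 4, 'group_size': 64}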

@mobicham
Collaborator

hqq's save_quantized wouldn't require changes in transformers, that's correct, but the goal is to have official serialization support in HF transformers directly, so we would be able to save models via save_pretrained, not just via the hqq lib.

I am trying to figure out the right way of doing this with @SunMarc.

For now, the logic is working just fine with state_dict: we can save the whole model as a safetensors file, and the hqq lib's save_quantized also works with the same state_dict format, which is good. The only limitation is that meta-data offloading / quantized scale-zero is not supported. Since the fast backends don't support meta-data offloading / quantized scale-zero anyway, I was even thinking of completely dropping support for it, because we can't use it for fast inference.
So from 0.2.0 on, we keep things simple and only have floating-point scale/zero on the same device.

@mobicham
Collaborator

I also tried loading a model saved with the previous version (https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq) and it worked without any issue, which is good news for backward compatibility.
Now we just need to see how HQQLinear.load_state_dict behaves when used inside HF transformers.

@mobicham
Collaborator

Draft pull request here: huggingface/transformers#32056

@mobicham
Collaborator

Closing this since we are very close to full transformers serialization support: huggingface/transformers#33141
