@Kaihui-intel commented Oct 30, 2025

Accuracy

| scheme (opt-125m) | format | RTN | iter>0 |
| --- | --- | --- | --- |
| W4A16 | auto_round | 0.2882 | 0.3526 |
| W2A16 | auto_round | | 0.1657 |
| W3A16 | auto_round | | 0.3247 |
| W8A16 | auto_round | | 0.3784 |
| bits, group_size 32 | auto_round | 0.3749 | 0.3679 |
| bits, group_size 32 | auto_gptq | 0.3747 | 0.3658 |
| bits, group_size 32 | auto_awq | 0.3749 | 0.3646 |

#788

Memory

Memory check with mprof (peak RSS) on Qwen2.5-7B-Instruct, W4 group_size 32, RTN, auto_round format: the peak drops from 16659.441 MiB to 9200.250 MiB, i.e. to roughly 55% of the previous peak (about a 45% reduction).
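For reference, a minimal sketch of how this peak-memory number could be reproduced with memory_profiler (the library behind mprof); run_quantization is a hypothetical placeholder for the actual load-and-quantize call, not code from this PR.

# Minimal sketch: measure the peak RSS of a quantization run with memory_profiler.
# `run_quantization` is a hypothetical placeholder, not code from this repository.
from memory_profiler import memory_usage

def run_quantization():
    ...  # e.g. load Qwen2.5-7B-Instruct and quantize with W4, group_size 32, RTN

peak_mib = memory_usage((run_quantization, (), {}), max_usage=True, include_children=True)
print("peak memory (MiB):", peak_mib)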

Time

Quantization + saving time.

opt-125m

| branch | RTN | iter>0 |
| --- | --- | --- |
| current branch | 67 s | 78 s |
| main branch | 50 s | 88 s |

Qwen2.5-7B-Instruct

| branch | RTN | iter>0 |
| --- | --- | --- |
| current branch | 4 min 4 s | 18 min 10 s |
| main branch | 3 min 54 s | 17 min 53 s |

Note: immediate packing and saving currently only support formats[0] (the first requested export format).
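For context, a sketch of the calling pattern this refers to, using standard auto_round usage; the parameter choices mirror the tables above and are illustrative only, not taken from this PR.

# Sketch only: immediate packing/saving applies to the first (here, the only) requested
# export format; per the discussion below, other cases fall back to the regular export path.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ar = AutoRound(model, tokenizer, bits=4, group_size=32, iters=0)  # iters=0 -> RTN
ar.quantize_and_save("./opt-125m-w4g32", format="auto_round")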

Kaihui-intel and others added 8 commits October 16, 2025 01:41
@wenhuach21 commented:

Thanks for the great work! Could you check the maximum RAM usage to see whether it has been reduced significantly, as expected?

@xin3he modified the milestones: 1.0, 0.9.0 (Oct 30, 2025)
@wenhuach21 changed the title from "Support for immediate saving" to "[High Risk] Support for immediate saving" (Oct 31, 2025)
self.is_packing_immediate = False # whether to pack the layer immediately after tuning

# Whether to pack the layer immediately after tuning
self.is_packing_immediate = kwargs.pop("is_packing_immediate", False)
@wenhuach21 commented Oct 31, 2025:
Immediate packing used to be set automatically. Have you handled the case of exporting more than one format? It would be better not to set it through the API. Besides, as discussed, set save_immediate to True.
Another thing to verify is the time cost of save_immediate: have you measured the total quantization time compared to the main branch?

self.is_packing_immediate = False # whether to pack the layer immediately after tuning

# Whether to pack the layer immediately after tuning
self.immediate_packing = kwargs.pop("immediate_packing", False)
A contributor commented:
There is no need to expose this argument; setting it automatically is the better approach.

q_layer_input = to_device(q_layer_input, self.cache_device)
quant_layer(layer_name, layer_input, q_layer_input, device=self.device)
if self.immediate_packing:
from auto_round.export import PACKING_LAYER_WITH_FORMAT
A contributor commented:
Wrap this in a function.
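As an illustration, one possible shape for such a helper. This is a sketch only: the name _immediate_pack, the assumption that PACKING_LAYER_WITH_FORMAT maps a backend prefix (e.g. "auto_round", "auto_gptq") to a packing callable, and that callable's signature are guesses, not the PR's implementation.

def _immediate_pack(self, layer_name: str) -> None:
    """Pack a single layer right after it has been quantized (sketch only)."""
    from auto_round.export import PACKING_LAYER_WITH_FORMAT

    # Immediate packing currently only supports the first requested format.
    target_format = self.formats[0]
    packing_fn = PACKING_LAYER_WITH_FORMAT[target_format.split(":")[0]]  # assumed mapping
    packing_fn(layer_name, self.model, target_format)  # assumed signature

# The call site in the quantization loop then shrinks to:
#     if self.immediate_packing:
#         self._immediate_pack(layer_name)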

self.is_packing_immediate = False # whether to pack the layer immediately after tuning

# Whether to pack the layer immediately after tuning
self.immediate_packing = kwargs.pop("immediate_packing", False)
A contributor commented:
Where is the code that sets it to False for the fake format or for multiple formats?

@Kaihui-intel (author) replied:

if not hasattr(self, "formats"):
    logger.warning("this API is deprecated, please use `quantize_and_save` instead")
else:
    # Determine if immediate packing is required
    formats = self.formats
    if (
        len(formats) == 1
        and (
            "awq" in formats[0]
            or "gptq" in formats[0]
            or "auto_round" in formats[0]
            or "gguf" in formats[0]
            or "llm_compressor" in formats[0]
        )
        and self.inplace
    ):
        self.immediate_packing = True
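To make the behaviour concrete: the flag defaults to False and only flips to True for a single supported format with inplace=True, so the fake format or multiple formats leave it at False. A standalone restatement of the condition, for illustration only:

# Standalone restatement of the condition above; illustrative, not the PR code.
def needs_immediate_packing(formats, inplace=True):
    supported = ("awq", "gptq", "auto_round", "gguf", "llm_compressor")
    return len(formats) == 1 and any(key in formats[0] for key in supported) and inplace

assert needs_immediate_packing(["auto_round"])
assert not needs_immediate_packing(["fake"])                     # fake format: stays False
assert not needs_immediate_packing(["auto_round", "auto_gptq"])  # multiple formats: stays False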
