[High Risk]Support for immediate saving #965
base: main
Conversation
Signed-off-by: Kaihui-intel <[email protected]>
for more information, see https://pre-commit.ci
Thanks for the great work! Could you check the maximum RAM usage to see whether it has been reduced significantly, as expected?
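For reference, peak RAM can also be sampled from inside the process with the standard library (the thread below uses the external `mprof` tool instead). A minimal sketch; the helper name is ours, not part of auto-round:

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Peak resident set size of this process, in MiB.

    Note: ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024

# Call after quantization finishes to read the high-water mark:
print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

Unlike `mprof`, this only reports the high-water mark of the current process, so it misses memory held by subprocesses.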
… into kaihui/save_block
auto_round/compressors/base.py
Outdated
- self.is_packing_immediate = False  # whether to pack the layer immediately after tuning
+ # Whether to pack the layer immediately after tuning
+ self.is_packing_immediate = kwargs.pop("is_packing_immediate", False)
Immediate packing was set automatically before. Have you handled the case of exporting more than one format? It is better not to set this in the API. Besides, as discussed, set save_immediate to True.
Another thing to verify is the time cost of save_immediate: have you measured the total quantization time compared with the main branch?
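For the timing comparison against main, a small wall-clock helper is enough. A sketch; the commented call is an assumption standing in for the real auto-round API:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print the wall-clock duration of the enclosed block."""
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Usage (treat the exact call signature as an assumption):
# with timed("quantize+save"):
#     compressor.quantize_and_save(output_dir)
```

Running the same block on both branches gives a like-for-like end-to-end number that includes saving, not just tuning.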
auto_round/compressors/base.py
Outdated
- self.is_packing_immediate = False  # whether to pack the layer immediately after tuning
+ # Whether to pack the layer immediately after tuning
+ self.immediate_packing = kwargs.pop("immediate_packing", False)
No need to expose this arg; setting it automatically is a better way.
auto_round/compressors/base.py
Outdated
  q_layer_input = to_device(q_layer_input, self.cache_device)
  quant_layer(layer_name, layer_input, q_layer_input, device=self.device)
+ if self.immediate_packing:
+     from auto_round.export import PACKING_LAYER_WITH_FORMAT
Wrap it in a function.
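The suggested refactor (moving the inline packing into a helper) could look roughly like the sketch below. The registry and helper here are hypothetical stand-ins, not the actual `auto_round.export.PACKING_LAYER_WITH_FORMAT` API:

```python
from typing import Callable, Dict

# Hypothetical stand-in for auto_round.export.PACKING_LAYER_WITH_FORMAT:
# a registry mapping a format key to the function that packs one layer.
PACKING_LAYER_WITH_FORMAT: Dict[str, Callable[[str], str]] = {
    "auto_round": lambda name: f"packed {name} as auto_round",
    "gptq": lambda name: f"packed {name} as gptq",
}

def immediate_pack(layer_name: str, fmt: str) -> str:
    """Pack a single tuned layer right away, dispatching on the target format."""
    for key, packer in PACKING_LAYER_WITH_FORMAT.items():
        if key in fmt:  # substring match, e.g. "auto_round:auto_awq" matches "auto_round"
            return packer(layer_name)
    raise ValueError(f"unsupported format for immediate packing: {fmt}")
```

Factoring this out keeps the tuning loop free of export imports and gives one place to handle unsupported formats.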
auto_round/compressors/base.py
Outdated
- self.is_packing_immediate = False  # whether to pack the layer immediately after tuning
+ # Whether to pack the layer immediately after tuning
+ self.immediate_packing = kwargs.pop("immediate_packing", False)
Where is the code that sets it to False for the fake format or for multiple formats?
auto-round/auto_round/compressors/base.py
Lines 1546 to 1562 in 5375be6
if not hasattr(self, "formats"):
    logger.warning("this API is deprecated, please use `quantize_and_save` instead")
else:
    # Determine if immediate packing is required
    formats = self.formats
    if (
        len(formats) == 1
        and (
            "awq" in formats[0]
            or "gptq" in formats[0]
            or "auto_round" in formats[0]
            or "gguf" in formats[0]
            or "llm_compressor" in formats[0]
        )
        and self.inplace
    ):
        self.immediate_packing = True
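The quoted check can be read as a standalone predicate. A sketch (the function and constant names are illustrative; the condition mirrors the snippet above):

```python
# Backends for which a tuned layer can be packed immediately.
PACKABLE_FORMATS = ("awq", "gptq", "auto_round", "gguf", "llm_compressor")

def should_pack_immediately(formats: list[str], inplace: bool) -> bool:
    """Immediate packing applies only when exactly one format is requested,
    that format is one of the packable backends, and tuning runs in place."""
    return (
        len(formats) == 1
        and any(f in formats[0] for f in PACKABLE_FORMATS)
        and inplace
    )
```

Fake formats and multi-format exports therefore fall through to the regular deferred packing path, which is what the review comment above is probing.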
Accuracy
#788

Memory
memory check: Qwen2.5-7B-Instruct-w4g32, RTN, auto_round
mprof peak: 16659.441 MiB -> 9200.250 MiB (~55%)

Time
quantization and saving time: opt-125m, Qwen2.5-7B-Instruct

Immediate packing and saving currently only supports formats[0].