[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2
#5874
Conversation
ee060a6 to 6e71226 (Compare)
...el_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_unquantized.py (outdated, resolved)
moving comments down
This is much better. I still think we have too much tied logic between […]; I think the following two changes would make a better interface: remove […]
BasevLLMParameter and weight_loader_v2
A couple of nits, but LGTM.
1263b08 to bd2ee38 (Compare)
comaniac left a comment:
Sorry for the late review. Overall LGTM, so approving to unblock this PR and the follow-up tasks.
...el_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_unquantized.py (resolved)
...model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_24.py (resolved)
...odel_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py (resolved)
...odel_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py (resolved)
...model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py (resolved)
    # WEIGHT SCALE
    layer_kwargs = {"weight_loader": weight_loader}
    # TODO: update create_xxx_parameter functions to return
Is this still a TODO?
Yes. We're not using the create_xxx_parameter methods here because they are used in places outside of compressed_tensors (e.g. fp8). As a follow-up, once we've updated the other quantization methods to use these new parameters, we can update the create_xxx_parameter functions to return vLLMParameters; they currently return torch.nn.Parameters.
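For context, a rough sketch of the follow-up being described, assuming a hypothetical create_per_tensor_scale_parameter helper; the real helper names, import paths, and signatures in vLLM may differ:

```python
import torch
from torch.nn import Parameter

# Today (illustrative): the create_xxx_parameter helpers hand back a plain
# torch.nn.Parameter, and the weight_loader is attached as an ad-hoc attribute.
def create_per_tensor_scale_parameter(output_partition_sizes: list,
                                      weight_loader) -> Parameter:
    scale = Parameter(torch.empty(len(output_partition_sizes),
                                  dtype=torch.float32),
                      requires_grad=False)
    scale.weight_loader = weight_loader
    return scale

# Follow-up (per the comment above, an assumption about the shape of the
# change): have these helpers construct and return PerTensorScaleParameter /
# other vLLMParameter subclasses instead, so callers such as fp8 pick up the
# new weight_loader_v2 loading path automatically.
```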
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py (resolved)
-   channelwise = (self.group_size == -1)
-   group_size = input_size if channelwise else self.group_size
+   channelwise = self.group_size == -1
+   group_size = self.group_size if self.group_size != -1 else input_size
Why change this?
The second form is just clearer about what group_size is and why.
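As a concrete illustration of the semantics under discussion (resolve_group_size is a hypothetical helper, not code from this PR): a group_size of -1 in the config means channelwise quantization, i.e. a single group spanning the whole input dimension.

```python
def resolve_group_size(group_size_cfg: int, input_size: int) -> int:
    # -1 in the config means channelwise quantization: one scale group that
    # spans the entire input dimension of the layer.
    return group_size_cfg if group_size_cfg != -1 else input_size

assert resolve_group_size(128, 4096) == 128   # grouped: 4096 / 128 = 32 groups per row
assert resolve_group_size(-1, 4096) == 4096   # channelwise: one group per row
```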
        group_size=weight_quant.group_size)

    # Detect If Activation Quantization.
    # TODO @dsikka: clean-up conditions
Is this still a TODO?
Yes. It's a general follow-up on the state of these conditions.
    return CompressedTensorsW8A8Fp8(
        strategy=weight_quant.strategy,
-       is_static_input_scheme=(not input_quant.dynamic))
+       is_static_input_scheme=(input_quant
Why is this changing? Won't we always have input_quant if is_activation_quantization_format?
It's just an extra check that the activation config details aren't None and were parsed correctly.
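A minimal sketch of the defensive pattern being described (the exact expression is elided in the truncated diff above; static_input_scheme here is a hypothetical stand-in, not the actual code):

```python
from types import SimpleNamespace

def static_input_scheme(input_quant) -> bool:
    # Guard against an activation-quantization config that is missing or was
    # not parsed (None) before reading `.dynamic`; without the guard this
    # would raise AttributeError.
    return input_quant is not None and not input_quant.dynamic

assert static_input_scheme(SimpleNamespace(dynamic=False)) is True   # static scales
assert static_input_scheme(SimpleNamespace(dynamic=True)) is False   # dynamic scales
assert static_input_scheme(None) is False                            # config missing
```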
Make sure to unblock the multi-GPU A100 model correctness tests. Nice job!
Head branch was pushed to by a user without write access
…eter` and `weight_loader_v2` (vllm-project#5874) Signed-off-by: Alvant <[email protected]>
…eter` and `weight_loader_v2` (vllm-project#5874) Signed-off-by: LeiWang1999 <[email protected]>
Summary
This PR:
- Introduces BasevLLMParameter, ModelWeightParameter, GroupQuantScaleParameter, ChannelQuantScaleParameter, PerTensorScaleParameter, and PackedvLLMParameter.
- Keeps the existing weight_loader method in each of the LinearBase classes, and refactors weight loading for the compressed-tensors quantization configs by adding a weight_loader_v2 method to each of the LinearBase classes.
- All other quantization configurations are still using the original weight loader, as part of the scope of this PR.

FOLLOW UP:
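A rough sketch of how the pieces are intended to fit together, based only on the names above; the real constructor signatures, method names, and sharding logic in vLLM are more involved, so treat this as an assumption-laden illustration rather than the actual implementation:

```python
import torch
from torch.nn import Parameter

class BasevLLMParameter(Parameter):
    """Stand-in for the base class: a Parameter that carries its own loader."""

    def __new__(cls, data: torch.Tensor, weight_loader, **kwargs):
        return super().__new__(cls, data, requires_grad=False)

    def __init__(self, data: torch.Tensor, weight_loader, **kwargs):
        self.weight_loader = weight_loader

    def load_column_parallel_weight(self, loaded_weight: torch.Tensor) -> None:
        # Real subclasses (ModelWeightParameter, GroupQuantScaleParameter,
        # PerTensorScaleParameter, PackedvLLMParameter, ...) would shard
        # and/or unpack here; this sketch just copies the checkpoint tensor.
        self.data.copy_(loaded_weight)

# Sketch of the new hook on a LinearBase-style layer: instead of each layer's
# weight_loader branching on ad-hoc attributes attached to a raw Parameter,
# weight_loader_v2 delegates to the parameter object itself.
def weight_loader_v2(param: BasevLLMParameter, loaded_weight: torch.Tensor) -> None:
    param.load_column_parallel_weight(loaded_weight)

# Usage sketch
weight = BasevLLMParameter(data=torch.empty(8, 16), weight_loader=weight_loader_v2)
weight_loader_v2(weight, torch.randn(8, 16))
```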