Commit f02db41
Support using Int4PreshuffledTensor after loading
Summary:
Int4PreshuffledTensor has the fastest int4 kernels in fbgemm, both for int4 weight-only and for fp8 activation + int4 weight quantization. However, the tensor cannot be sliced because of the preshuffling (and slice has to preserve aliasing), so we use Int4Tensor (plain format) instead, which can be sliced during loading, and then convert the tensor to the preshuffled format after loading using the
`torchao.prototype.tensor_conversion.api.convert_to_packed_tensor_based_on_current_hardware`
function.
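The load-then-convert pattern described above can be sketched with toy stand-ins. The classes and helper below are hypothetical illustrations of the idea (plain format is sliceable during loading, packed format is not), not the real torchao tensor subclasses; the actual conversion entry point in this commit is `convert_to_packed_tensor_based_on_current_hardware`.

```python
# Toy illustration (hypothetical names): keep weights in a plain, sliceable
# format while loading checkpoint shards, then repack once into a
# kernel-friendly layout after loading completes.

class PlainInt4:
    """Stand-in for the plain format: slicing works, preserving aliasing."""
    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, s):
        # Sliceable, so sharded checkpoint loading can copy into sub-ranges.
        return PlainInt4(self.rows[s])


class PreshuffledInt4:
    """Stand-in for the preshuffled format: fast kernel layout, not sliceable."""
    def __init__(self, rows):
        # A stand-in "preshuffle": interleave odd rows before even rows.
        self.rows = rows[1::2] + rows[0::2]


def convert_after_loading(plain):
    # Mirrors the post-load conversion step: plain -> preshuffled.
    return PreshuffledInt4(plain.rows)


# Load a shard by slicing the plain tensor, then convert exactly once.
full = PlainInt4(list(range(8)))
shard = full[0:4]               # slicing is only possible in plain format
packed = convert_after_loading(shard)
print(packed.rows)              # [1, 3, 0, 2]
```

The key design point this mirrors: the sliceable representation exists only for the duration of loading, and the one-time conversion cost is paid before inference starts.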
Test Plan:
pytest tests/quantization/test_torchao.py -k test_opt_125m_int4wo_model_running_preshuffled_kernel
For the test, we uploaded a plain int4 tensor checkpoint (https://huggingface.co/torchao-testing/opt-125m-Int4WeightOnlyConfig-v2-0.14.0.dev), load it in vLLM, and then check that the model has been transformed to use Int4PreshuffledTensor before inference.
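The shape of that check can be sketched as follows. The classes and weight list here are toy stand-ins, not the real vLLM model or torchao subclasses; the real test inspects the loaded model's quantized weights.

```python
# Toy sketch of the test's assertion: after loading, no weight should
# remain in the plain format; every quantized weight should be an
# instance of the preshuffled subclass.

class Int4Tensor:               # stand-in for the plain format
    pass

class Int4PreshuffledTensor:    # stand-in for the preshuffled format
    pass


def all_preshuffled(weights):
    """Return True only if every weight uses the preshuffled subclass."""
    return all(isinstance(w, Int4PreshuffledTensor) for w in weights)


# Simulate a model whose weights were converted after loading.
weights = [Int4PreshuffledTensor(), Int4PreshuffledTensor()]
print(all_preshuffled(weights))             # True
print(all_preshuffled([Int4Tensor()]))      # False: plain format remains
```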
Signed-off-by: Jerry Zhang <[email protected]>

1 parent c312468, commit f02db41
File tree
2 files changed: +33 −0 lines
- tests/quantization
- vllm/model_executor/layers/quantization