* Summary:
Adds SMOOTHQUANT-W8A8 quantization method to the TorchAO model release pipeline.
- Adjusted defaults: increased calibration samples from 10 to 128 to ensure consistency; reduced the default max sequence length (`max_seq_length`) from 2048 to 1024
- Updated HF CLI command: `huggingface-cli login` to `hf auth login`
Test plan:
```bash
python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant SMOOTHQUANT-W8A8 --push_to_hub --task bbh
```
* add SmoothQuant uploader
* separate docs for AWQ & SmoothQuant
* rename SMOOTHQUANT-W8A8 to SMOOTHQUANT-INT8-INT8
* add SmoothQuant release example
* update example in docs
* rename SMOOTHQUANT-INT8-INT8 to SmoothQuant-INT8-INT8
* rename SMOOTHQUANT to SmoothQuant
* revert max_seq_length default to 2048
Note: for the initial release, please include `--populate_model_card_template` to populate the model card template.
### SmoothQuant-INT8-INT8
[SmoothQuant](https://arxiv.org/abs/2211.10438) smooths activation outliers by migrating quantization difficulty from activations to weights through a mathematically equivalent per-channel scaling transformation. This means SmoothQuant observes the activation distribution before applying quantization.
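For intuition only, here is a minimal sketch of that per-channel scaling for a single linear layer; the function name and the `alpha` migration strength below are illustrative, not the pipeline's API:

```python
import torch

def smoothquant_smooth(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights.

    act_absmax: per-input-channel max |activation| observed during calibration, shape [in_features]
    weight:     linear layer weight, shape [out_features, in_features]
    """
    w_absmax = weight.abs().amax(dim=0)                  # per-input-channel weight range
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)  # smoothing scale per channel
    s = s.clamp(min=1e-5)
    # (X / s) @ (weight * s).T == X @ weight.T, so the transform is mathematically
    # equivalent, but X / s has much milder outliers and quantizes to INT8 more easily.
    return s, weight * s
```

This is also why the calibration samples and the `--task` used for calibration matter: the `act_absmax` statistics come from running the model on calibration data.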
Examples:
```
# release SmoothQuant-INT8-INT8 model, calibrated with a specific task
python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant SmoothQuant-INT8-INT8 --push_to_hub --task bbh
```
Similar to SmoothQuant, [AWQ](https://arxiv.org/abs/2306.00978) improves accuracy for weight-only quantization by preserving "salient" weight channels that have a high impact on the accuracy of the output. The notable point is that AWQ uses the activation distribution, not the weight distribution, to find salient weights: it multiplies each salient weight channel by a scale and applies the inverse scale to the corresponding activation. Since activations are not quantized, there is no additional loss on the activation side, while the quantization loss from the weights is reduced.
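A rough sketch of the idea, with the same caveats (illustrative names, fixed exponent; real AWQ searches the scaling exponent per layer to minimize output error):

```python
import torch

def awq_protect_salient(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Scale salient weight channels up before weight-only (e.g. INT4) quantization.

    act_absmax: per-input-channel activation magnitude from calibration, shape [in_features]
    weight:     linear layer weight, shape [out_features, in_features]
    """
    # Channels that see large activations are "salient"; scaling their weights up
    # shrinks their relative quantization error.
    s = act_absmax.pow(alpha).clamp(min=1e-5)
    s = s / s.mean()                  # keep scales centered around 1
    scaled_weight = weight * s        # this tensor is what gets quantized
    # At inference the inverse scale is folded into the (unquantized) activations,
    # i.e. X / s, so the product is unchanged while error on salient channels drops.
    return s, scaled_weight
```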
After eval for the INT4 checkpoint is done, we might find that some tasks have a large accuracy drop compared to the high-precision baseline. In that case we can calibrate for that task with a few samples. Tasks are selected from [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md). You can follow the [new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to add new tasks to lm-eval.