
Commit 80cc501

Update on "Add NVFP4 QAT"
**Summary:** This commit adds a QAT flow for NVFP4, following the numerics in `NVFP4Tensor` closely but without the dtype casting, swizzling, and packing/unpacking. Users can call this flow as follows:

```
from torchao.quantization import quantize_
from torchao.quantization.qat import NVFP4FakeQuantizeConfig, QATConfig

qat_config = QATConfig(
    activation_config=NVFP4FakeQuantizeConfig(),
    weight_config=NVFP4FakeQuantizeConfig(),
    step="prepare",
)
quantize_(model, qat_config)
```

**Test Plan:**

```
python test/quantization/test_qat.py -k test_qat_nvfp4
```

Initial benchmarks from fine-tuning Qwen3-1.7B on alpaca for 3 epochs:

```
# Without QAT
|  Tasks |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|------|---------------|---|------:|---|------|
|wikitext|      2|none  |None  |bits_per_byte  |↓  | 0.8322|±  |   N/A|
|        |       |none  |None  |byte_perplexity|↓  | 1.7804|±  |   N/A|
|        |       |none  |None  |word_perplexity|↓  |21.8611|±  |   N/A|

# With QAT
|  Tasks |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|------|---------------|---|------:|---|------|
|wikitext|      2|none  |None  |bits_per_byte  |↓  | 0.8271|±  |   N/A|
|        |       |none  |None  |byte_perplexity|↓  | 1.7741|±  |   N/A|
|        |       |none  |None  |word_perplexity|↓  |21.4467|±  |   N/A|
```

[ghstack-poisoned]
2 parents: 732fb16 + cda3a85
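For context on the numerics the summary references: NVFP4 stores FP4 (E2M1) element values with one FP8 (E4M3) scale per 16-element block. The sketch below is illustrative only (it is not the code added in this commit) and assumes the standard E2M1 value grid, a fixed block size of 16, and no per-tensor scale; a real QAT quantizer would additionally route gradients through a straight-through estimator.

```
import torch

# Representable magnitudes of FP4 E2M1: {0, 0.5, 1, 1.5, 2, 3, 4, 6}
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quantize(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Illustrative NVFP4-style fake quantization (not this commit's code).

    Each block of `block_size` elements along the last dim shares one scale,
    which is itself rounded through float8_e4m3fn; elements are then snapped
    to the nearest E2M1 value. No packing or swizzling, as in the summary.
    Assumes x.numel() is divisible by block_size.
    """
    orig_shape = x.shape
    blocks = x.reshape(-1, block_size)
    # Per-block scale chosen so the block max maps to E2M1's max value (6.0)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    # Keep the scale nonzero and small enough to survive the E4M3 round-trip
    scale = (amax / 6.0).clamp(min=2**-8)
    # NVFP4 block scales are E4M3, so round the scale through that dtype
    scale = scale.to(torch.float8_e4m3fn).to(x.dtype)
    # Snap each scaled element to the nearest representable E2M1 magnitude
    # (argmin breaks ties toward the smaller value; real kernels may differ)
    scaled = (blocks / scale).clamp(-6.0, 6.0)
    grid = E2M1_GRID.to(device=x.device, dtype=x.dtype)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    fq = grid[idx] * scaled.sign()
    return (fq * scale).reshape(orig_shape)
```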

File tree

3 files changed: +4 -3

docs/source/api_ref_qat.rst

Lines changed: 2 additions & 1 deletion
@@ -27,7 +27,6 @@ Custom QAT APIs
     FakeQuantizeConfigBase
     IntxFakeQuantizeConfig
     Float8FakeQuantizeConfig
-    NVFP4FakeQuantizeConfig
     FakeQuantizedLinear
     FakeQuantizedEmbedding
     FakeQuantizerBase
@@ -63,3 +62,5 @@ Prototype
    :nosignatures:

     initialize_fake_quantizers
+    NVFP4FakeQuantizeConfig
+    NVFP4FakeQuantizer

torchao/quantization/qat/fake_quantize_config.py

Lines changed: 1 addition & 1 deletion
@@ -80,7 +80,7 @@ def __post_init__(self):
 @dataclass
 class NVFP4FakeQuantizeConfig(FakeQuantizeConfigBase):
     """
-    Config for fake quantizing weights or activations to NVIDIA's NVFP4 format
+    (Prototype) Config for fake quantizing weights or activations to NVIDIA's NVFP4 format
     according to https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/.

     Fake quantization numerics follow `NVFP4Tensor` closely: https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats/nvfp4_tensor.py.
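Since the docstring now labels the config as a prototype, the API may still change. For orientation, the prepare step from the summary pairs with torchao's standard QAT convert step. Below is a minimal sketch of the two-step flow; the `QATConfig(..., step="convert")` pattern is torchao's documented QAT API, but the name `NVFP4InferenceConfig` and its import path are my assumptions, not confirmed by this diff:

```
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import NVFP4FakeQuantizeConfig, QATConfig

model = torch.nn.Sequential(torch.nn.Linear(256, 256))  # toy model

# Prepare: insert NVFP4 fake quantizers for training (from the summary)
quantize_(model, QATConfig(
    activation_config=NVFP4FakeQuantizeConfig(),
    weight_config=NVFP4FakeQuantizeConfig(),
    step="prepare",
))
# ... fine-tune the model here ...

# Convert: replace fake quantization with real NVFP4 quantization.
# NVFP4InferenceConfig is an assumed name; check torchao.prototype.mx_formats
# for the actual inference-side config to pair with this QAT flow.
from torchao.prototype.mx_formats import NVFP4InferenceConfig
quantize_(model, QATConfig(NVFP4InferenceConfig(), step="convert"))
```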

torchao/quantization/qat/fake_quantizer.py

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:

 class NVFP4FakeQuantizer(FakeQuantizerBase):
     """
-    Generic module for applying NVFP4 fake quantization to a tensor, as specified in the config.
+    (Prototype) Generic module for applying NVFP4 fake quantization to a tensor, as specified in the config.
     """

     def __init__(self, config: NVFP4FakeQuantizeConfig):
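Given the signatures visible in this hunk (`__init__(self, config: NVFP4FakeQuantizeConfig)` and `forward(self, x: torch.Tensor) -> torch.Tensor`), the quantizer can presumably also be exercised standalone, e.g. for inspecting fake-quantization error. A minimal sketch; importing `NVFP4FakeQuantizer` from `torchao.quantization.qat` is inferred from the docs change above rather than shown in this diff:

```
import torch
from torchao.quantization.qat import NVFP4FakeQuantizeConfig, NVFP4FakeQuantizer

# Standalone fake quantization of a single tensor, mirroring the hunk's
# signatures; the import path is inferred from the api_ref_qat.rst change.
fq = NVFP4FakeQuantizer(NVFP4FakeQuantizeConfig())
x = torch.randn(128, 256, dtype=torch.bfloat16)
x_fq = fq(x)
assert x_fq.shape == x.shape  # fake quantization preserves shape
```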
