* What happens: ``linear.weight = torch.nn.Parameter(to_affine_quantized_intx(linear.weight), requires_grad=False)``
* quantization primitive ops: ``choose_qparams_affine`` and ``quantize_affine`` are called to quantize the Tensor (see the sketch after this list)
* quantized Tensor will be ``AffineQuantizedTensor``, a quantized tensor with derived dtype (e.g. int4 with scale and zero_point)
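
A minimal sketch of those primitive ops (assuming a recent ``torchao``; exact signatures and defaults may vary across versions, and the tensor shape and group size of 32 below are illustrative). ``to_affine_quantized_intx`` runs these steps internally and wraps the results in an ``AffineQuantizedTensor``::

  import torch
  from torchao.quantization.quant_primitives import (
      MappingType,
      choose_qparams_affine,
      quantize_affine,
  )

  w = torch.randn(1024, 1024, dtype=torch.bfloat16)
  block_size = (1, 32)  # per-group quantization along the last dim, group size 32

  # derive per-group scale and zero_point for asymmetric int4 (range [0, 15])
  scale, zero_point = choose_qparams_affine(
      w, MappingType.ASYMMETRIC, block_size,
      target_dtype=torch.int32, quant_min=0, quant_max=15,
  )
  # produce the quantized values; int4 has no native dtype, so they are
  # stored in an int32 container here
  w_int4 = quantize_affine(
      w, block_size, scale, zero_point,
      output_dtype=torch.int32, quant_min=0, quant_max=15,
  )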
During Quantization
###################

First we start with the API call: ``quantize_(model, Int4WeightOnlyConfig())``. This converts the weights of ``nn.Linear`` modules in the model to int4 quantized tensors (``AffineQuantizedTensor`` with int4 dtype, asymmetric, per-group quantization), using the layout for the tinygemm kernel: the ``tensor_core_tiled`` layout. A minimal usage sketch follows the list below.

* `quantize_ <https://docs.pytorch.org/ao/main/generated/torchao.quantization.quantize_.html#torchao.quantization.quantize_>`__: the model level API that quantizes the weights of linear modules according to the configuration passed as the second argument
* `Int4WeightOnlyConfig <https://docs.pytorch.org/ao/main/generated/torchao.quantization.Int4WeightOnlyConfig.html#torchao.quantization.Int4WeightOnlyConfig>`__: the configuration that tells ``quantize_`` to convert the weight of linear to an int4 weight only quantized weight
* Calls quantization primitive ops like ``choose_qparams_affine`` and ``quantize_affine`` to quantize the Tensor
* `TensorCoreTiledLayout <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/dtypes/affine_quantized_tensor.py#L573>`__: the tensor core tiled layout type, storing parameters for the packing format
* `TensorCoreTiledAQTTensorImpl <https://github.com/pytorch/ao/blob/242f181fe59e233b458740b06464ad42da8df6af/torchao/dtypes/affine_quantized_tensor.py#L1376>`__: the tensor core tiled TensorImpl, stores the packed weight for efficient int4 weight only kernel (tinygemm kernel)
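
As mentioned above, a minimal usage sketch of the API call (hedged: ``group_size=32`` and the model shape are illustrative, and the tinygemm path expects a bfloat16 model on CUDA)::

  import torch
  from torchao.quantization import quantize_, Int4WeightOnlyConfig

  model = torch.nn.Sequential(
      torch.nn.Linear(1024, 1024, bias=False)
  ).to(torch.bfloat16).to("cuda")

  # swaps each nn.Linear weight for an int4 AffineQuantizedTensor packed
  # in the tensor_core_tiled layout (the default layout for this config)
  quantize_(model, Int4WeightOnlyConfig(group_size=32))
  print(type(model[0].weight))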

The size of the quantized model is typically smaller than the original floating point model, though this also depends on the specific technique and implementation you are using. You can print the model size with the ``torchao.utils.get_model_size_in_bytes`` utility function; for the above example using ``Int4WeightOnlyConfig`` quantization, the size reduction is around 4x. A minimal sketch of the measurement (quantizing a fresh bfloat16 model; the group size of 32 is illustrative)::
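
  import torch
  from torchao.quantization import quantize_, Int4WeightOnlyConfig
  from torchao.utils import get_model_size_in_bytes

  model = torch.nn.Sequential(
      torch.nn.Linear(1024, 1024, bias=False)
  ).to(torch.bfloat16).to("cuda")

  size_before = get_model_size_in_bytes(model)
  quantize_(model, Int4WeightOnlyConfig(group_size=32))
  size_after = get_model_size_in_bytes(model)
  # int4 weights plus per-group scales/zero_points vs. bfloat16 weights,
  # so roughly a 4x reduction
  print(f"reduction: {size_before / size_after:.1f}x")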