Goal
Here are two feasible approaches to support running quantized models with TVM:
- Get a quantized model from another frontend framework like TF, C2, or MX. We need to add support for low-bit kernels and operators like quantize/dequantize in TVM, then transform the quantized graph directly.
- Implement the quantization algorithm based on Relay, taking over the quantization procedure with Relay. This approach also requires support for low-bit TVM kernels.
Actually, these two approaches are not mutually exclusive, and we can pursue both. The question is whether the second approach is necessary and worth the extra effort.
The problem is that different hardware targets may have different constraints: they may support different bit widths, and some hardware may only support shifts. There are also multiple choices of quantization scheme, like symmetric, asymmetric, etc. We want to make this procedure easier and more flexible for hardware developers, based on Relay and TVM. Again, what we want is not to propose "the only right way to achieve quantization in TVM", but a workflow that can be flexibly customized for different hardware targets and different quantization schemes. Signed symmetric quantization is just one demo of this workflow.
Current Design
The current quantization workflow is composed of three passes on the Relay IR.
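As a rough sketch of how these passes might be chained (the entry points below are illustrative placeholders, not a final API):

```python
# Hypothetical driver chaining the three passes described below.
def quantize(graph, params):
    graph = annotate(graph)            # insert simulated_quantize ops
    graph = calibrate(graph, params)   # fill in scales and clip bounds
    graph = realize(graph)             # lower to real low-bit integer ops
    return graph
```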
Annotate
Given a float32 graph, this pass returns a graph that simulates the error introduced by the current quantization scheme. The implementation centers around a rewrite function for each operator and a simulated quantize op. Let us review the definition of simulated_quantize first:
```python
import numpy as np

def simulated_quantize(data, dom_scale, nbit, clip_min, clip_max,
                       sign=True, rounding='round'):
    """Simulate the rounding error and saturation error of quantization."""
    scaled_data = data / dom_scale
    # select the rounding scheme `round`/`floor`/`ceil`/`statistical_round`
    # according to the `rounding` attribute
    round_data = np.round(scaled_data)
    clipped_data = np.clip(round_data, clip_min, clip_max)
    # recover the data in the float32 domain
    ret_data = clipped_data * dom_scale
    return ret_data
```

Every operator can register an AnnotateRewrite function, which rewrites that operator in the original graph. For example, it will rewrite a subgraph data->conv-> to data->sq->conv->. Users can override it for a different quantization scheme.
```python
# a naive pseudocode example of registering a rewrite function for conv2d
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite(ref_call, new_args, ctx):
    lhs, rhs = new_args
    # wrap both inputs in simulated_quantize ops (signed, round-to-nearest)
    lhs = attach_simulated_quantize(lhs, sign=True, rounding='round')
    rhs = attach_simulated_quantize(rhs, sign=True, rounding='round')
    return conv2d(lhs, rhs, ref_call.attrs)
```
Calibrate
The calibrate pass calculates the values of dom_scale, nbit, clip_min, and clip_max for every simulated_quantize operator. Currently we use a quite naive approach, setting them to the upper/lower bounds that the default bit setting allows. There is a lot of room to explore smarter ways of setting these fields.
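For illustration, a minimal sketch of such a naive calibration for signed symmetric quantization might look like the following; `naive_calibrate` and the optional `max_abs` statistic (the largest absolute value observed in the data) are hypothetical names, not part of the current implementation:

```python
def naive_calibrate(nbit, max_abs=None):
    """Pick dom_scale/clip bounds for one simulated_quantize op (sketch)."""
    # with nbit signed bits, the representable range is
    # [-(2^(nbit-1) - 1), 2^(nbit-1) - 1], e.g. [-127, 127] for nbit=8
    clip_max = float(2 ** (nbit - 1) - 1)
    clip_min = -clip_max
    if max_abs is None:
        # the current naive scheme: just use the bounds the bit setting allows
        dom_scale = 1.0
    else:
        # a smarter choice: scale so that max_abs maps onto clip_max
        dom_scale = max_abs / clip_max
    return dom_scale, nbit, clip_min, clip_max
```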
Realize
The realize pass transforms the simulated quantized graph, which actually still computes with float32, into a real low-bit integer graph. It replaces each simulated_quantize with several fine-grained operators like add, multiply, and shift as much as possible, for performance (fusion, etc.).
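For intuition, here is a NumPy sketch of what one realized simulated_quantize boils down to, assuming an 8-bit signed scheme; this is a hand-written illustration, not the actual Relay pass:

```python
import numpy as np

def realized_quantize(data, dom_scale, clip_min, clip_max):
    """What a realized simulated_quantize computes, in NumPy (sketch)."""
    # in the real pass the division becomes an integer multiply,
    # or a right shift when dom_scale is a power of two
    scaled = np.round(data / dom_scale)
    clipped = np.clip(scaled, clip_min, clip_max)
    # store in a real low-bit integer type instead of float32
    return clipped.astype(np.int8)
```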
Demonstration
This workflow should be able to support different choices of bit width and quantization scheme; one only needs to override the registered AnnotateRewrite function for each operator, as sketched below.
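For example, a hypothetical override that quantizes conv2d inputs with 16 bits instead of 8 could look like this; the `nbit` keyword on `attach_simulated_quantize` is an assumed extension of the pseudo API above:

```python
# hypothetical override: 16-bit symmetric quantization for conv2d inputs
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite_i16(ref_call, new_args, ctx):
    lhs, rhs = new_args
    lhs = attach_simulated_quantize(lhs, sign=True, rounding='round', nbit=16)
    rhs = attach_simulated_quantize(rhs, sign=True, rounding='round', nbit=16)
    return conv2d(lhs, rhs, ref_call.attrs)
```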
(TODO)
Support for different bits
- i8->i32
- i16->i32
- i8->i24
- i5->i16
Support for different quantization schemes (a brief sketch of the first two follows the list)
- Symmetric
- Asymmetric
- Channel-wise Scale
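For reference, the arithmetic difference between the symmetric and asymmetric schemes is small; a minimal NumPy sketch, with all names hypothetical and 8 bits assumed:

```python
import numpy as np

def symmetric_quantize(x, scale):
    # zero maps to zero; int8 range [-127, 127]
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def asymmetric_quantize(x, scale, zero_point):
    # a zero_point offset shifts the range so that uint8 [0, 255]
    # can cover data that is not centered around zero
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
```

Channel-wise scale would apply a separate `scale` per output channel instead of one scalar per tensor.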