[RFC] Quantization Workflow #2259

@ZihengJiang


Goal

Here are two feasible approaches to support running quantized models with TVM:

  • Get a quantized model from another frontend framework such as TF, C2, or MX. We need to add support in TVM for low-bit kernels and for operators like quantize/dequantize, then transform the quantized graph directly.
  • Implement the quantization algorithm in Relay, so that Relay takes over the quantization procedure. This approach also requires support for low-bit TVM kernels.

Actually, these two methods are not contradictory and we can achieve both. The issue is whether the second approach is necessary and worth the extra effort.

The problem is that different hardware targets may have different constraints: the available bit widths may differ, and some hardware may only support shifts. There are also multiple quantization schemes to choose from, such as symmetric and asymmetric. We want to make this procedure easier and more flexible for hardware developers, building on Relay and TVM. Again, the aim is not to propose "the only right way to achieve quantization in TVM". The aim is to propose a workflow that can be flexibly customized for different hardware and different quantization schemes. Signed symmetric quantization is just one demo of this workflow.

Current Design

The current quantization workflow is composed of three passes on the Relay IR.

Annotate

Given a float32 graph, this pass returns a graph that simulates the error introduced by the current quantization scheme. The implementation centers on a rewrite function for each operator and a simulated quantize op. Let us review the definition of simulated_quantize first:

import numpy as np

def simulated_quantize(data, dom_scale, nbit, clip_min, clip_max, sign=True, rounding='round'):
    """Simulate the rounding error and saturation error of quantization."""
    # nbit and sign are carried as attributes; clip_min/clip_max already encode the range
    scaled_data = data / dom_scale
    # select the rounding scheme `round`/`floor`/`ceil`/`statistical_round` according to
    # attribute `rounding` (only `round` is shown here)
    round_data = np.round(scaled_data)
    clipped_data = np.clip(round_data, clip_min, clip_max)
    # recover the data in the original float domain
    ret_data = clipped_data * dom_scale
    return ret_data
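
To make the two error sources concrete, here is a small worked example. It is only a sketch: it uses NumPy rather than a Relay op, implements the round-to-nearest case only, and the scale of 1/8 and the signed 8-bit range of [-127, 127] are values assumed for illustration, not taken from the design.

```python
import numpy as np

def simulated_quantize(data, dom_scale, clip_min, clip_max):
    """Round-to-nearest variant only; mirrors the definition above."""
    scaled = data / dom_scale
    rounded = np.round(scaled)
    clipped = np.clip(rounded, clip_min, clip_max)
    return clipped * dom_scale

data = np.array([0.30, -0.77, 1.0, 20.0], dtype="float32")
# Assumed settings: signed 8-bit range and a power-of-two scale of 1/8.
out = simulated_quantize(data, dom_scale=0.125, clip_min=-127, clip_max=127)

print(out)         # values snapped to multiples of 0.125, still float32
print(out - data)  # per-element quantization error
```

The first two elements show rounding error (0.30 becomes 0.25, -0.77 becomes -0.75), 1.0 is represented exactly, and 20.0 saturates at 127 * 0.125 = 15.875.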

Every operator can register an AnnotateRewrite function, which rewrites that operator in the original graph. For example, it will rewrite the subgraph data->conv-> to data->sq->conv->, where sq is a simulated_quantize op. Users can override it for a different quantization scheme.

# a naive pseudocode example of registering a rewrite function for conv2d
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite(ref_call, new_args, ctx):
    lhs, rhs = new_args
    lhs = attach_simulated_quantize(lhs, sign=True, rounding='round')
    rhs = attach_simulated_quantize(rhs, sign=True, rounding='round')
    return conv2d(lhs, rhs, ref_call.attrs)
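
The registration mechanism can be illustrated without Relay at all. The following is a toy sketch under stated assumptions: the graph is modelled as nested tuples, and `ANNOTATE_REWRITES`, `annotate`, and the simplified signatures of `register_annotate_rewrite` and `attach_simulated_quantize` are hypothetical stand-ins, not the actual TVM API.

```python
# Toy stand-in for a Relay graph: nested tuples ("op", arg0, arg1, ...);
# plain strings are free variables.
ANNOTATE_REWRITES = {}  # hypothetical registry: op name -> rewrite function

def register_annotate_rewrite(op_name):
    def decorator(func):
        ANNOTATE_REWRITES[op_name] = func
        return func
    return decorator

def attach_simulated_quantize(expr):
    # Avoid stacking two simulated_quantize nodes on the same value.
    if isinstance(expr, tuple) and expr[0] == "simulated_quantize":
        return expr
    return ("simulated_quantize", expr)

@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite(op_name, new_args):
    lhs, rhs = new_args
    return (op_name, attach_simulated_quantize(lhs), attach_simulated_quantize(rhs))

def annotate(expr):
    """Post-order traversal: rewrite the arguments first, then the call itself."""
    if not isinstance(expr, tuple):
        return expr
    op_name, *args = expr
    new_args = [annotate(a) for a in args]
    rewrite = ANNOTATE_REWRITES.get(op_name)
    if rewrite is None:
        return (op_name, *new_args)  # operators without a rewrite are kept as-is
    return rewrite(op_name, new_args)

graph = ("nn.relu", ("nn.conv2d", "data", "weight"))
annotated = annotate(graph)
print(annotated)
```

Overriding the scheme for an operator then amounts to registering a different function under the same operator name, which replaces the previous entry in the registry.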

Calibrate

The calibrate pass computes the values of dom_scale, nbit, clip_min, and clip_max for every simulated_quantize operator. Currently, we use a quite naive approach: they are set to the upper/lower bounds that the default bit setting allows. There is a lot of room to explore smarter ways of setting these fields.
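
A naive calibration of this kind might look like the sketch below. The function name `naive_calibrate` and the `global_scale` parameter (an assumed bound on the magnitude of the values) are hypothetical, and the choice of a symmetric signed range that drops the most negative value is an assumption as well; the RFC does not specify these details.

```python
def naive_calibrate(nbit=8, sign=True, global_scale=8.0):
    """Fill the simulated_quantize fields from the bit width alone (no data statistics)."""
    if sign:
        clip_max = 2 ** (nbit - 1) - 1  # e.g. 127 for 8 bits
        clip_min = -clip_max            # symmetric range, drops -128
    else:
        clip_max = 2 ** nbit - 1
        clip_min = 0
    # Assumption: values are expected to lie within [-global_scale, global_scale]
    # (or [0, global_scale] when unsigned).
    valid_range = 2 ** (nbit - 1) if sign else 2 ** nbit
    dom_scale = global_scale / valid_range
    return {"dom_scale": dom_scale, "nbit": nbit,
            "clip_min": clip_min, "clip_max": clip_max}

print(naive_calibrate())
print(naive_calibrate(nbit=4, sign=False))
```

A smarter calibration would replace the fixed `global_scale` with statistics gathered from real activations, which is the open space mentioned above.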

Realize

The realize pass transforms the simulated quantized graph, which actually still computes in float32, into a real low-bit integer graph. It replaces each simulated_quantize with fine-grained operators such as add, multiply, and shift wherever possible, for performance (fusion, etc.).
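
As a rough illustration of what the realized integer arithmetic looks like, the sketch below multiplies two int8 tensors in an int32 accumulator and rescales the result with a rounding right shift. It uses NumPy instead of Relay operators, and the power-of-two scales are assumed so that the rescale reduces to a shift; none of the names come from the TVM code base.

```python
import numpy as np

def quantize(x, dom_scale, clip_min=-127, clip_max=127):
    """Real quantize: float32 -> int8 using the calibrated fields."""
    q = np.clip(np.round(x / dom_scale), clip_min, clip_max)
    return q.astype("int8")

a = np.array([0.5, -1.25, 3.0], dtype="float32")
b = np.array([0.25, 0.75, -0.5], dtype="float32")
scale_a, scale_b = 2.0 ** -4, 2.0 ** -5  # assumed power-of-two scales

qa, qb = quantize(a, scale_a), quantize(b, scale_b)
# Integer multiply with a wider accumulator (the i8->i32 case).
acc = qa.astype("int32") * qb.astype("int32")  # implied scale: 2**-9
# Requantize to an output scale of 2**-5: right shift by 4 bits with rounding.
shift = 4
out_q = np.clip((acc + (1 << (shift - 1))) >> shift, -127, 127).astype("int8")
out = out_q.astype("float32") * 2.0 ** -5

print(out_q)  # int8 result
print(out)    # dequantized result, for comparison with a * b
```

With these inputs every value is exactly representable, so the dequantized result equals the float32 product; in general the two differ by the rounding and saturation error that simulated_quantize models.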

Demonstration

This workflow should be able to support different choices of bit width and quantization scheme. Users only need to override the registered AnnotateRewrite function for each operator.

(TODO)

Support for different bits

  • i8->i32
  • i16->i32
  • i8->i24
  • i5->i16

Support for different quantization schemes

  • Symmetric
  • Asymmetric
  • Channel-wise Scale
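
For reference, the three schemes differ mainly in how the scale (and an optional zero point) is derived. The sketch below uses the common textbook formulas with min/max statistics; the function names are hypothetical and this is not a statement of how the workflow will implement them.

```python
import numpy as np

def symmetric_params(x, nbit=8):
    """Signed range centered on zero: one scale, zero point fixed at 0."""
    qmax = 2 ** (nbit - 1) - 1
    return float(np.abs(x).max()) / qmax, 0  # (scale, zero_point)

def asymmetric_params(x, nbit=8):
    """Unsigned range mapped onto [min, max]: scale plus a non-zero zero point."""
    qmin, qmax = 0, 2 ** nbit - 1
    scale = float(x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - float(x.min()) / scale))
    return scale, zero_point

def channelwise_scales(w, nbit=8):
    """One symmetric scale per output channel (axis 0)."""
    qmax = 2 ** (nbit - 1) - 1
    return np.abs(w).reshape(w.shape[0], -1).max(axis=1) / qmax

x = np.array([-1.0, 0.0, 1.55], dtype="float32")
w = np.array([[1.27, -0.5], [0.127, 0.05]], dtype="float32")

print(symmetric_params(x))
print(asymmetric_params(x))
print(channelwise_scales(w))
```

Channel-wise scales let a channel with small weights keep more resolution than a single per-tensor scale would give it, at the cost of carrying a vector of scales through the realize pass.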
