Goal
Here are two feasible approaches to support running quantized models with TVM:
- Get a quantized model from another frontend framework like TF, C2, or MX. We need to add support for low-bit kernels and operators like quantize/dequantize in TVM, then transform the quantized graph directly.
- Implement the quantization algorithm based on Relay, taking over the quantization procedure with Relay. This approach also requires support for low-bit TVM kernels.
Actually, these two approaches are not mutually exclusive, and we can pursue both. The question is whether the second approach is necessary and worth the extra effort.
The problem is that different hardware targets may have different constraints: they may support different bit widths, and some hardware may only support shifts. There are also multiple choices of quantization scheme, like symmetric, asymmetric, etc. We want to make this procedure easier and more flexible for hardware developers, based on Relay and TVM. Again, what we want is not to propose "the only right way to achieve quantization in TVM", but a workflow that can be flexibly customized for different hardware targets and different quantization schemes. Signed symmetric quantization is just one demo of this workflow.
Current Design
The current quantization workflow is composed of three passes on the Relay IR.
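As a rough sketch of how these passes might be chained (the entry points below are illustrative placeholders, not a final API):

```python
# Hypothetical driver chaining the three passes described below.
def quantize(graph, params):
    graph = annotate(graph)            # insert simulated_quantize ops
    graph = calibrate(graph, params)   # fill in scales and clip bounds
    graph = realize(graph)             # lower to real low-bit integer ops
    return graph
```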
Annotate
Given a float32 graph, this pass returns a graph that simulates the error introduced by the current quantization scheme. The implementation centers around a rewrite function for each operator and a simulated quantize op. Let us review the definition of simulated_quantize first:
```python
import numpy as np

def simulated_quantize(data, dom_scale, nbit, clip_min, clip_max,
                       sign=True, rounding='round'):
    """Simulate the rounding error and saturation error of quantization."""
    scaled_data = data / dom_scale
    # select the rounding scheme `round`/`floor`/`ceil`/`statistical_round`
    # according to the `rounding` attribute
    round_data = np.round(scaled_data)
    clipped_data = np.clip(round_data, clip_min, clip_max)
    # recover the data in the float32 domain
    ret_data = clipped_data * dom_scale
    return ret_data
```

Every operator can register an AnnotateRewrite function, which rewrites that operator in the original graph. For example, it will rewrite a subgraph data->conv-> to data->sq->conv->. Users can override it for a different quantization scheme.
```python
# a naive pseudocode example of registering a rewrite function for conv2d
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite(ref_call, new_args, ctx):
    lhs, rhs = new_args
    # wrap both inputs in simulated_quantize ops (signed, round-to-nearest)
    lhs = attach_simulated_quantize(lhs, sign=True, rounding='round')
    rhs = attach_simulated_quantize(rhs, sign=True, rounding='round')
    return conv2d(lhs, rhs, ref_call.attrs)
```
Calibrate
The calibrate pass calculates the values of dom_scale, nbit, clip_min, and clip_max for every simulated_quantize operator. Currently we use a quite naive approach, setting them to the upper/lower bounds that the default bit setting allows. There is a lot of room to explore smarter ways of setting these fields.
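For illustration, a minimal sketch of such a naive calibration for signed symmetric quantization might look like the following; `naive_calibrate` and the optional `max_abs` statistic (the largest absolute value observed in the data) are hypothetical names, not part of the current implementation:

```python
def naive_calibrate(nbit, max_abs=None):
    """Pick dom_scale/clip bounds for one simulated_quantize op (sketch)."""
    # with nbit signed bits, the representable range is
    # [-(2^(nbit-1) - 1), 2^(nbit-1) - 1], e.g. [-127, 127] for nbit=8
    clip_max = float(2 ** (nbit - 1) - 1)
    clip_min = -clip_max
    if max_abs is None:
        # the current naive scheme: just use the bounds the bit setting allows
        dom_scale = 1.0
    else:
        # a smarter choice: scale so that max_abs maps onto clip_max
        dom_scale = max_abs / clip_max
    return dom_scale, nbit, clip_min, clip_max
```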
Realize
The realize pass transforms the simulated quantized graph, which actually still computes with float32, into a real low-bit integer graph. It replaces each simulated_quantize with several fine-grained operators like add, multiply, and shift as much as possible, for performance (fusion, etc.).
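For intuition, here is a NumPy sketch of what one realized simulated_quantize boils down to, assuming an 8-bit signed scheme; this is a hand-written illustration, not the actual Relay pass:

```python
import numpy as np

def realized_quantize(data, dom_scale, clip_min, clip_max):
    """What a realized simulated_quantize computes, in NumPy (sketch)."""
    # in the real pass the division becomes an integer multiply,
    # or a right shift when dom_scale is a power of two
    scaled = np.round(data / dom_scale)
    clipped = np.clip(scaled, clip_min, clip_max)
    # store in a real low-bit integer type instead of float32
    return clipped.astype(np.int8)
```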
Demonstration
This workflow should be able to support different choices of bit width and quantization scheme; one only needs to override the registered AnnotateRewrite function for each operator, as sketched below.
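For example, a hypothetical override that quantizes conv2d inputs with 16 bits instead of 8 could look like this; the `nbit` keyword on `attach_simulated_quantize` is an assumed extension of the pseudo API above:

```python
# hypothetical override: 16-bit symmetric quantization for conv2d inputs
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite_i16(ref_call, new_args, ctx):
    lhs, rhs = new_args
    lhs = attach_simulated_quantize(lhs, sign=True, rounding='round', nbit=16)
    rhs = attach_simulated_quantize(rhs, sign=True, rounding='round', nbit=16)
    return conv2d(lhs, rhs, ref_call.attrs)
```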
(TODO)
Support for different bits
- i8->i32
- i16->i32
- i8->i24
- i5->i16
Support for different quantization schemes (a brief sketch of the first two follows the list)
- Symmetric
- Asymmetric
- Channel-wise Scale
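For reference, the arithmetic difference between the symmetric and asymmetric schemes is small; a minimal NumPy sketch, with all names hypothetical and 8 bits assumed:

```python
import numpy as np

def symmetric_quantize(x, scale):
    # zero maps to zero; int8 range [-127, 127]
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def asymmetric_quantize(x, scale, zero_point):
    # a zero_point offset shifts the range so that uint8 [0, 255]
    # can cover data that is not centered around zero
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
```

Channel-wise scale would apply a separate `scale` per output channel instead of one scalar per tensor.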