First, let me reference @ajtulloch's comment about the quantization workflow:
1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision, as this has the highest throughput on most commonly used training hardware.
2. (Optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time (see the sketch after this list).
3. Train the model as usual.
4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc.) that rewrites the graph to target the int8 operators directly, i.e. remapping subgraphs such as FP32Conv + FP32Relu to a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by
   - calibration against an example set of activations, via e.g. l-p norm or KL minimization (C2/TF/MXNet/TensorRT), or
   - using activation ranges learned during training (C2/TF).
5. Using this quantized graph, evaluate various metrics to verify that the quantization-induced error/loss is acceptable.
6. Deploy the quantized graph.
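To make steps 2 and 4 concrete, here is a minimal NumPy sketch of the underlying numerics. It is not taken from any particular framework; the helper names and the asymmetric uint8 scheme are illustrative only. It shows how a (scale, zero_point) pair is derived from an observed range, and how fake quantization round-trips values through the integer grid so the training graph sees the rounding error.

```python
import numpy as np

def choose_qparams(x_min, x_max, num_bits=8):
    # Derive (scale, zero_point) for an asymmetric uint8-style scheme
    # from an observed activation range [x_min, x_max].
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # the range must contain 0
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)
    zero_point = int(np.clip(round(qmin - x_min / scale), qmin, qmax))
    return scale, zero_point

def fake_quantize(x, scale, zero_point, num_bits=8):
    # Simulated quantization: snap to the integer grid, then dequantize back
    # to float so the surrounding graph observes the rounding/clipping error.
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

x = np.random.randn(4, 8).astype("float32")
scale, zp = choose_qparams(float(x.min()), float(x.max()))
print(np.abs(x - fake_quantize(x, scale, zp)).max())  # error is at most ~scale/2
```

The same min/max-to-(scale, zero_point) computation is what a post-training calibration pass performs over a set of example activations at the requantization boundaries.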
However, frameworks like TensorFlow can already handle steps 1 through 5 well. For example, TensorFlow provides quantization-aware training, which covers step 2 and yields good accuracy in the end.
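For reference, quantization-aware training with the TF 1.x contrib API looks roughly like the sketch below; `build_model` and `build_loss` are hypothetical placeholders for the user's own graph, and the training/freezing/conversion steps are only outlined in comments.

```python
import tensorflow as tf  # TF 1.x style API

g = tf.Graph()
with g.as_default():
    images = tf.placeholder(tf.float32, [None, 224, 224, 3])
    logits = build_model(images)   # hypothetical model-building function
    loss = build_loss(logits)      # hypothetical loss function

    # Rewrite the training graph in place, inserting fake-quantization nodes
    # at the boundaries the converter can fuse later (Conv + BN + ReLU, ...).
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=2000)
    train_op = tf.train.AdamOptimizer().minimize(loss)

# After training: rebuild the graph for inference, call
# tf.contrib.quantize.create_eval_graph() on it, freeze it, and convert the
# frozen graph to a fully quantized .tflite file with the TFLite converter.
```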
In industry, one common scenario is that a company splits algorithm work and engine/framework work into two different teams. The algorithm team simply hands a model to the engine team to boost its performance. So if the algorithm team uses TensorFlow's quantization-aware training, they already know the accuracy before delivering the model to the engine team, and the engine team is responsible only for boosting performance.
For these reasons, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement for #2116; it is a supplement to TVM's own quantization.
After initial investigation and effort, INT8 achieves roughly a 30% speedup over FP32 for the MobileNet V1 model on ARM CPU.
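Once the INT8 frontend lands, the intended user-facing flow would look roughly like the sketch below. The file name, input tensor name, and target string are examples, and the exact return values of `relay.frontend.from_tflite` and `relay.build` depend on the TVM version.

```python
import tvm
from tvm import relay
import tflite.Model  # Python package generated from the TFLite flatbuffer schema

# Load a pre-quantized TFLite model (file and tensor names are examples).
buf = open("mobilenet_v1_1.0_224_quant.tflite", "rb").read()
tflite_model = tflite.Model.Model.GetRootAsModel(buf, 0)

# Import into Relay; quantized TFLite models take uint8 input tensors.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"},
)

# Compile for an ARM CPU (target string is an example).
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target, params=params)
```

The planned work items: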
- Support TFLite FP32 Relay frontend. PR: [TFLite] Support TFLite FP32 Relay frontend. #2365
- Support TFLite INT8 Relay frontend
- Extend the attributes of convolution and related ops to support quantization (see the sketch below)
- AutoTVM on ARM CPU can work with INT8
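To make the third item concrete: beyond what an FP32 convolution carries, a quantized convolution needs the (scale, zero_point) of its input, kernel, and output, plus a requantization of the int32 accumulator to the output grid. Below is a minimal NumPy sketch of that arithmetic; the names and the uint8 scheme are illustrative, and a matmul stands in for a 1x1 convolution.

```python
import numpy as np

def quantized_dense(x_q, w_q, x_zp, w_zp, x_scale, w_scale, y_scale, y_zp):
    # Integer matmul standing in for a 1x1 convolution: accumulate in int32
    # on the zero-point-shifted values, then requantize the accumulator to
    # the output's (scale, zero_point). Real kernels fold the float
    # multiplier into a fixed-point multiply plus shift.
    acc = (x_q.astype(np.int32) - x_zp) @ (w_q.astype(np.int32) - w_zp)
    multiplier = (x_scale * w_scale) / y_scale
    y = np.round(acc * multiplier) + y_zp
    return np.clip(y, 0, 255).astype(np.uint8)
```

These scales and zero points are exactly the extra attributes the convolution and related ops would need to carry.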
Any feedback is welcome.