
[RFC][Quantization] Support quantized models from TensorflowLite #2351

@FrozenGene


Let me first reference @ajtulloch 's comment about the quantization workflow:

  1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has the highest throughput on most commonly-used training hardware.

  2. (Optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

  3. Train the model as usual.

  4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc.) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries (see the sketch after this list), which can be done either by

  • calibration on an example set of activations, via e.g. l-p norm or KL minimization (C2/TF/MXNet/TensorRT), or
  • using activation ranges learned during training (C2/TF).

  5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

  6. Deploy the quantized graph.
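
To make the requantization step above concrete, here is a minimal NumPy sketch of the affine (scale / zero-point) scheme used for TFLite quantized tensors. This is illustration only; the `scale` and `zero_point` values are made-up examples, not taken from any real model.

```python
import numpy as np

def quantize(x, scale, zero_point, dtype=np.uint8):
    """Map float32 values onto the quantized integer grid."""
    q = np.round(x / scale) + zero_point
    info = np.iinfo(dtype)
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 0.0078125, 128        # example parameters, e.g. from calibration
q = quantize(x, scale, zero_point)        # -> [  0 128 192 255]
x_hat = dequantize(q, scale, zero_point)  # close to x, within one quantization step
```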

However, some frameworks, such as TensorFlow, can already do steps 1 through 5 well. For example, TensorFlow offers quantization-aware training, which performs step 2 and ultimately gives good accuracy.
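
As a rough illustration of what step 2 looks like on the TensorFlow side, the sketch below uses the TF 1.x `tf.contrib.quantize` graph-rewriting API that was current around the time of this RFC. The model body, loss, and hyperparameters are placeholders, and exactly which patterns get fake-quantized depends on TensorFlow's rewriter.

```python
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    images = tf.placeholder(tf.float32, [1, 224, 224, 3], name="input")
    # Placeholder FP32 model body; a real MobileNet V1 would go here.
    net = tf.layers.conv2d(images, 32, 3, padding="same", activation=tf.nn.relu)
    logits = tf.layers.dense(tf.layers.flatten(net), 1000)
    loss = tf.reduce_mean(logits)  # dummy loss, for illustration only

    # Rewrite the training graph in place, inserting fake-quantization nodes at
    # Conv+Relu (and similar) boundaries so training simulates INT8 numerics.
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=2000)

    train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)

# Train as usual; the learned activation ranges stored in the graph are later
# used by the TFLite converter to emit a fully quantized model.
```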

In industry, one common scenario is that a company splits the algorithm work and the engine / framework work between two different teams. The algorithm team simply hands a model to the engine team, whose job is to make it run fast. So if the algorithm team can use TensorFlow's quantization-aware training, they will know the accuracy before delivering the model to the engine team, and the engine team only needs to be responsible for performance.

For the reasons above, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement of #2116; it is just a supplement to TVM's quantization.
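
To show the intended user-facing flow, here is a hypothetical sketch of importing a pre-quantized TFLite model into TVM once the proposed frontend support lands. The uint8 handling in `from_tflite`, the model file name, and the target string are assumptions for illustration, not a description of what the frontend supports today.

```python
import tflite.Model
import tvm
from tvm import relay

# Load the serialized TFLite flatbuffer (hypothetical pre-quantized model file).
with open("mobilenet_v1_1.0_224_quant.tflite", "rb") as f:
    buf = f.read()
model = tflite.Model.Model.GetRootAsModel(buf, 0)

# Input name/shape/dtype of the quantized MobileNet V1 model.
shape_dict = {"input": (1, 224, 224, 3)}
dtype_dict = {"input": "uint8"}

# Assumed entry point: convert the quantized TFLite graph into Relay.
mod, params = relay.frontend.from_tflite(model, shape_dict=shape_dict,
                                         dtype_dict=dtype_dict)

# Build for an ARM CPU target and run with the graph runtime as usual.
target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)
```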

After an initial investigation and effort, on the MobileNet V1 model, INT8 gives roughly a 30% speedup over FP32 on an ARM CPU.

Welcome any feedback.
