[TOPI] VNNI support for batch matmul #10332
Conversation
Could you please share float batch_matmul perf data vs. the newly introduced int8 batch_matmul?
elvin-n left a comment:
LGTM
masahi left a comment:
OK, here is the comparison of GOPS between the new VNNI implementation and the existing generic code. Note that the VNNI numbers were obtained after only one or two minutes of tuning, whereas the generic schedules have a very large tuning space and took more than 12 hours to reach these numbers under the same tuning options. The script is at https://github.com/masahi/int8_experiment/blob/main/relay_bench.py. These results are from a Rocket Lake CPU.
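For reference, here is a minimal sketch (not the linked relay_bench.py; the workload shape and the cascadelake target string are my own assumptions) of how a GOPS number for a single int8 batch_matmul workload could be measured with TVM's graph executor:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Hypothetical workload shape; the linked script sweeps real model shapes.
batch, M, N, K = 8, 128, 128, 768

x = relay.var("x", shape=(batch, M, K), dtype="uint8")
y = relay.var("y", shape=(batch, N, K), dtype="int8")  # second operand in (B, N, K) layout
out = relay.nn.batch_matmul(x, y, out_dtype="int32")
mod = tvm.IRModule.from_expr(relay.Function([x, y], out))

# A VNNI-capable -mcpu (e.g. cascadelake, rocketlake) enables the AVX512-VNNI path.
target = "llvm -mcpu=cascadelake"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)

dev = tvm.cpu(0)
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("x", np.random.randint(0, 64, size=(batch, M, K)).astype("uint8"))
rt.set_input("y", np.random.randint(-64, 64, size=(batch, N, K)).astype("int8"))

mean_sec = rt.benchmark(dev, number=100).mean
print("GOPS:", 2.0 * batch * M * N * K / mean_sec / 1e9)
```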
tmoreau89 left a comment:
Thank you @masahi, the speedups you've reported are extremely impressive! LGTM
* add test
* compute added
* schedule works
* reuse dense_vnni schedule
* try an alternative approach to scheduling layout transform
* introduce a tunable knob to decide if compute_root
* check transpose condition
* support s8 + s8 input
* pylint
Following #10230, I added VNNI support for `batch_matmul` as well. The cool part is that I reuse the same `dense` schedule from #10230 to schedule the GEMM part, and parallelize over the batch dimension. See the perf result in #10332 (comment).

After this PR, I'll add `int8, int8` support to VNNI `dense` and `batch_matmul` (UPDATE: Done) - that will allow us to benchmark e2e performance on the QAT BERT made possible by @Icemist in #10239.
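For illustration, here is a minimal TE sketch of such a compute, assuming the second operand has already been packed into a (B, N//16, K//4, 16, 4) blocked layout; the function name and packing details are illustrative, not the exact TOPI code:

```python
from tvm import te

def batch_matmul_vnni_compute(X, packed_Y):
    """X: (B, M, K) uint8; packed_Y: (B, N // 16, K // 4, 16, 4) int8."""
    B, M, K = X.shape
    N = packed_Y.shape[1] * 16
    ak = te.reduce_axis((0, K), name="k")
    # Each (16, 4) inner block lines up with one VNNI vpdpbusd instruction:
    # 16 int32 accumulators, each taking a 4-element u8 x s8 dot product.
    return te.compute(
        (B, M, N),
        lambda b, i, j: te.sum(
            X[b, i, ak].astype("int32")
            * packed_Y[b, j // 16, ak // 4, j % 16, ak % 4].astype("int32"),
            axis=ak,
        ),
        name="T_batch_matmul_vnni",
    )
```

The GEMM part has the same blocked structure as the VNNI `dense` compute, which is why the `dense` schedule (tensorizing the inner 16x4 block with the VNNI dot-product intrinsic) can be reused, with an extra parallel loop over the batch axis.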
Unlike the `dense` case, the second input to `batch_matmul` is typically not a constant tensor, so I don't use `alter_layout` and compile-time layout transform. Instead, the layout transform is done at runtime, and the lowered IR for `batch_matmul` + post ops contains an explicit layout-transform stage before the GEMM. Future work can explore possibilities for eliminating the runtime layout transform, or pipelining layout transform and compute to hide the overhead.
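As a rough sketch of that runtime step (under the same assumed blocked layout as above, with N and K assumed divisible by 16 and 4), the packing of the second (B, N, K) input can itself be expressed as a te.compute stage:

```python
from tvm import te

def pack_second_operand(Y):
    """Y: (B, N, K) int8 -> (B, N // 16, K // 4, 16, 4), done at runtime."""
    B, N, K = Y.shape
    return te.compute(
        (B, N // 16, K // 4, 16, 4),
        lambda b, no, ko, ni, ki: Y[b, no * 16 + ni, ko * 4 + ki],
        name="T_layout_trans",
    )
```

Per the commit list, whether this packing stage is computed at root or fused into the surrounding schedule is left to a tunable knob.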
@elvin-n @mbrookhart @tkonolige @junrushao1994 @vinx13