[Metaschedule] Auto tensorization for CPU / GPU dot product #11088
Conversation
```c++
if (Optional<String> intrin_name =
        tir::GetAnn<String>(block_sref, tir::attr::meta_schedule_auto_tensorize)) {
  std::string block_name = block_sref->StmtAs<tir::BlockNode>()->name_hint;
  if (block_name.find("init") == std::string::npos) {
```
DecomposeReduction, applied before this postproc, copies the meta_schedule_auto_tensorize attribute to the init block as well. So we need to make sure that we won't try to tensorize the init block even though it carries a meta_schedule_auto_tensorize annotation.
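A small runnable illustration of that behavior on a toy workload of my own (it relies on DecomposeReduction copying block annotations, as described above; "some_intrin" is a placeholder value, not a registered intrinsic):

```python
import tvm
from tvm import te, tir

A = te.placeholder((16, 16), name="A")
k = te.reduce_axis((0, 16), name="k")
B = te.compute((16,), lambda i: te.sum(A[i, k], axis=k), name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))

block = sch.get_block("B")
# Placeholder annotation value, not a real registered intrinsic.
sch.annotate(block, "meta_schedule.auto_tensorize", "some_intrin")
i, _ = sch.get_loops(block)
init = sch.decompose_reduction(block, i)
print(sch.get(init).name_hint)    # "B_init" -- contains "init", which the check keys on
print(sch.get(init).annotations)  # the annotation is copied to the init block
```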
There is target-specific handling here; ideally we could make the init block behavior configurable in the meta schedule rule. It is fine for now.
```c++
ICHECK(child_blocks.size() == 1);
Array<LoopRV> init_loops = sch->GetLoops(child_blocks[0]);
ICHECK(init_loops.size() == 1);
sch->Vectorize(init_loops[0]);
```
Related to the above: since DecomposeReduction introduces a new loop that should be vectorized on CPU, for now I'm applying vectorization to the decomposed init loop here. This could also be done in RewriteReductionBlock.
Does postproc::RewriteParallelVectorizeUnroll work for this case?
I hoped it would, but it doesn't. Also, since parallelization etc. is supposed to be applied before DecomposeReduction, I don't think running RewriteParallelVectorizeUnroll after RewriteReductionBlock() is a good idea. So vectorization of the init loop has to be done manually somehow.
I'd prefer vectorizing the init loop right after we run DecomposeReduction during RewriteReductionBlock, since vectorization of the init loop should be done on CPU regardless of tensorization. cc @MasterJH5574
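For concreteness, a minimal Python sketch of that manual step on a toy workload (shapes and names are illustrative only, not from this PR): decompose the reduction, then vectorize the fresh init loop.

```python
import tvm
from tvm import te, tir

# Toy reduction: B[i] = sum_k A[i, k]
A = te.placeholder((64, 64), name="A")
k = te.reduce_axis((0, 64), name="k")
B = te.compute((64,), lambda i: te.sum(A[i, k], axis=k), name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))

block = sch.get_block("B")
i, _ = sch.get_loops(block)
# decompose_reduction gives the init block its own copy of the spatial loop
init_block = sch.decompose_reduction(block, i)
(init_loop,) = sch.get_loops(init_block)
sch.vectorize(init_loop)  # the manual step discussed above
print(sch.mod.script())
```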
Interesting! What’s the order of post-processors being applied now? Perhaps we should reflect this order by adding this post-processor to tune.py
tvm/python/tvm/meta_schedule/tune.py, lines 159 to 170 in effc23d:

```python
@staticmethod
def _postproc() -> List[Postproc]:
    from tvm.meta_schedule import postproc as M

    return [
        M.DisallowDynamicLoop(),
        M.RewriteCooperativeFetch(),
        M.RewriteUnboundBlock(),
        M.RewriteParallelVectorizeUnroll(),
        M.RewriteReductionBlock(),
        M.VerifyGPUCode(),
    ]
```
The issue in question is vectorization for CPU targets. I'm using the default postprocs in
tvm/python/tvm/meta_schedule/tune.py, lines 96 to 103 in effc23d:

```python
def _postproc() -> List[Postproc]:
    from tvm.meta_schedule import postproc as M

    return [
        M.DisallowDynamicLoop(),
        M.RewriteParallelVectorizeUnroll(),
        M.RewriteReductionBlock(),
    ]
```
Since loop parallelization and vectorization check the "compact dataflow" constraint (tvm/src/tir/schedule/primitive/for_kind.cc, line 160 in 0ddaaa6):

```c++
CheckSubtreeCompactDataflow(self, loop_sref);
```

they have to be applied before DecomposeReduction in RewriteReductionBlock(). So having RewriteParallelVectorizeUnroll before RewriteReductionBlock() in the default postprocs makes sense.
However, this is not sufficient to vectorize the init loop of a reduction block, since that loop is generated during RewriteReductionBlock(). I don't think we should run RewriteParallelVectorizeUnroll again after RewriteReductionBlock() (and it doesn't work anyway), so we need to manually vectorize the decomposed init loop in either RewriteReductionBlock or the new RewriteTensorize postproc I added. I prefer the former.
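To make the intended ordering concrete, a hedged sketch of a CPU postproc list that appends the new postproc (I'm assuming RewriteTensorize has a usable default constructor; treat the exact signature as an assumption, not this PR's final API):

```python
from typing import List

from tvm.meta_schedule import postproc as M
from tvm.meta_schedule.postproc import Postproc


def _postproc_with_tensorize() -> List[Postproc]:
    # Parallelize/vectorize while compact dataflow still holds, then decompose
    # reductions, then rewrite the blocks marked for tensorization.
    return [
        M.DisallowDynamicLoop(),
        M.RewriteParallelVectorizeUnroll(),
        M.RewriteReductionBlock(),
        M.RewriteTensorize(),  # assumed default constructor; introduced in this PR
    ]
```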
In this case I want to tensorize the reduction block. So before DecomposeReduction is called, the loop kind of the reduction is serial, which makes the decomposed init loop serial as well.
I see. So the schedule rule ParallelVectorizeUnroll wasn't applied to the block we want to tensorize either 🤔?
Ah yes (otherwise tensorize pattern matching fails, because an intrin desc is always serial). I'm not exactly sure what prevents ParallelVectorizeUnroll from tampering with the block we want to tensorize (which is a good thing); maybe the Blockize I do at

```c++
tir::BlockRV outer_block = sch->Blockize(tiled_loop_rv.value());
```

(after tiling the inner loop nests to be tensorized) is helping?
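To illustrate what that Blockize step does, a toy Python sketch (shapes and the split factor are made up, not this PR's code): after tiling so the inner loops match the intrinsic's extent, blockize isolates them into an opaque outer block.

```python
import tvm
from tvm import te, tir

A = te.placeholder((64, 4), name="A", dtype="int32")
r = te.reduce_axis((0, 4), name="r")
C = te.compute((64,), lambda i: te.sum(A[i, r], axis=r), name="C")
sch = tir.Schedule(te.create_prim_func([A, C]))

block = sch.get_block("C")
i, r_loop = sch.get_loops(block)
# Tile so the innermost loops match a (hypothetical) 16x4 dot-product intrinsic.
i_outer, i_inner = sch.split(i, factors=[None, 16])
# Python equivalent of the C++ sch->Blockize(tiled_loop_rv) above: the loop
# nest under i_inner becomes a nested block behind a new outer block.
outer_block = sch.blockize(i_inner)
print(sch.mod.script())
```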
Quite interesting. So here the case is: on one hand we don't want the block to be annotated by the rule ParallelVectorizeUnroll, but on the other hand we do want its init block to be vectorized after the decomposition. Am I right?
Since the block wasn't annotated by ParallelVectorizeUnroll before decomposition, the decomposed init block isn't vectorized, which makes sense. In addition, the decomposed init block doesn't carry any information to indicate that it's supposed to be vectorized (e.g., it doesn't have a "needs vectorization" annotation). In this case, whether we vectorize the init block loop in RewriteReductionBlock or RewriteTensorize, it's all due to our human knowledge, which I don't think is perfect.
For upstreaming, it might be okay to do manual vectorization in RewriteTensorize (how does the vectorization in RewriteTensorize bypass the compact dataflow issue, BTW?). But in the long term I suppose we should enhance the compact dataflow check to allow such vectorization. After all, such vectorization won't incur any incorrectness.
> Quite interesting. So here the case is: on one hand we don't want the block to be annotated by the rule ParallelVectorizeUnroll, but on the other hand we do want its init block to be vectorized after the decomposition. Am I right?
Exactly.
> how does the vectorization in RewriteTensorize bypass the compact dataflow issue, BTW?
That's a great question! Until recently, vectorization of the init loop after DecomposeReduction was rejected by the compact dataflow check. I brought this topic to @Hzfengsy and the team came up with a relaxation of the constraint that allows vectorizing the init loop. This is PR #10705.
Yeah, ideally all outer-loop parallelization and inner-loop vectorization could be done by one pass of ParallelVectorizeUnroll, meaning we would run it after DecomposeReduction. Currently, outer-loop parallelization after DecomposeReduction would be rejected by the compact dataflow check, but I think this is still too restrictive.
I'm super excited to see this PR!! Would love to have some helping hands review this PR :-) CC: @vinx13 @spectrometerHBH

Some perf numbers:
- int8 VNNI, Rocket Lake 6 core
- RTX 3070 with DP4A (FP32 peak around 16 TFLOPS)
- AMDGPU RX6600xt with DP4A (FP32 peak around 10 TFLOPS)
MasterJH5574 left a comment:
Thanks for the efforts! Excited to see auto-tensorization happening!
MasterJH5574 left a comment:
Should we update the list of post-processors here as well?
tvm/include/tvm/meta_schedule/postproc.h, line 110 in effc23d:

```c++
class Postproc : public runtime::ObjectRef {
```
Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Wuwei Lin <[email protected]>
[Metaschedule] Auto-tensorization for CPU / GPU dot product (#11088)

* [Metaschedule] Auto-tensorization for CPU / GPU dot product

Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Wuwei Lin <[email protected]>

* doc update
* add vnni conv2d test
* add dp4a test
* adding tests for rewrite_tensorize
* add rewrite_tensorize test
* add missing pydoc
* black
* more doc
* adding auto tensorize integration test
* add dp4a test
* fix target name
* fix dtype in test
* skip bert test
* replace hard-coded llvm intrinsic id in test with look up
* remove unnecessary include, add doc for the rest of params
* update postproc.h
* update doc
* fix shape in te matmul workload
* fix newline in cppdoc

Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Wuwei Lin <[email protected]>
Building on #11075, this adds the MultiLevelTilingWithIntrin schedule rule and the RewriteTensorize postproc, which can be used for auto-tensorization with a single intrinsic, such as a CPU / GPU dot product. This is the simplest but non-trivial use of auto-tensorization. The diff looks large, but most of it is boilerplate from tests; the actual change to enable auto-tensorization is about 300 lines.

MultiLevelTilingWithIntrin can be used to auto-tensorize schedules with the following intrinsics. We should be able to deprecate the corresponding manual templates in AutoTVM, but detailed perf analysis is yet to be done:
- VNNI and ARM dot product (sdot) (cc @tkonolige)
- dp4a for cuda, SPIRV integer dot product for vulkan, and AMDGPU gfx10 sdot4 for rocm

(A hedged usage sketch of the new rule follows below.)
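A sketch of instantiating the new rule by hand; the constructor signature and the intrinsic name "dot_16x4_vnni" are my assumptions for illustration, not verified against this PR:

```python
from tvm.meta_schedule import schedule_rule as SR

# Assumed: "dot_16x4_vnni" names a registered VNNI dot-product tensor intrinsic,
# and the rule takes the intrinsic name plus the usual CPU tiling structure.
rule = SR.MultiLevelTilingWithIntrin(
    intrin_name="dot_16x4_vnni",
    structure="SSRSRS",  # spatial (S) / reduction (R) tiling pattern for CPU
)
```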
As a demonstration, I've added integration tests in tests/python/integration/test_meta_schedule_auto_tensorize.py, one of which is E2E auto-tensorization on quantized bert-base x {VNNI, DP4A}. DP4A tests can also run on AMDGPU via the vulkan or rocm backends (@mei-ye @tmoreau89).

Co-authored-by: Siyuan Feng [email protected]
Co-authored-by: Bohan Hou [email protected]
Co-authored-by: Hongyi Jin [email protected]
Co-authored-by: Ruihang Lai [email protected]
Co-authored-by: Wuwei Lin [email protected]
@junrushao1994 @vinx13 @comaniac @mbrookhart @spectrometerHBH @Hzfengsy @MasterJH5574 @jinhongyii