[MetaSchedule] Enable anchor-block tuning #13206
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
junrushao left a comment:
This is just amazing!! Thanks for the PR!
Hzfengsy left a comment:
LGTM. Thanks @masahi for the great work!
zxybazh left a comment:
Apart from a few nits, this looks pretty good to me. Very delicate design and comprehensive tests! Thanks @masahi!
Commits:
* Introduce new module equality to extract only anchor block tasks
* enabling application of anchor trace to different subgraph
* fixed anchor block extraction
* fixed UB in task extraction
* Reworked anchor trace application and inlining logic
* fixed anchor block extraction for winograd
* fix inline logic for winograd
* refactor, clean up, renaming
* fix reverse compute inline unapplicable case
* fixed get_block applicablity condition
* adding test
* introduce HasBlock utility
* Decoupled trace creation and application in Trace::ApplyJSONToschedule
* add test
* adding more test
* black
* Revert "Decoupled trace creation and application in Trace::ApplyJSONToschedule" (this reverts commit 02df571)
* add tests
* add doc
* use anchor tuning in hexagon int8 tuning test
* cpplint
* suppress mypy on ffi
* add workaround for false positive maybe-uninitialized warning
* add a minimal anchor tuning test
* relax tol for i386, remove gpu test since it requires sm86
* add doc for "anchor-block" module equality
* address comments
* add test for cache_write + AllocateConst bug
A follow-up PR notes: "Following #13206, this PR brings the new parameter added to the AutoBind schedule rule to the Python side."
Note: Most diffs are from test cases, which are bloated due to many TVMScript modules.
Building on the notion of "module equality" introduced in #13050, I'm adding a new variant of module equality based on "anchor blocks", as defined in #13194.
Currently, MS does tuning at the level of subgraphs. For example, resnet has `conv2d -> add -> relu` and `conv2d -> add -> add -> relu` subgraphs, and the two are treated as distinct tuning tasks even if their anchor-block conv2d workloads are identical. The new module equality identifies them as equal, so it reduces the number of tuning tasks when shorter tuning time is preferred over subgraph-level performance. Currently there is no dedicated API for anchor-block tuning; passing `module_equality="anchor-block"` to task extraction and `tune_relay` enables it.

This is particularly effective for int8 models, since each conv2d / dense (anchor) op is quantized slightly differently, which leaves many similar but not identical elemwise ops fused after the anchor blocks. On the int8 resnet50 model I tested, it reduced the number of conv2d tuning tasks from 36 to 23.
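As a concrete illustration, here is a minimal sketch of enabling anchor-block tuning; the exact `extract_tasks` / `tune_relay` signatures are assumptions based on the MetaSchedule relay integration around the time of this PR, and the float32 resnet50 from `relay.testing` merely stands in for a real int8 model:

```python
# Sketch: enabling anchor-block tuning via module_equality.
# The extract_tasks / tune_relay signatures below are assumptions based on
# tvm.meta_schedule.relay_integration around the time of this PR.
import tvm
from tvm import meta_schedule as ms
from tvm.relay import testing

mod, params = testing.resnet.get_workload(batch_size=1, num_layers=50)
target = tvm.target.Target("llvm -num-cores 8")

# With "anchor-block", subgraphs whose anchor blocks (e.g. conv2d) are
# structurally equal collapse into a single tuning task.
tasks = ms.relay_integration.extract_tasks(
    mod, target, params, module_equality="anchor-block"
)
print(f"{len(tasks)} tasks extracted")

# The same flag on tune_relay makes one tuned anchor trace serve every
# subgraph that shares the anchor block.
database = ms.relay_integration.tune_relay(
    mod=mod,
    params=params,
    target=target,
    work_dir="./tune_logs",
    max_trials_global=2000,
    module_equality="anchor-block",
)
```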
The interesting question is the performance difference between full subgraph-level tuning and anchor-block tuning. To investigate this, I tested the int8 resnet50 mentioned above, where anchor-based tuning makes the biggest difference in the number of extracted tasks. The results are summarized below.
`num_iter_per_task = 32` and `max_iters_per_task = 128` in all cases. Across the model + target combinations I tested, I didn't see much perf difference beyond natural tuning flakiness. The tensorcore result is a bit weird and needs more investigation, but I found that tuning this model with int8 tensor core auto-tensorization is currently incredibly slow (for example, getting the 6.7 result took 12 hours). There is also the correctness issue discussed in #13204, so I haven't done more experiments on tensorcore.
## Applying the trace tuned on an anchor block to the target block
The tricky problem this work addresses is the application of a trace, tuned on a "representative" anchor subgraph, to a target mod that has different post blocks. Note that, in the resnet example with `conv2d -> add -> relu` and `conv2d -> add -> add -> relu` subgraphs, we tune the "smaller" `conv2d -> add -> relu` subgraph, not just the pure anchor block `conv2d`.

So while applying a trace tuned on just `conv2d` to `conv2d -> add` would be trivial (the existing `Trace::ApplyToSchedule` would just work), in practice we would be applying a trace tuned on `conv2d -> add` to `conv2d -> subtract`, for example. My proposed solution is implemented in `src/meta_schedule/trace_apply.cc` and is tested extensively in `test_meta_schedule_trace_apply.py`; a minimal usage sketch follows below.

@junrushao @zxybazh @vinx13 @tkonolige
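For reference, a minimal sketch of driving the trace-apply entry point from Python; the `schedule_using_anchor_trace` binding and its `(sch, anchor_trace, target)` signature are my reading of this PR's FFI surface, so treat them as assumptions:

```python
# Sketch: replay a trace tuned on an anchor subgraph onto a target module
# with different post blocks. The schedule_using_anchor_trace binding is
# assumed from this PR's FFI changes (see test_meta_schedule_trace_apply.py).
import tvm
from tvm import tir
from tvm.target import Target
from tvm.meta_schedule.trace_apply import schedule_using_anchor_trace


def apply_anchor_trace(
    anchor_sch: tir.Schedule, target_mod: tvm.IRModule, target: Target
) -> tir.Schedule:
    """Apply a tuned anchor trace (e.g. from conv2d -> add -> relu) to
    target_mod (e.g. conv2d -> add -> add -> relu), letting the trace-apply
    logic inline or adjust the post blocks that differ."""
    target_sch = tir.Schedule(target_mod)
    schedule_using_anchor_trace(target_sch, anchor_sch.trace, target)
    return target_sch
```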