Conversation

@masahi (Member) commented on Oct 26, 2022

Note: most of the diff comes from test cases, which are bloated by the many TVMScript modules.

Building on the notion of "module equality" introduced in #13050, I'm adding a new variant of module equality based on "anchor blocks". The anchor block is defined in #13194.
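As a rough illustration of the idea (the precise definition lives in #13194), consider a fused PrimFunc with one compute-heavy reduction block followed by an elementwise block; the reduction block plays the role of the anchor. The module below is made up for this explanation and is not from the PR:

```python
# Hypothetical fused workload, for illustration only. Per my reading of
# #13194, the anchor block is roughly the dominant (here: sole reduction)
# block; the trailing elementwise block is a "post block".
from tvm.script import tir as T

@T.prim_func
def fused_matmul_relu(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32"),
    D: T.Buffer((128, 128), "float32"),
):
    C = T.alloc_buffer((128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):  # anchor block: the heavy reduction
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]
    for i, j in T.grid(128, 128):
        with T.block("relu"):  # post block: fused elementwise op
            vi, vj = T.axis.remap("SS", [i, j])
            D[vi, vj] = T.max(C[vi, vj], T.float32(0))
```

Two fused subgraphs whose anchor blocks are identical in this sense compare equal under the new module equality, regardless of their post blocks.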

Currently, MetaSchedule (MS) tunes at the subgraph level. In resnet, for example, the conv2d -> add -> relu and conv2d -> add -> add -> relu subgraphs are treated as distinct tuning tasks even though their anchor-block conv2d workloads are identical. The new module equality identifies them as equal, reducing the number of tuning tasks when shorter tuning time is preferred over subgraph-level performance. Currently there is no dedicated API for anchor-block tuning; passing module_equality="anchor-block" to task extraction and tune_relay enables it, as sketched below.
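For concreteness, here is a minimal sketch of enabling it. Only the module_equality="anchor-block" argument is what this PR adds; the entry-point names and the mod / params setup are assumptions about the MetaSchedule Python API of this era:

```python
# Hedged sketch; entry points and signatures are assumptions, and
# mod / params stand in for a real Relay module and its parameters.
import tvm
from tvm import meta_schedule as ms

target = tvm.target.Target("llvm -num-cores 8")

# Task extraction: subgraphs whose anchor blocks are identical are
# collapsed into a single tuning task.
tasks = ms.relay_integration.extract_tasks(
    mod, target, params, module_equality="anchor-block"
)

# End-to-end tuning with the same equality setting.
database = ms.relay_integration.tune_relay(
    mod=mod,
    params=params,
    target=target,
    work_dir="./tuning_logs",
    max_trials_global=1000,
    module_equality="anchor-block",
)
```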

This is particularly effective for int8 models, since each conv2d / dense (anchor) op is quantized slightly differently, ending up with many similar but not identical elementwise ops fused after the anchor blocks. On the int8 resnet50 model I tested, it reduced the number of conv2d tuning tasks from 36 to 23.

The interesting question is the performance difference between full subgraph-level tuning and anchor-block tuning. To investigate this, I tested the int8 resnet50 mentioned above, where anchor-based tuning makes the biggest difference in the number of extracted tasks. The results are summarized below.

num_iter_per_task = 32 and max_iters_per_task = 128 in all cases.

| Target | Anchor tuning | Subgraph tuning |
| --- | --- | --- |
| x86 VNNI | 4.45 | 4.58 |
| Hexagon | 58.4 | 58.1 |
| CUDA tensor core (batch size 16) | 6.0 | 6.7 (TODO: try again) |

Across the model + target combinations I tested, I didn't see much performance difference beyond natural tuning flakiness. The tensor core result is a bit odd and needs more investigation, but I found that tuning this model using int8 tensor core auto-tensorization is currently very slow (for example, getting the 6.7 result took 12 hours). There is also the correctness issue discussed in #13204. So I haven't done more tensor core experiments.

Applying the trace tuned on an anchor block to the target block

The tricky problem this work addresses is the application of a trace, tuned on a "representative" anchor subgraph, to a target mod that has different post blocks. Note that in the resnet example, where there are conv2d -> add -> relu and conv2d -> add -> add -> relu subgraphs, we tune the "smaller" conv2d -> add -> relu subgraph, not just the pure anchor block conv2d.

So while applying a trace tuned on just conv2d to conv2d -> add would be trivial (the existing Trace::ApplyToSchedule would just work), in practice we may be applying a trace tuned on conv2d -> add to conv2d -> subtract, for example. My proposed solution is implemented in src/meta_schedule/trace_apply.cc and tested extensively in test_meta_schedule_trace_apply.py; a hedged usage sketch follows.
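The sketch below shows what applying an anchor trace to a different subgraph could look like. The binding name follows my reading of test_meta_schedule_trace_apply.py; the two modules and the hand-recorded toy trace are placeholders:

```python
# Hedged sketch. mod_conv2d_add_relu and mod_conv2d_subtract are
# hypothetical IRModules standing for the two fused subgraphs; a real
# anchor trace would come from a tuning database rather than being
# recorded by hand as done here.
import tvm
from tvm.meta_schedule.trace_apply import schedule_using_anchor_trace

target = tvm.target.Target("llvm")

# Record a toy trace on the representative conv2d -> add -> relu subgraph.
anchor_sch = tvm.tir.Schedule(mod_conv2d_add_relu, traced=True)
block = anchor_sch.get_block("conv2d")  # hypothetical block name
loops = anchor_sch.get_loops(block)
anchor_sch.parallel(loops[0])
anchor_trace = anchor_sch.trace

# Replay the anchor-block decisions on the conv2d -> subtract subgraph;
# post blocks not covered by the trace are handled (e.g. inlined) by the
# logic in src/meta_schedule/trace_apply.cc.
sch = tvm.tir.Schedule(mod_conv2d_subtract)
schedule_using_anchor_trace(sch, anchor_trace, target)
print(sch.mod.script())
```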

@junrushao @zxybazh @vinx13 @tkonolige

@tvm-bot (Collaborator) commented on Oct 26, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@masahi marked this pull request as ready for review on October 27, 2022 at 01:45.
@junrushao (Member) left a comment

This is just amazing!! Thanks for the PR!

@Hzfengsy (Member) left a comment

LGTM. Thanks @masahi for the great work!

@zxybazh (Member) left a comment

Apart from a few nits, this looks pretty good to me: very delicate design and comprehensive tests! Thanks @masahi!

@zxybazh merged commit f42826e into apache:main on Oct 28, 2022.
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 10, 2022
* Introduce new module equality to extract only anchor block tasks

* enabling application of anchor trace to different subgraph

* fixed anchor block extraction

* fixed UB in task extraction

* Reworked anchor trace application and inlining logic

* fixed anchor block extraction for winograd

* fix inline logic for winograd

* refactor, clean up, renaming

* fix reverse compute inline unapplicable case

* fixed get_block applicablity condition

* adding test

* introduce HasBlock utility

* Decoupled trace creation and application in Trace::ApplyJSONToschedule

* add test

* adding more test

* black

* Revert "Decoupled trace creation and application in Trace::ApplyJSONToschedule"

This reverts commit 02df571.

* add tests

* add doc

* use anchor tuning in hexagon int8 tuning test

* cpplint

* suppress mypy on ffi

* add workaround for false positive maybe-uninitialized warning

* add a minimal anchor tuning test

* relax tol for i386, remove gpu test since it requires sm86

* add doc for "anchor-block" module equality

* address comments

* add test for cache_write + AllocateConst bug
junrushao pushed a commit that referenced this pull request Nov 21, 2022
Following #13206, this PR brings the new parameter added to the AutoBind schedule rule to Python side.