[CUDA] Simple extend to optimize reuse for static shared memory. #16342
Conversation
This is a really amazing addition! In particular, I found it painful that the existing storage-rewrite pass doesn't handle heterogeneous dtypes (which CUDA does support casting between), and on the second point, yes, it's limited by the current rewriting. Thanks for the contribution!
```python
def run_passes(sch, args):
    mod = schedule_to_module(sch, args)
    with tvm.transform.PassContext(config={"tir.merge_static_smem": True}):
```
In which case should we turn this flag off?
`tir.merge_static_smem` is set to `False` by default to keep the generated code more readable (for example, maintaining clear definitions of `A_shared` and `B_shared` instead of `(half*)(buf_shmem + offset)`), so it should be enabled manually.
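A minimal sketch of the difference in the generated CUDA (buffer names and sizes here are illustrative, not taken from this PR):

```cuda
// Default (tir.merge_static_smem=False): each buffer keeps its own definition.
__shared__ half A_shared[4096];
__shared__ half B_shared[4096];

// Merging enabled: one byte pool; buffers become casted offsets into it,
// which is harder to read but lets heterogeneous dtypes share storage.
__shared__ __align__(16) char buf_shmem[16384];
half* A_ptr = (half*)(buf_shmem + 0);
half* B_ptr = (half*)(buf_shmem + 8192);
```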
Got it, so it's basically turned off for better readability; is my understanding correct?
junrushao left a comment:
Overall LGTM!
It seems that this PR was not merged in squash mode.

Oops, sorry, this is a mistake.

This PR should have been sent to
#759 proposed a pass `storage_rewrite` that provides a trivial storage reuse plan based on liveness analysis. As #9341 mentioned, that solution has some limitations. #8571 and #9341 introduced a pass `MergeDynamicSharedMemoryAllocations`, which supports efficient memory reuse, but only for dynamic shared memory. However, sometimes we do not want to use dynamic shared memory for codegen, so this pull request makes a simple extension of `MergeDynamicSharedMemoryAllocations` to support optimal reuse of both dynamic and static shared memory.

By default, the static shared memory merge is disabled to maintain consistency with the current default behavior; to enable the static part:
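A minimal sketch of enabling it (the flag is the one shown in the test snippet above; `tvm.build`, `mod`, and the target are illustrative assumptions):

```python
import tvm

# Opt in to merging static shared memory allocations (off by default).
with tvm.transform.PassContext(config={"tir.merge_static_smem": True}):
    rt_mod = tvm.build(mod, target="cuda")  # mod: a scheduled TIR module
```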
Take an int8 x int8 = int32 tensor core GEMM as an example. With a big tile using static shared memory, before the pass:
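A sketch of the "before" state, with hypothetical tile sizes chosen only to illustrate the problem:

```cuda
// Three separate static allocations: 16 KB + 16 KB + 32 KB = 64 KB in total,
// which exceeds the 48 KB per-block static shared memory limit.
__shared__ signed char A_shared[16384];  // int8 tile of A, live during the mainloop
__shared__ signed char B_shared[16384];  // int8 tile of B, live during the mainloop
__shared__ int C_shared[8192];           // int32 output tile (32 KB), live in the epilogue
```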
it will exceed the maximum available static shared memory (48 KB per block), and compilation will fail. After this pass:
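A sketch of the "after" state under the same hypothetical sizes:

```cuda
// One merged static buffer: A_shared/B_shared are dead before C_shared becomes
// live, so the epilogue tile reuses their storage, 32 KB instead of 64 KB
// (the real codegen also takes care of alignment).
__shared__ signed char buf_shmem[32768];
signed char* A_shared = buf_shmem;          // offset 0, 16 KB
signed char* B_shared = buf_shmem + 16384;  // offset 16 KB, 16 KB
int* C_shared = (int*)buf_shmem;            // overlaps A/B after they are dead
```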
we can save around 50% of the shared memory and the compilation passes. Code generation performance with fast dlight can reach 510+ TFLOPS (without the pass, the best tile is around 420 TFLOPS on A100), so this pass will enable us to explore more tile configurations under static shared memory.
Moreover, the pass can optimize the dynamic shared memory plan as well: the `storage_rewrite` pass would merge `C_shared` into `B_shared` in this example, which is not friendly for further memory plan analysis. The flag `merge_static_smem` will disable that trivial reuse behavior as follows (I don't know if the flag handling can be improved):
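A rough sketch (not the actual diff) of how the flag can gate the trivial reuse inside `storage_rewrite`; only the flag name and `PassContext::GetConfig` are taken from TVM, the surrounding names are illustrative:

```cpp
// Inside the StorageRewrite pass function (ctx is the current PassContext):
bool merge_static_smem =
    ctx->GetConfig<Bool>("tir.merge_static_smem", Bool(false)).value();
// When merging is requested, skip the liveness-based reuse here and leave
// shared buffers untouched, so the merge pass can plan them instead.
bool enable_reuse = !merge_static_smem;
```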