
Conversation

@LeiWang1999
Contributor

@LeiWang1999 commented Jan 3, 2024

#759 proposed the storage_rewrite pass, which provides a trivial storage reuse plan based on liveness analysis. As #9341 mentioned, that solution has some limitations:

  1. storage_rewrite can't handle buffers with different dtypes.
    int8 A_shared[32];
    int8 B_shared[32];
    int32 C_shared[4]; // will not be reused even if there is enough workspace, because the dtypes differ.
  2. storage_rewrite can't allocate one buffer in the place of two other buffers (see the sketch after this list).
       int8 A_shared[32];
       int8 B_shared[32];
       int8 C_shared[64];
      // C_shared is merged into only one of them, leaving A_shared[32] and B_shared[64], which wastes 32 elements of space.
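
For intuition, here is a minimal hand-written CUDA sketch of the kind of layout a single merged pool makes possible (the buf_shmem name follows the generated code shown below; the offsets and liveness assumptions are illustrative only, not this pass's actual packing decisions):

__global__ void merged_pool_sketch(int* out) {
  // One untyped byte pool instead of per-buffer, per-dtype allocations.
  __shared__ __align__(16) unsigned char buf_shmem[64];

  // Limitation 1 goes away: different dtypes become typed views at byte offsets.
  signed char* A_shared = (signed char*)(buf_shmem + 0);   // 32 x int8
  signed char* B_shared = (signed char*)(buf_shmem + 32);  // 32 x int8
  if (threadIdx.x == 0) {
    A_shared[0] = 1;
    B_shared[0] = 2;
  }
  __syncthreads();
  int partial = A_shared[0] + B_shared[0];
  __syncthreads();

  // Limitation 2 goes away: once A_shared/B_shared are dead, a buffer of any
  // dtype and any size up to 64 bytes can take over their combined space.
  int* C_shared = (int*)(buf_shmem + 0);  // 4 x int32
  if (threadIdx.x == 0) C_shared[0] = partial;
  __syncthreads();
  if (threadIdx.x == 0) out[0] = C_shared[0];
}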

#8571 and #9341 introduced the MergeDynamicSharedMemoryAllocations pass, which supports efficient memory reuse, but only for dynamic shared memory. However, sometimes we do not want to use dynamic shared memory for codegen, so this pull request makes a simple extension to MergeDynamicSharedMemoryAllocations to support optimal reuse of both dynamic and static shared memory.

By default, static shared memory merging is disabled to keep the existing behavior. To enable it:

with tvm.transform.PassContext(config={"tir.merge_static_smem": True}):
    cuda_mod = tvm.build(sch.mod, target="cuda")

Take an int8xint8=int32 tensorcore GEMM with a big tile and static shared memory as an example. Before the pass:

__global__ void __launch_bounds__(128) Fused(int8_t* __restrict__ input0, int8_t* __restrict__ input1, int* __restrict__ output0) {
  
  int mediate0_shared_warp[128];
  __shared__ signed char input0_shared[16384];
  __shared__ signed char input1_shared[16384];
  signed char input0_shared_warp[64];
  signed char input1_shared_warp[64];
  signed char input0_shared_warp_1[64];
  signed char input1_shared_warp_1[64];
  __shared__ int mediate0_shared[6400];

the static allocations add up to 16384 + 16384 + 6400 * 4 = 58368 bytes, which exceeds the maximum available static shared memory (48 KB per block), so compilation fails. After this pass:

__global__ void __launch_bounds__(128) Fused(int8_t* __restrict__ input0, int8_t* __restrict__ input1, int* __restrict__ output0) {
  
  __shared__ uchar buf_shmem[32768];
  int mediate0_shared_warp[128];
  signed char input0_shared_warp[64];
  signed char input1_shared_warp[64];
  signed char input0_shared_warp_1[64];
  signed char input1_shared_warp_1[64];

we save around 50% of the shared memory (32768 bytes instead of 58368) and compilation succeeds. Code generation with fastdlight can achieve 510+ TFLOPS, whereas without the pass the best feasible tile reaches around 420 TFLOPS on A100. This pass enables us to explore more tile configs under static shared memory.
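
For intuition, the 32 KB pool works out because the two 16 KB input tiles are live at the same time, while the 25600-byte mediate0_shared tile can reuse their space once they are consumed. A rough hand-written sketch of such a layout (the offsets are illustrative; the pass computes the real ones):

__global__ void __launch_bounds__(128) Fused_layout_sketch(int* __restrict__ output0) {
  // One 32 KB pool: max(16384 + 16384, 25600) = 32768 bytes,
  // instead of 16384 + 16384 + 25600 = 58368 bytes of separate allocations.
  __shared__ __align__(16) unsigned char buf_shmem[32768];

  // While the input staging tiles are live:
  signed char* input0_shared = (signed char*)(buf_shmem + 0);      // 16384 B
  signed char* input1_shared = (signed char*)(buf_shmem + 16384);  // 16384 B
  input0_shared[threadIdx.x] = 0;
  input1_shared[threadIdx.x] = 0;
  __syncthreads();
  // ... mma pipeline consuming the input tiles ...
  __syncthreads();

  // After the input tiles are dead, the output staging tile reuses the pool:
  int* mediate0_shared = (int*)(buf_shmem + 0);                    // 6400 x int32
  mediate0_shared[threadIdx.x] = 0;
  __syncthreads();
  output0[threadIdx.x] = mediate0_shared[threadIdx.x];
}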

Moreover, the pass can optimize the dynamic shared memory plan as well: storage_rewrite would merge C_shared into B_shared in the example above, which is not friendly to further memory-plan analysis. The merge_static_smem flag disables that trivial reuse behavior via the check below (I don't know whether this flag handling can be improved):

// Skip the trivial in-place reuse and create a fresh allocation when reuse is
// disabled (e.g. the buffer will be handled by the merge pass instead), the
// array is small, or the memory space is not flat.
if (!enable_reuse || is_small_array || !is_flat_memory_space) {
  return NewAlloc(op, attach_scope, scope, const_nbits);
}

@junrushao
Member

This is a really amazing addition! In particular, I found it painful that the existing storage-rewrite pass doesn't handle heterogeneous dtypes (between which CUDA does support casting), and on the second point, yes, it's limited by the current rewriting - thanks for the contribution!


def run_passes(sch, args):
    mod = schedule_to_module(sch, args)
    with tvm.transform.PassContext(config={"tir.merge_static_smem": True}):
Member

In which case should we turn this flag off?

Contributor Author

tir.merge_static_smem is set to False by default so that the generated code stays more readable (for example, keeping clear definitions of A_shared and B_shared instead of (half*)(buf_shmem + offset)), so it has to be enabled manually.
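
To make the tradeoff concrete, here is a toy contrast between the two codegen styles (buffer names, sizes and offsets are purely illustrative, not actual pass output):

// Flag off (default): every buffer keeps its own named, typed definition.
__global__ void readable_style(int* out) {
  __shared__ signed char A_shared[128];
  __shared__ signed char B_shared[128];
  A_shared[threadIdx.x] = 1;
  B_shared[threadIdx.x] = 2;
  __syncthreads();
  out[threadIdx.x] = A_shared[threadIdx.x] + B_shared[threadIdx.x];
}

// Flag on: one merged byte pool, so every access becomes a cast-plus-offset
// expression such as ((half*)(buf_shmem + offset))[i] in the general case.
__global__ void merged_style(int* out) {
  __shared__ __align__(16) unsigned char buf_shmem[256];
  ((signed char*)(buf_shmem + 0))[threadIdx.x]   = 1;  // was A_shared
  ((signed char*)(buf_shmem + 128))[threadIdx.x] = 2;  // was B_shared
  __syncthreads();
  out[threadIdx.x] = ((signed char*)(buf_shmem + 0))[threadIdx.x]
                   + ((signed char*)(buf_shmem + 128))[threadIdx.x];
}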

Member

Got it - so it's turned off basically for better readability, is my understanding correct?

Member

@junrushao left a comment

Overall LGTM!

@vinx13 merged commit e3216a6 into apache:unity Jan 5, 2024
@jinhongyii
Contributor

It seems that this PR was not merged in squash mode.

@vinx13
Member

vinx13 commented Jan 5, 2024

Oops, sorry, this was a mistake.

@masahi
Member

masahi commented Jan 5, 2024

This PR should have been sent to main.
