Support sub warp reduction for CUDA target. #10207
Conversation
Do you have any performance results? Also please add test cases.
Sure, below is the measured time of the kernel:

```python
@T.prim_func
def reduce(a: T.handle, b: T.handle, n: T.int32) -> None:
    A = T.match_buffer(a, [1048576, n])
    B = T.match_buffer(b, [1048576])
    for i, j in T.grid(1048576, n):
        with T.block("reduce"):
            vi, vj = T.axis.remap("SR", [i, j])
            with T.init():
                B[vi] = 0.
            B[vi] = B[vi] + A[vi, vj]
```

with `n` varied over 2, 4, 8, 16, 32. There is some variance across multiple runs. Time evaluated with TVM's native …
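As a plain-Python reference for what this kernel computes (a row-wise sum over the "R" axis; shapes shrunk from 1048576 rows purely for illustration — the `row_sum` helper is ours, not TVM's):

```python
# Reference semantics of the "reduce" block above: B[i] = sum over j of A[i, j].
# Shapes are shrunk (3 rows instead of 1048576) purely for illustration.
def row_sum(A):
    """Mirror the S/R axis roles: i is spatial, j is the reduction axis."""
    B = []
    for row in A:          # spatial axis i ("S")
        acc = 0.0          # T.init(): B[vi] = 0.
        for x in row:      # reduction axis j ("R")
            acc += x       # B[vi] = B[vi] + A[vi, vj]
        B.append(acc)
    return B

print(row_sum([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 7.0, 11.0]
```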
CC @MasterJH5574 I believe you are interested.
Some other notes. In the following case:

```python
@T.prim_func
def reduce(a: T.handle, b: T.handle, n: T.int32) -> None:
    A = T.match_buffer(a, [1, 4, 8])
    B = T.match_buffer(b, [1, 4])
    for i, j, k in T.grid(1, 4, 8):
        with T.block("reduce"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                B[vi, vj] = 0.
            B[vi, vj] = B[vi, vj] + A[vi, vj, vk]
```

we bind … Below is an example of the generated code:

```cuda
extern "C" __global__ void __launch_bounds__(32) default_function_kernel0(float* __restrict__ A, float* __restrict__ B) {
  float red_buf0[1];
  uint mask[1];
  float t0[1];
  red_buf0[(0)] = A[(((((int)threadIdx.y) * 8) + ((int)threadIdx.x)))];
  mask[(0)] = (__activemask() & ((uint)(255 << (((int)threadIdx.y) * 8))));
  t0[(0)] = __shfl_down_sync(mask[(0)], red_buf0[(0)], 4, 32);
  red_buf0[(0)] = (red_buf0[(0)] + t0[(0)]);
  t0[(0)] = __shfl_down_sync(mask[(0)], red_buf0[(0)], 2, 32);
  red_buf0[(0)] = (red_buf0[(0)] + t0[(0)]);
  t0[(0)] = __shfl_down_sync(mask[(0)], red_buf0[(0)], 1, 32);
  red_buf0[(0)] = (red_buf0[(0)] + t0[(0)]);
  red_buf0[(0)] = __shfl_sync(mask[(0)], red_buf0[(0)], (((int)threadIdx.y) * 8), 32);
  B[(((int)threadIdx.y))] = red_buf0[(0)];
}
```

Another thing worth noting is that we can only allow cross-warp reduction by shuffle-down, thus the warp size must be a multiple of …
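To illustrate what the generated kernel does within one 8-lane sub-group, here is a hypothetical Python simulation of the masked shuffle-down tree (the `shfl_down` and `subwarp_reduce` helpers are ours, not part of TVM; out-of-range lanes keep their own value, matching `__shfl_down_sync` when the source lane falls outside the width):

```python
# Hypothetical simulation of the emitted kernel's shuffle-down tree:
# 32 lanes split into 4 sub-groups of 8 (threadIdx.y selects the group);
# mask = 255 << (group * 8) restricts the shuffle to one sub-group.
def shfl_down(vals, mask, delta, group, width=8):
    """Lane i reads lane i+delta if the source is inside the masked group;
    otherwise the lane keeps its own value (as __shfl_down_sync does)."""
    out = vals[:]
    lo, hi = group * width, (group + 1) * width
    for lane in range(lo, hi):
        src = lane + delta
        if src < hi and (mask >> src) & 1:
            out[lane] = vals[src]
    return out

def subwarp_reduce(vals, group, width=8):
    mask = ((1 << width) - 1) << (group * width)  # 255 << (threadIdx.y * 8)
    buf = vals[:]
    for delta in (4, 2, 1):                       # offsets 4, 2, 1 as in the kernel
        t = shfl_down(buf, mask, delta, group, width)
        for lane in range(group * width, (group + 1) * width):
            buf[lane] = buf[lane] + t[lane]
    return buf[group * width]                     # leader lane holds the group sum

print(subwarp_reduce(list(range(32)), group=1))   # 92 (= sum of lanes 8..15)
```

Only the leader lane's final value matters; in the real kernel `__shfl_sync` then broadcasts it back to the whole group before the store to `B`.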
Interesting. Looks like the perf improvement isn't very much? Only when …

BTW do we have this requirement in the codebase now?
My typo, I have fixed it. Another benefit of using shuffle-down is reduced shared memory usage, which increases the number of blocks that can be executed concurrently.
@MasterJH5574 yes, there is a notion of …
* upd * upd * upd * lint * fix * upd docstring * upd
Previously the `LowerThreadAllReduce` pass would only emit code that uses `shfl_down` when the reduce extent equals the warp size; when the reduce extent is less than the warp size, the codegen falls back to emitting code that uses shared memory, which is not efficient. Since CUDA supports sub-warp reduction by specifying the mask, we can still use the shuffle-down approach for reduction by changing the mask.

Example code:
Emitted code before this PR:
Emitted code after this PR:
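The emitted-code listings did not survive extraction here. As a rough, hypothetical illustration of the change (the `reduce_mask` helper is ours, not TVM's actual code): before this PR the pass required the reduce extent to equal the warp size, i.e. a full 32-bit mask; after it, a sub-warp extent `n` gets an `n`-bit mask shifted to its group's lanes:

```python
# Hypothetical sketch (not TVM's actual code) of the mask selection logic:
# a full-warp reduction uses the full 32-bit mask, while a sub-warp
# reduction of extent n uses an n-bit mask shifted to the group's lanes.
WARP_SIZE = 32

def reduce_mask(extent, group):
    """Return the __shfl_down_sync mask for one reduction group."""
    assert WARP_SIZE % extent == 0, "warp size must be a multiple of the extent"
    return ((1 << extent) - 1) << (group * extent)

print(hex(reduce_mask(32, 0)))  # 0xffffffff  (full-warp case, as before the PR)
print(hex(reduce_mask(8, 2)))   # 0xff0000    (sub-warp case enabled by the PR)
```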
Future work
CUDA 11 supports cooperative-group reduction, which we could use directly.
cc @vinx13 @junrushao1994