Skip to content

Conversation

@MasterJH5574
Copy link
Contributor

PR #15327 and #15373 introduced multi-warp allreduce implementation. At the time of the introduction, I tested the correctness numerically via the workload of "taking a matrix of ones as input, computing the summation over each row". Both PR passed this numerical tess, while I didn't realize that this test is not complete and cannot guarantee the correctness.

The previous implementation has bug which can be tested by turning the input matrix from ones to random floating-point numbers. This will expose the issues of the previous implementation.

Therefore, this PR fixes the issues, and add the numerical tests for multi-warp allreduce into test_allreduce_cuda.py. By reducing some of the redundant tests in that file, we hope this can reduce the testing time a bit while still guarantee the correctness.

Sorry for not testing the implementation completely before.

@tvm-bot
Copy link
Collaborator

tvm-bot commented Jul 25, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

PR apache#15327 and apache#15373 introduced multi-warp allreduce implementation.
At the time of the introduction, I tested the correctness numerically
via the workload of "taking a matrix of ones as input, computing the
summation over each row". Both PR passed this numerical tess, while
I didn't realize that this test is not complete and cannot guarantee
the correctness.

The previous implementation has bug which can be tested by turning
the input matrix from ones to random floating-point numbers. This will
expose the issues of the previous implementation.

Therefore, this PR fixes the issues, and add the numerical tests
for multi-warp allreduce into `test_allreduce_cuda.py`. By reducing
some of the redundant tests in that file, we hope this can reduce the
testing time a bit while still guarantee the correctness.

Sorry for not testing the implementation completely before.
@tqchen tqchen merged commit 236eb31 into apache:main Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants