Fix test_barrier hang by using static global rank in ProcessGroupXCCL (#2036)
Fixes #1978
In ProcessGroupNCCL, `globalRank()` returns a static int that is initialized once during ProcessGroup setup, so the global rank assigned to each thread matches the id of its CUDA device. ProcessGroupXCCL did not follow this pattern: it used the thread's assigned (sub-group) rank, which is not necessarily the same as the global rank.
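For context, here is a minimal, standalone sketch of that static-global-rank idiom; the `FakeProcessGroup` type and its `rank_` member are illustrative stand-ins rather than the actual ProcessGroupNCCL/XCCL classes:

```cpp
#include <iostream>

// Illustrative stand-in for a process group. The first instance to call
// globalRank() latches its rank into a function-local static, so instances
// created later for sub-groups still report the original global rank.
struct FakeProcessGroup {
  explicit FakeProcessGroup(int rank) : rank_(rank) {}

  const int& globalRank() const {
    static int globalRank = rank_;  // initialized exactly once
    return globalRank;
  }

  int rank_;
};

int main() {
  FakeProcessGroup defaultGroup(/*rank=*/3);       // set up first, e.g. global rank 3
  std::cout << defaultGroup.globalRank() << "\n";  // 3

  FakeProcessGroup subGroup(/*rank=*/1);           // rank 1 within a 2-rank sub-group
  std::cout << subGroup.globalRank() << "\n";      // still 3, not 1
  return 0;
}
```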
The failing test `test_barrier` created two separate groups of 2 ranks each, and then 4 threads called barrier, all on the same 2-thread group. Since no device id is specified in this barrier call, each thread attempts to "guess" the device index. In the previous code this guess was 0 or 1, because each thread's rank was its sub-group rank rather than its global rank. `barrier` used the guessed id to initialize XCCLComm objects and then called allreduce on a single-element tensor. As a result, allreduce was issued twice on each device, which could hang depending on the execution order of the 4 threads.
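A hedged sketch of why the old guess collides, assuming 4 threads (global ranks 0-3), two 2-rank sub-groups, and a simple modulo-style guess; the helper name and the exact guessing logic in ProcessGroupXCCL are assumptions:

```cpp
#include <cstdio>

// Hypothetical old-style device guess: based on the thread's sub-group rank.
int guessDeviceIdFromSubGroupRank(int subGroupRank, int numDevices) {
  return subGroupRank % numDevices;
}

int main() {
  const int numDevices = 4;
  // 4 threads, global ranks 0..3, each in a 2-rank sub-group -> sub-group rank 0 or 1.
  const int subGroupRank[4] = {0, 1, 0, 1};
  for (int globalRank = 0; globalRank < 4; ++globalRank) {
    std::printf("thread (global rank %d) guesses device %d\n",
                globalRank,
                guessDeviceIdFromSubGroupRank(subGroupRank[globalRank], numDevices));
  }
  // Output: devices 0, 1, 0, 1 -- two threads per device, so each device sees
  // two allreduce calls and the barrier can hang depending on thread order.
  return 0;
}
```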
With the update in this PR, ProcessGroupXCCL uses the static global rank in the same places as ProcessGroupNCCL, so the XCCLComm objects are initialized for unique devices and allreduce is no longer called on the same device multiple times.
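And a sketch of the fixed guess, using the static global rank instead; again the helper is hypothetical and only illustrates the mapping described above:

```cpp
#include <cstdio>

// Hypothetical fixed device guess: based on the static global rank, as in
// ProcessGroupNCCL, so each of the 4 threads lands on a distinct device.
int guessDeviceIdFromGlobalRank(int globalRank, int numDevices) {
  return globalRank % numDevices;
}

int main() {
  const int numDevices = 4;
  for (int globalRank = 0; globalRank < 4; ++globalRank) {
    std::printf("thread (global rank %d) guesses device %d\n",
                globalRank,
                guessDeviceIdFromGlobalRank(globalRank, numDevices));
  }
  // Output: devices 0, 1, 2, 3 -- one XCCLComm per device, a single allreduce
  // per device, and no hang.
  return 0;
}
```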