Fix test_barrier hang by using static global rank in ProcessGroupXCCL #2036
Conversation
Please wait on merging; I want to confirm the behavior in the scale-out test.
Pull Request Overview
This PR fixes a hang in the test_barrier test by making ProcessGroupXCCL use static global ranks consistently, matching the pattern used in ProcessGroupNCCL. The issue occurred when multiple threads used the same device ID for barrier operations, causing duplicate allreduce calls that could hang depending on execution order.
Key changes:
- Added a globalRank() method that returns a static global rank value (a sketch of this pattern follows below)
- Updated the device ID guessing logic to use the global rank instead of the thread rank
- Updated logging and debugging to reference the global rank consistently
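
The static-rank pattern the PR adopts can be illustrated with a small standalone sketch; the class and member names below are illustrative stand-ins, not the actual ProcessGroupXCCL code. A function-local static is initialized by the first process group constructed in the process, so sub-groups created later still report the process-wide rank rather than their group-local rank.

```cpp
// Minimal sketch of the static-global-rank idiom, using illustrative names.
#include <iostream>

class FakeProcessGroup {
 public:
  explicit FakeProcessGroup(int rank) : rank_(rank) {}

  // The function-local static is initialized exactly once, by whichever group
  // instance calls this first (normally the default/world group), so every
  // later call returns the process-wide rank.
  const int& globalRank() const {
    static int globalRank = rank_;
    return globalRank;
  }

  int getRank() const { return rank_; }  // group-local rank

 private:
  int rank_;
};

int main() {
  FakeProcessGroup world(3);  // e.g. this process is global rank 3
  FakeProcessGroup sub(1);    // same process, rank 1 inside a 2-rank sub-group
  world.globalRank();         // the first call pins the static to 3
  std::cout << "group rank: " << sub.getRank()
            << ", global rank: " << sub.globalRank() << "\n";  // prints 1 and 3
  return 0;
}
```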
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ProcessGroupXCCL.hpp | Added declaration for the globalRank() method |
| ProcessGroupXCCL.cpp | Implemented the globalRank() method and updated references to use it |
I've confirmed a hang still exists in the scale-out test, but it's due to a different issue, so I will address that in a separate issue/PR.
@frost-intel Could we merge this PR now?
@zhangxiaoli73 Yes, it's ready for merge.
@zhangxiaoli73 I don't have permission to merge. I assumed you did. @chuanqi129 @guangyey Can someone merge this?
NCCL uses globalRank() in a few other places not handled in this PR, in ProcessGroupNCCL::HeartbeatMonitor and ProcessGroupNCCL::Watchdog:
- https://github.com/pytorch/pytorch/blob/60f0a356fd3249fba5ebe2ee525f9b154f1197a9/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1726
- https://github.com/pytorch/pytorch/blob/60f0a356fd3249fba5ebe2ee525f9b154f1197a9/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2170
I don't see a watchdog in the XCCL code, but we do have HeartbeatMonitor, which still uses getRank() instead of globalRank():
dumpPipe.emplace(pg_->getRank());
@frost-intel: should this also be changed?
@dvrogozh Good catch. Watchdog is currently WIP as a 2.10 feature, but I've fixed this here.
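
For readers following along, a hedged sketch of what that one-line HeartbeatMonitor change amounts to is below; DumpPipe and PG are stand-in types for illustration, and the actual XCCL members may be named differently.

```cpp
// Sketch of the HeartbeatMonitor change discussed above (stand-in types only).
#include <iostream>
#include <optional>

struct DumpPipe {
  explicit DumpPipe(int rank) : rank_(rank) {}  // debug-dump pipe keyed by rank
  int rank_;
};

struct PG {
  int getRank() const { return 1; }     // group-local rank (e.g. inside a sub-group)
  int globalRank() const { return 3; }  // process-wide static rank
};

void initDumpPipe(const PG* pg, std::optional<DumpPipe>& dumpPipe) {
  // Before: dumpPipe.emplace(pg->getRank());  -- keyed by the group-local rank.
  // After the fix, the pipe is keyed by the process-wide rank, matching NCCL:
  dumpPipe.emplace(pg->globalRank());
}

int main() {
  PG pg;
  std::optional<DumpPipe> dumpPipe;
  initDumpPipe(&pg, dumpPipe);
  std::cout << "dump pipe keyed by rank " << dumpPipe->rank_ << "\n";  // rank 3
  return 0;
}
```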
Ok, we have the "get runner" curse which blocks the CI, and we need it resolved to let the PR go...
@frost-intel, do we expect any fixes in CI after this change? If not, is it possible to add CI test coverage?
Also, I believe these CI failures are unrelated, right?
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_op_xpu.TestQuantizedOpsXPU,test_add_scalar_relu_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_op_xpu.TestQuantizedOpsXPU,test_cat_nhwc_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_op_xpu.TestQuantizedOpsXPU,test_custom_module_multi_head_attention_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU,test_repeat_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_learnable_forward_per_channel_cuda_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_learnable_backward_per_channel_cuda_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_forward_per_channel_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_forward_per_tensor_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_learnable_forward_per_channel_cpu_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.test_foreach_xpu.TestForeachXPU,test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_xpu_complex128
@dvrogozh Those failures are unrelated; this change only impacts distributed workloads. This is a fix for a flaky distributed test that randomly hung. At one point, before a series of merge commits to keep up with main, this PR did pass CI.
LGTM
Fixes #1978
In ProcessGroupNCCL, globalRank() returns a static int globalRank, which is first initialized by the ProcessGroup setup, so the globalRank assigned to each thread matches the id of the CUDA device. However, we were not using this same pattern for XCCL. Instead, we were just using the assigned rank of the thread, which was not necessarily the same as the globalRank.
The failing test test_barrier created two separate groups of 2 ranks each, and then 4 threads called barrier, but all on the same 2-thread group. Since the device id is not initially specified in this barrier call, the thread attempts to "guess" the device index. In the previous code, this guess would be 0 or 1, since the rank of each thread was not actually the globalRank. In barrier, this guessed id was used to initialize XCCLComm objects and then call allreduce on a single-element tensor. However, this resulted in allreduce being called twice on each device, which could hang depending on the execution order of the 4 threads.
With the update in this PR, PGXCCL now uses the static globalRank in the same places as ProcessGroupNCCL, so the initialized XCCLComm objects are for unique devices and allreduce is not called on the same device multiple times.
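
The device-guessing difference can be made concrete with a short hedged sketch; the helper names and the 4-device count below are illustrative assumptions, not the actual ProcessGroupXCCL code or test configuration.

```cpp
// Illustrative sketch of why guessing the device from the group-local rank
// collides, while the static global rank maps each rank to a unique device.
// Assumes 4 ranks split into two 2-rank groups and 4 visible devices.
#include <iostream>

struct RankInfo {
  int groupRank;   // rank inside the (sub-)group: 0 or 1 in test_barrier
  int globalRank;  // process-wide static rank: 0..3
};

// Before the fix: two of the four ranks guess the same device (0 or 1), so
// barrier() ends up running allreduce twice on one device and can hang.
int guessDeviceIndexOld(const RankInfo& r, int deviceCount) {
  return r.groupRank % deviceCount;
}

// After the fix: the static global rank yields a unique device per rank.
int guessDeviceIndexNew(const RankInfo& r, int deviceCount) {
  return r.globalRank % deviceCount;
}

int main() {
  const int deviceCount = 4;
  const RankInfo ranks[4] = {{0, 0}, {1, 1}, {0, 2}, {1, 3}};  // two 2-rank groups
  for (const auto& r : ranks) {
    std::cout << "global rank " << r.globalRank
              << ": old guess = " << guessDeviceIndexOld(r, deviceCount)   // 0,1,0,1
              << ", new guess = " << guessDeviceIndexNew(r, deviceCount)   // 0,1,2,3
              << "\n";
  }
  return 0;
}
```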