Fix test_barrier hang by using static global rank in ProcessGroupXCCL #2036
Conversation
Please wait on merging; I want to confirm the behavior in the scale-out test.
Pull Request Overview
This PR fixes a hang in the test_barrier test by making ProcessGroupXCCL use static global ranks consistently, matching the pattern used in ProcessGroupNCCL. The issue occurred when multiple threads used the same device ID for barrier operations, causing duplicate allreduce calls that could hang depending on execution order.
Key changes:
- Added a globalRank() method that returns a static global rank value (a sketch of this pattern follows below)
- Updated the device ID guessing logic to use the global rank instead of the thread rank
- Updated logging and debugging to reference the global rank consistently
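
The static-rank pattern the PR adopts can be illustrated with a small standalone sketch; the class and member names below are illustrative stand-ins, not the actual ProcessGroupXCCL code. A function-local static is initialized by the first process group constructed in the process, so sub-groups created later still report the process-wide rank rather than their group-local rank.

```cpp
// Minimal sketch of the static-global-rank idiom, using illustrative names.
#include <iostream>

class FakeProcessGroup {
 public:
  explicit FakeProcessGroup(int rank) : rank_(rank) {}

  // The function-local static is initialized exactly once, by whichever group
  // instance calls this first (normally the default/world group), so every
  // later call returns the process-wide rank.
  const int& globalRank() const {
    static int globalRank = rank_;
    return globalRank;
  }

  int getRank() const { return rank_; }  // group-local rank

 private:
  int rank_;
};

int main() {
  FakeProcessGroup world(3);  // e.g. this process is global rank 3
  FakeProcessGroup sub(1);    // same process, rank 1 inside a 2-rank sub-group
  world.globalRank();         // the first call pins the static to 3
  std::cout << "group rank: " << sub.getRank()
            << ", global rank: " << sub.globalRank() << "\n";  // prints 1 and 3
  return 0;
}
```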
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ProcessGroupXCCL.hpp | Added declaration for the globalRank() method |
| ProcessGroupXCCL.cpp | Implemented the globalRank() method and updated references to use it |
I've confirmed a hang still exists in the scale-out test, but it's due to a different issue, so I will address that in a separate issue/PR.
@frost-intel Could we merge this PR now?
@zhangxiaoli73 Yes, it's ready for merge.
@zhangxiaoli73 I don't have permission to merge. I assumed you did. @chuanqi129 @guangyey Can someone merge this?
NCCL uses globalRank() in a few other places not handled in this PR, in ProcessGroupNCCL::HeartbeatMonitor and ProcessGroupNCCL::Watchdog:
- https://github.com/pytorch/pytorch/blob/60f0a356fd3249fba5ebe2ee525f9b154f1197a9/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1726
- https://github.com/pytorch/pytorch/blob/60f0a356fd3249fba5ebe2ee525f9b154f1197a9/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2170
I don't see a watchdog in the XCCL code, but we do have HeartbeatMonitor, which still uses getRank() instead of globalRank():
dumpPipe.emplace(pg_->getRank());
@frost-intel: should this also be changed?
@dvrogozh Good catch. Watchdog is currently WIP as a 2.10 feature, but I've fixed this here.
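
For readers following along, a hedged sketch of what that one-line HeartbeatMonitor change amounts to is below; DumpPipe and PG are stand-in types for illustration, and the actual XCCL members may be named differently.

```cpp
// Sketch of the HeartbeatMonitor change discussed above (stand-in types only).
#include <iostream>
#include <optional>

struct DumpPipe {
  explicit DumpPipe(int rank) : rank_(rank) {}  // debug-dump pipe keyed by rank
  int rank_;
};

struct PG {
  int getRank() const { return 1; }     // group-local rank (e.g. inside a sub-group)
  int globalRank() const { return 3; }  // process-wide static rank
};

void initDumpPipe(const PG* pg, std::optional<DumpPipe>& dumpPipe) {
  // Before: dumpPipe.emplace(pg->getRank());  -- keyed by the group-local rank.
  // After the fix, the pipe is keyed by the process-wide rank, matching NCCL:
  dumpPipe.emplace(pg->globalRank());
}

int main() {
  PG pg;
  std::optional<DumpPipe> dumpPipe;
  initDumpPipe(&pg, dumpPipe);
  std::cout << "dump pipe keyed by rank " << dumpPipe->rank_ << "\n";  // rank 3
  return 0;
}
```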
Ok, we have the "get runner" curse which blocks the CI, and we need it resolved to let the PR go...
@frost-intel, do we expect any fixes in CI after this change? If not, is it possible to add CI test coverage?
Also, I believe these CI failures are unrelated, right?
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_op_xpu.TestQuantizedOpsXPU,test_add_scalar_relu_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_op_xpu.TestQuantizedOpsXPU,test_cat_nhwc_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_op_xpu.TestQuantizedOpsXPU,test_custom_module_multi_head_attention_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_quantized_tensor_xpu.TestQuantizedTensorXPU,test_repeat_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_learnable_forward_per_channel_cuda_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_learnable_backward_per_channel_cuda_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_forward_per_channel_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_forward_per_tensor_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU,test_learnable_forward_per_channel_cpu_xpu
op_ut,third_party.torch-xpu-ops.test.xpu.test_foreach_xpu.TestForeachXPU,test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_xpu_complex128
@dvrogozh Those failures are unrelated; this change only impacts distributed workloads. This is a fix for a flaky distributed test that randomly hung. At one point, before a series of merge commits to keep up with main, this PR did pass CI.
LGTM
Fixes #1978
In ProcessGroupNCCL, globalRank() returns a static int globalRank, which is first initialized by the ProcessGroup setup, so the globalRank assigned to each thread matches the id of the CUDA device. However, we were not using this same pattern for XCCL. Instead, we were just using the assigned rank of the thread, which was not necessarily the same as the globalRank.
The failing test test_barrier created two separate groups of 2 ranks each, and then 4 threads called barrier, but all on the same 2-thread group. Since the device id is not initially specified in this barrier call, the thread attempts to "guess" the device index. In the previous code, this guess would be 0 or 1, since the rank of each thread was not actually the globalRank. In barrier, this guessed id was used to initialize XCCLComm objects and then call allreduce on a single-element tensor. However, this resulted in allreduce being called twice on each device, which could hang depending on the execution order of the 4 threads.
With the update in this PR, PGXCCL now uses the static globalRank in the same places as ProcessGroupNCCL, so the initialized XCCLComm objects are for unique devices and allreduce is not called on the same device multiple times.
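
The device-guessing difference can be made concrete with a short hedged sketch; the helper names and the 4-device count below are illustrative assumptions, not the actual ProcessGroupXCCL code or test configuration.

```cpp
// Illustrative sketch of why guessing the device from the group-local rank
// collides, while the static global rank maps each rank to a unique device.
// Assumes 4 ranks split into two 2-rank groups and 4 visible devices.
#include <iostream>

struct RankInfo {
  int groupRank;   // rank inside the (sub-)group: 0 or 1 in test_barrier
  int globalRank;  // process-wide static rank: 0..3
};

// Before the fix: two of the four ranks guess the same device (0 or 1), so
// barrier() ends up running allreduce twice on one device and can hang.
int guessDeviceIndexOld(const RankInfo& r, int deviceCount) {
  return r.groupRank % deviceCount;
}

// After the fix: the static global rank yields a unique device per rank.
int guessDeviceIndexNew(const RankInfo& r, int deviceCount) {
  return r.globalRank % deviceCount;
}

int main() {
  const int deviceCount = 4;
  const RankInfo ranks[4] = {{0, 0}, {1, 1}, {0, 2}, {1, 3}};  // two 2-rank groups
  for (const auto& r : ranks) {
    std::cout << "global rank " << r.globalRank
              << ": old guess = " << guessDeviceIndexOld(r, deviceCount)   // 0,1,0,1
              << ", new guess = " << guessDeviceIndexNew(r, deviceCount)   // 0,1,2,3
              << "\n";
  }
  return 0;
}
```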