@d4l3k d4l3k commented Apr 7, 2025

When multiple operations are running, we may end up calling getPair from multiple threads simultaneously. This causes a crash due to duplicate connect calls and concurrent state modifications.

This does add an additional synchronization point, but under most usage it should be very cheap: after the first connection we only check a single variable and return.
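The pattern described above can be sketched as a lock-guarded pair cache. This is a minimal illustration, not Gloo's actual C++ implementation; the `PairCache`, `Connection`, and `get_pair` names are hypothetical stand-ins. The point is that all callers serialize on one lock, so `connect()` runs exactly once per peer, and after that the critical section is just a dictionary lookup.

```python
import threading


class Connection:
    """Hypothetical stand-in for a Gloo pair/connection."""

    def __init__(self, rank: int):
        self.rank = rank


class PairCache:
    """Sketch of the fix: guard pair creation with a lock so that
    concurrent get_pair() calls cannot both call connect()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pairs = {}

    def get_pair(self, rank: int) -> Connection:
        # Every caller takes the lock, but after the first connection
        # the critical section is only a dict lookup, so it stays cheap.
        with self._lock:
            pair = self._pairs.get(rank)
            if pair is None:
                # "connect" happens exactly once per peer rank.
                pair = Connection(rank)
                self._pairs[rank] = pair
            return pair


cache = PairCache()
results = []
threads = [
    threading.Thread(target=lambda: results.append(cache.get_pair(0)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every thread must observe the same, single connection object.
assert all(r is results[0] for r in results)
```

Without the lock, two threads could both see an empty cache and each construct (and connect) their own pair, which is the duplicate-connect crash the fix prevents.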

Test plan:

pytorch/pytorch#150801

@facebook-github-bot

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot merged commit c610704 into main Apr 8, 2025
8 of 9 checks passed
@d4l3k d4l3k deleted the d4l3k/gloo_async_lock branch April 8, 2025 18:43
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 9, 2025
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`
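As a rough illustration of the two entry points mentioned above, here is a self-contained sketch of the lazy-init pattern. The `Device`, `ensure_connected`, and `create_device` names here are hypothetical stand-ins, not the real ProcessGroupGloo API; only the `TORCH_GLOO_LAZY_INIT` env var and the `lazy_init=True` keyword come from the description above. The idea is that with lazy init the connection setup is deferred until first use rather than done at construction time.

```python
import os


class Device:
    """Hypothetical stand-in for a Gloo transport device."""

    def __init__(self, lazy_init: bool = False):
        self.lazy_init = lazy_init
        self.connected = False

    def ensure_connected(self):
        # With lazy_init, the connect step runs on first use
        # instead of at process-group construction time.
        if not self.connected:
            self.connected = True


def create_device(lazy_init=None) -> Device:
    # Mirrors the described API surface: TORCH_GLOO_LAZY_INIT acts as
    # a default, and an explicit lazy_init argument overrides it.
    if lazy_init is None:
        lazy_init = os.environ.get("TORCH_GLOO_LAZY_INIT", "0") == "1"
    dev = Device(lazy_init=lazy_init)
    if not dev.lazy_init:
        dev.ensure_connected()  # eager path: connect immediately
    return dev


eager = create_device(lazy_init=False)
lazy = create_device(lazy_init=True)
assert eager.connected and not lazy.connected
lazy.ensure_connected()  # first use triggers the deferred connect
assert lazy.connected
```

Deferring the connect is what makes the getPair locking in pytorch/gloo#427 necessary: once setup can happen on first use, multiple operations may race to establish the same connection.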

This is still a draft PR, as there is one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first; it depends on pytorch/gloo#427 landing first.

This also updates the gloo submodule to include the required changes.

Test plan:

added lazy init test variants

```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```

Pull Request resolved: #150801
Approved by: https://github.com/fduwjj
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025