[BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend #127956

XilunWu · 2024-06-04T20:32:36Z

Stack from ghstack (oldest at bottom):

Summary
The use of TORCH_ERROR in the TCPStore libuv backend code needs to be updated.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

Differential Revision: D58259589

[ghstack-poisoned]

pytorch-bot · 2024-06-04T20:32:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127956

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Rebase your PRs: Unstable CUDA signal in CI caused by cudnn 9 update

❌ 1 New Failure, 11 Unrelated Failures

As of commit 59f8ba7 with merge base 597922b:

NEW FAILURE - The following job has failed:

trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable) (gh)
'test/dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_nested_list_out_dynamic_shapes'

FLAKY - The following job failed but was likely due to flakiness present on trunk:

linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test (gh) (similar failure)
ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory

BROKEN TRUNK - The following jobs failed but were also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
FAILED
pull / linux-focal-py3.11-clang10 / test (dynamo, 1, 3, linux.2xlarge) (gh) (trunk failure)
FAILED
pull / linux-focal-py3.11-clang10 / test (dynamo, 2, 3, linux.2xlarge) (gh) (trunk failure)
FAILED
pull / linux-focal-py3.12-clang10 / test (dynamo, 1, 3, linux.2xlarge) (gh) (trunk failure)
FAILED
pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh) (trunk failure)
FAILED
pull / linux-focal-py3.8-clang10 / test (dynamo, 1, 3, linux.2xlarge) (gh) (trunk failure)
FAILED
pull / linux-focal-py3.8-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh) (trunk failure)
FAILED

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

linux-binary-manywheel / manywheel-py3_8-cuda12_1-test / test (gh) (#127288)
ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory
linux-binary-manywheel / manywheel-py3_8-cuda12_4-test / test (gh) (#127289)
ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory
pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu, unstable) (gh)
FAILED

This comment was automatically generated by Dr. CI and updates every 15 minutes.


XilunWu · 2024-06-05T18:45:40Z

@pytorchbot merge

pytorchmergebot · 2024-06-05T18:47:36Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-06-05T20:43:25Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable)

Details for Dev Infra team

Raised by workflow job

XilunWu · 2024-06-06T21:33:44Z

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

XilunWu · 2024-06-06T23:27:34Z

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

XilunWu · 2024-06-07T06:02:45Z

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mrshenli

**Summary**
This PR switches the default TCPStore server backend to a new implementation that uses [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:

<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">

We hope this improvement gives users a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.

**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` unless users explicitly ask for the old TCPStore server. The change should not be noticeable to users apart from significantly faster TCPStore startup for large-scale jobs.

Note that the libuv backend does not yet support the initialization approach where the user passes in a socket. We plan to support it as a next step, but we chose to disable it until it is fully tested. If you initialize TCPStore this way, see the next section to keep using the old TCPStore server.

**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there are three ways:
1. If the user instantiates the TCPStore object directly, they can pass the argument `use_libuv=False` to use the old TCPStore server backend, e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling the default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, set the environment variable `USE_LIBUV` to `"0"` when launching.

These three approaches are listed in order of precedence. For example, if the user specifies `use_libuv=0` in `init_method` and also sets the environment variable `USE_LIBUV="1"`, the former takes effect and the TCPStore backend instantiated is the old one instead of the libuv one.

**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with the `oncall: distributed` label.

**Test Plan**
`pytest test/distributed/test_store.py`

<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">

note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`

<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

**TODO**
1. Update the docs at
   - https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
   - https://pytorch.org/docs/stable/distributed.html#tcp-initialization
2. Make torch elastic rendezvous use the libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py`
3. Test whether the libuv backend works when initialized with a socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman

Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: #127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
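The precedence among the three fallback options in the commit message above can be illustrated with a small, pure-Python sketch. This is not PyTorch's actual implementation: `resolve_use_libuv` is a hypothetical helper, and treating only the literal value `"1"` as "enabled" is an assumption made for illustration. Only the option names (`use_libuv`, `USE_LIBUV`) and their ordering come from the commit message.

```python
import os
from typing import Mapping, Optional
from urllib.parse import parse_qs, urlparse


def resolve_use_libuv(
    ctor_arg: Optional[bool] = None,
    init_method: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    default: bool = True,
) -> bool:
    """Hypothetical helper: choose the backend by the stated precedence."""
    if env is None:
        env = os.environ
    # 1. An explicit constructor argument (use_libuv=...) wins.
    if ctor_arg is not None:
        return ctor_arg
    # 2. Next, a `use_libuv` query parameter in the init_method URL.
    if init_method is not None:
        query = parse_qs(urlparse(init_method).query)
        if "use_libuv" in query:
            # Assumption: "1" means libuv, anything else means the old server.
            return query["use_libuv"][-1] == "1"
    # 3. Next, the USE_LIBUV environment variable.
    if "USE_LIBUV" in env:
        return env["USE_LIBUV"] == "1"
    # 4. Otherwise the default, which this commit sets to libuv.
    return default


if __name__ == "__main__":
    # The case from the commit message: the URL option overrides the env var.
    choice = resolve_use_libuv(
        init_method="tcp://localhost:29500?use_libuv=0",
        env={"USE_LIBUV": "1"},
    )
    print("libuv" if choice else "old TCPStore server")
```

Passing `env` explicitly keeps the sketch deterministic; leaving it out falls back to the process environment.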

…127956) **Summary** The use of TORCH_ERROR in TCPStore libuv backend code needs update. Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589) Pull Request resolved: pytorch#127956 Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever


[BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend

857e96e

[ghstack-poisoned]

pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (c10d)` labels Jun 4, 2024

XilunWu mentioned this pull request Jun 4, 2024

[c10d][TCPStore] make TCPStore server use libuv by default #127957

Closed

XilunWu requested review from cyyever, wconstab and kwen2501 June 4, 2024 20:36

XilunWu added the `better-engineering` and `topic: not user facing` labels Jun 4, 2024

XilunWu requested a review from kurman June 4, 2024 20:38

shuqiangzhang approved these changes Jun 4, 2024


cyyever approved these changes Jun 5, 2024


pytorch-bot bot added the `ciflow/trunk` label Jun 5, 2024

pytorchmergebot added the merging label Jun 5, 2024

pytorchmergebot removed the merging label Jun 5, 2024

XilunWu mentioned this pull request Jun 6, 2024

[TorchElastic] make libuv TCPStore as default in torch elastic rendezvous #128168

Closed

XilunWu requested a review from wz337 June 7, 2024 06:06

cyyever approved these changes Jun 7, 2024


pytorchmergebot closed this in 6c824cd Jun 7, 2024

pytorchmergebot added the Merged label Jun 7, 2024

github-actions bot deleted the gh/XilunWu/83/head branch July 8, 2024 01:56
