Add UCCL backend integration for NIXL #895
base: main
Conversation
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
👋 Hi praveingk! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
src/plugins/uccl/README.md
> ## Usage Guide
>
> ### Additional Parameters
> 1. `device_idx` : Specifies which GPU the UCCL engine will be affined to.
Does this mean that UCCL does not support multiple GPUs within a single engine or process?
Currently, the UCCL backend does not support multiple GPUs within a single engine; however, it can be enabled depending on the use case. With the current vLLM NIXL connector, each worker has its own NIXL connector, so multi-GPU support was not enabled here.
In this configuration, UCCL uses the device_idx (GPU ID) to determine, from the PCIe topology, which NIC the UCCL engine will be launched on.
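For context, a minimal sketch of how such a parameter could be passed when instantiating the backend through the NIXL agent API; the parameter key `device_idx`, the agent name, and the exact config constructor are assumptions for illustration, not taken verbatim from this PR:

```cpp
// Minimal sketch, not the PR's code: create a NIXL agent and instantiate the
// UCCL backend pinned to GPU 0 via a backend parameter (key name assumed).
#include <nixl.h>

int main() {
    nixlAgentConfig cfg(true);            // progress-thread flag (assumed usage)
    nixlAgent agent("initiator", cfg);

    nixl_b_params_t params;               // string -> string backend parameters
    params["device_idx"] = "0";           // GPU the UCCL engine is affined to

    nixlBackendH *uccl = nullptr;
    if (agent.createBackend("UCCL", params, uccl) != NIXL_SUCCESS)
        return 1;                         // plugin missing or initialization failed
    return 0;
}
```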
Can the device ID be found implicitly, without introducing a new parameter (e.g. by querying CUDA, NVML, etc.)?
Sure 👍🏽 I have just raised an issue in UCCL uccl-project/uccl#487 and will start working on that. I can remove the device_idx requirement for now.
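As a rough illustration of the direction discussed here (an assumption about how it could be done, not what uccl-project/uccl#487 will actually implement), the GPU's PCIe bus ID can be queried from the CUDA runtime and then matched against NIC sysfs paths:

```cpp
// Sketch only: look up the current GPU's PCIe bus ID so the nearest NIC can be
// chosen from the PCIe topology instead of requiring an explicit device_idx.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0;
    cudaGetDevice(&dev);                                  // GPU of the current context
    char bus_id[32] = {0};
    cudaDeviceGetPCIBusId(bus_id, sizeof(bus_id), dev);   // e.g. "0000:3b:00.0"
    // bus_id could then be compared against /sys/class/infiniband/<nic>/device
    // to pick the closest NIC for the UCCL engine.
    std::printf("GPU %d PCIe bus id: %s\n", dev, bus_id);
    return 0;
}
```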
> @@ -0,0 +1,676 @@
> /*
>  * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Do you need to mention your copyright? (Also relevant for other files.)
I added the copyright based on the other plugins. I can change it accordingly if that is OK.
>  * limitations under the License.
>  */
>
> #include <algorithm>
General tests in test/nixl/test_plugin.cpp are also worth enabling.
Sure 👍🏽 I will look into that.
I have added UCCL to test_plugin.cpp.
General question: What is the added value of the UCCL plugin? I notice quite a few missing features (intra-node support, progress thread, multi-GPU).
@brminich Thanks a lot for your review. UCCL's added value is its extensibility: it provides a way to program the control logic of the GPU networking stack to enable newer congestion-control protocols, loss-recovery features, and multi-path transport, which are usually not easily programmable because they require NIC hardware/config changes. This is possible because UCCL runs the control logic on the CPU. We have seen tremendous performance improvements when running UCCL compared to NCCL, and I am currently seeing a 10% TTFT improvement with the UCCL backend in my preliminary evaluation between two cross-rack nodes. Additionally, UCCL provides a unified stack for collectives, P2P, and EP, which allows the control logic to be customized per communication type. UCCL also brings heterogeneous GPU and NIC vendor support, and additional transport types like TCP and TCP-X are being added to UCCL P2P. I will be adding intra-node support (already supported by UCCL) and progress-thread support to the UCCL backend in a separate PR in the coming days.
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Signed-off-by: Pravein Govindan Kannan <[email protected]>
Collectives are not utilized by NIXL. Could you please elaborate a bit more on the aspects of extensibility and congestion control? How would these be leveraged in NIXL use cases, particularly in terms of performance?
UCCL currently provides various congestion-control algorithms, both sender-driven (TIMELY, SWIFT) and receiver-driven (EQDS), which can be chosen based on workload characteristics. For a NIXL-specific use case such as PD disaggregation, receiver-driven congestion control could help when specific decode pods see bursty incast traffic from multiple prefill pods.

In addition, UCCL emulates packet spraying to leverage the available network paths and avoid a single path of congestion, while running congestion control on each path separately. At scale, this can spread traffic evenly across the network core, alleviating congestion, packet loss, and tail latency, all of which directly contribute to TTFT in PD disaggregation. In my preliminary experiments, the UCCL backend achieved 10% lower TTFT than UCX when prefill/decode pods were allocated cross-rack, since it performs flow-splitting of a single 3 GB KV-cache message into smaller chunks that are spread across the available paths.

Finally, UCCL separates the optimization logic from the heterogeneous GPU/NIC/transport hardware logic, so the same optimizations can be applied to different NICs (currently tested with NVIDIA and Broadcom NICs) and different transport types (AF_XDP-based user-space TCP, RDMA, EFA, GPU Direct TCP-X, etc.). In this first version, the UCCL backend supports RDMA; adding more transport types is on the UCCL P2P agenda (AF_XDP-based user-space TCP and Amazon's EFA are currently available in UCCL Collectives). Since the optimizations run on the CPU, they can evolve as the workloads and transports evolve.
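To make the flow-splitting idea above concrete, here is an illustrative sketch (not UCCL's actual code or data structures): a large message is cut into fixed-size chunks that are assigned round-robin across the available paths, each of which would run its own congestion-control loop.

```cpp
// Illustrative sketch only: split one large transfer into fixed-size chunks
// and spray them round-robin across N paths.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk {
    size_t offset;   // byte offset within the original message
    size_t len;      // chunk length in bytes
    int    path_id;  // which path/QP this chunk is sprayed onto
};

std::vector<Chunk> split_across_paths(size_t msg_bytes, size_t chunk_bytes, int num_paths) {
    std::vector<Chunk> chunks;
    int path = 0;
    for (size_t off = 0; off < msg_bytes; off += chunk_bytes) {
        size_t len = std::min(chunk_bytes, msg_bytes - off);
        chunks.push_back({off, len, path});
        path = (path + 1) % num_paths;   // round-robin spraying; a real system
                                         // would also react to per-path congestion
    }
    return chunks;
}

int main() {
    // e.g. a 3 GB KV-cache message split into 1 MB chunks over 4 paths
    auto chunks = split_across_paths(3UL << 30, 1UL << 20, 4);
    return chunks.empty() ? 1 : 0;
}
```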
Signed-off-by: Pravein Govindan Kannan <[email protected]>
What?
Added UCCL backend support to NIXL.
Why?
UCCL is an efficient communication library for GPUs, supporting P2P transfers, collectives, and EP. UCCL focuses on flexibility for fast-evolving ML workloads and on portability for connecting heterogeneous GPU/NIC vendors. It provides a software transport stack that runs on the CPU and is easily extensible to support different communication optimization techniques such as congestion control, multipathing, and efficient loss recovery.
UCCL supports collectives, P2P communication, and GPU-driven communication for expert parallelism.
How?
gtest/plugins to test basic xfer.