
Conversation

@praveingk

What?

Added UCCL backend support to NIXL.

Why?

UCCL is an efficient communication library for GPUs, supporting P2P transfers, collectives, and EP (expert parallelism). UCCL focuses on flexibility for fast-evolving ML workloads and on portability for connecting heterogeneous GPU/NIC vendors. It provides a software transport stack that runs on the CPU and is easily extensible to support different communication optimization techniques such as congestion control, multipathing, and efficient loss recovery.
UCCL supports collectives, P2P communication, and GPU-driven communication for expert parallelism.

How?

  1. Added the basic UCCL plugin for inter-node transfers over RDMA, with further enhancements on the roadmap.
  2. Added a test in gtest/plugins to exercise a basic transfer.
  3. Provided references for using the UCCL backend.

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@praveingk praveingk requested a review from a team as a code owner October 13, 2025 08:47
@copy-pr-bot

copy-pr-bot bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi praveingk! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

## Usage Guide

### Additional Parameters
1. `device_idx`: Specifies which GPU the UCCL engine will be pinned to.
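
For reference, a minimal sketch of selecting the UCCL backend through the NIXL agent API might look like the following; the exact C++ signatures (`nixlAgent`, `createBackend`, `nixl_b_params_t`) are assumed from the public NIXL headers and may differ slightly:

```cpp
// Minimal sketch (assumed NIXL API surface; consult nixl.h for the
// authoritative signatures): create an agent and attach the UCCL backend,
// passing the GPU index documented above via "device_idx".
#include <iostream>
#include "nixl.h"

int main() {
    nixlAgentConfig cfg(/*useProgThread=*/false);
    nixlAgent agent("uccl_demo_agent", cfg);

    nixl_b_params_t params;          // string key/value backend parameters
    params["device_idx"] = "0";      // pin the UCCL engine to GPU 0

    nixlBackendH *backend = nullptr;
    nixl_status_t status = agent.createBackend("UCCL", params, backend);
    if (status != NIXL_SUCCESS) {
        std::cerr << "Failed to create UCCL backend, status "
                  << static_cast<int>(status) << "\n";
        return 1;
    }
    std::cout << "UCCL backend initialized\n";
    return 0;
}
```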
Contributor

Does this mean that UCCL does not support multiple GPUs within a single engine or process?

Author

Currently, the UCCL backend does not support multiple GPUs within a single engine; it can be enabled if a use case requires it. With the current vLLM NIXL connector, each worker has its own NIXL connector, so this was not enabled.
In this configuration, UCCL uses the device_idx (GPU ID) to determine the right NIC on which the UCCL engine will be launched, based on the PCIe topology.
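
To make the GPU-to-NIC mapping concrete, here is an illustrative sketch of the general approach (not UCCL's actual code): read the GPU's PCIe address from the CUDA runtime and prefer the RDMA NIC whose PCIe address is closest to it. The sysfs paths and the prefix-matching heuristic are simplifications.

```cpp
// Illustrative sketch only (not UCCL's implementation): map device_idx to the
// RDMA NIC that sits closest to that GPU on the PCIe topology. Real code
// would compute the PCIe distance properly; here we just compare address
// prefixes as a rough proxy.
#include <cuda_runtime.h>
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>

// PCIe address of a CUDA device, lower-cased, e.g. "0000:3b:00.0".
static std::string gpuPcieAddr(int device_idx) {
    char bus_id[32] = {0};
    cudaDeviceGetPCIBusId(bus_id, sizeof(bus_id), device_idx);
    std::string s(bus_id);
    std::transform(s.begin(), s.end(), s.begin(), ::tolower);
    return s;
}

// Pick the RDMA NIC whose PCIe address shares the longest prefix with the GPU.
static std::string pickNicForGpu(int device_idx) {
    namespace fs = std::filesystem;
    const std::string gpu = gpuPcieAddr(device_idx);
    std::string best_nic;
    std::size_t best = 0;
    for (const auto &entry : fs::directory_iterator("/sys/class/infiniband")) {
        // Each NIC's sysfs "device" symlink resolves to its PCIe address.
        std::string nic = fs::read_symlink(entry.path() / "device").filename();
        std::size_t match = 0;
        while (match < gpu.size() && match < nic.size() && gpu[match] == nic[match])
            ++match;
        if (match > best) {
            best = match;
            best_nic = entry.path().filename();   // e.g. "mlx5_2"
        }
    }
    return best_nic;
}
```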

Contributor

Can the device ID be found implicitly, without introducing a new parameter (e.g., by querying CUDA, NVML, etc.)?

Author

Sure 👍🏽 I have just raised an issue in UCCL uccl-project/uccl#487 and will start working on that. I can remove the device_idx requirement for now.
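
For context, the implicit lookup suggested above could look roughly like this (an assumption, not the merged code): honor an explicit device_idx if given, otherwise fall back to the GPU bound to the caller's CUDA context.

```cpp
// Sketch of the suggested fallback (assumed helper, not part of this PR):
// infer the GPU index from the current CUDA context when no explicit
// "device_idx" backend parameter was provided.
#include <cuda_runtime.h>
#include <string>

static int resolveDeviceIdx(const std::string &explicit_idx) {
    if (!explicit_idx.empty()) {
        return std::stoi(explicit_idx);   // explicit parameter wins
    }
    int dev = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) {
        return 0;                         // conservative fallback
    }
    return dev;                           // GPU bound to this worker's context
}
```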

@@ -0,0 +1,676 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor

do you need to mention your copyright? (also relevant for other files)

Author

I added the copyright based on other plugins. I can change it accordingly if that is ok.

* limitations under the License.
*/

#include <algorithm>
Contributor

general tests in test/nixl/test_plugin.cpp are also worth enabling

Author

Sure 👍🏽 I will look into that.

Author

I have added UCCL to test_plugin.cpp.

@brminich
Contributor

General question: What is the added value of the UCCL plugin? I notice quite a few missing features (intra-node support, progress thread, multi-GPU).

@praveingk
Author

@brminich Thanks a lot for your review. UCCL's added value is its extensibility: it provides a way to program the control logic of the GPU networking stack to enable newer congestion control protocols, loss-recovery features, and multi-path transport, which are usually not easily programmable since they require NIC hardware/configuration changes. This is possible because UCCL runs the control logic on the CPU. We have seen tremendous performance improvements when running UCCL compared to NCCL, and I am currently seeing a 10% TTFT improvement with the UCCL backend in my preliminary evaluation between two cross-rack nodes.

Additionally, UCCL provides a unified stack for collectives, P2P, and EP, which makes it possible to customize the control logic based on the communication type. UCCL also brings heterogeneous GPU and NIC vendor support. Additional transport types such as TCP and TCP-X are being added to UCCL P2P.

I will be adding intra-node support (already available in UCCL) and progress-thread support to the UCCL backend in a separate PR in the coming days.

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@brminich
Contributor

Collectives are not utilized by NIXL. Could you please elaborate a bit more on the aspects of extensibility and congestion control? How would these be leveraged in NIXL use cases, particularly in terms of performance?

@praveingk
Author

UCCL currently provides various congestion control algorithms, both sender-driven (TIMELY, SWIFT) and receiver-driven (EQDS), which can be selected based on workload characteristics. Consider a use case specific to NIXL: PD disaggregation. Receiver-driven congestion control could help in scenarios where specific decode pods see bursty incast traffic from different prefill pods. In addition, UCCL emulates packet spraying to leverage the available network paths and avoid a single path of congestion, while implementing congestion control on each path separately. At scale, this could spread traffic evenly in the network core, alleviating the congestion, packet loss, and tail latency that directly contribute to TTFT in PD disaggregation. In my preliminary experiments, the UCCL backend achieved 10% lower TTFT than UCX when prefill/decode pods are allocated cross-rack, since it performs flow-splitting of a single 3 GB KV-cache message into smaller chunks that are spread across the available paths.
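
As a conceptual illustration of the flow-splitting described above (a simplified sketch, not UCCL's implementation), a large transfer can be cut into fixed-size chunks that are sprayed round-robin across the available paths, with per-path congestion control deciding when each chunk is actually sent:

```cpp
// Conceptual sketch only: plan how one large message (e.g. a 3 GB KV-cache
// transfer) is split into chunks and distributed across multiple paths, so
// no single path becomes the bottleneck.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk {
    std::size_t offset;   // byte offset into the source buffer
    std::size_t length;   // chunk length in bytes
    int         path_id;  // path (e.g. QP / source-port) carrying this chunk
};

static std::vector<Chunk> sprayAcrossPaths(std::size_t total_bytes,
                                           std::size_t chunk_bytes,
                                           int num_paths) {
    std::vector<Chunk> plan;
    int path = 0;
    for (std::size_t off = 0; off < total_bytes; off += chunk_bytes) {
        std::size_t len = std::min(chunk_bytes, total_bytes - off);
        plan.push_back({off, len, path});
        // Per-path congestion control (e.g. TIMELY/SWIFT/EQDS state per path)
        // would gate the actual send time of each chunk.
        path = (path + 1) % num_paths;
    }
    return plan;
}
```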

Finally, UCCL separates the optimization logic from the heterogeneous GPU/NIC/transport hardware logic. Hence, the same optimizations can be applied to different NICs (currently tested with NVIDIA and Broadcom NICs) and different transport types (AF_XDP-based user-space TCP, RDMA, EFA, GPU Direct TCP-X, etc.). In this first version, the UCCL backend supports RDMA, and adding more transport types is on the UCCL P2P agenda (AF_XDP-based user-space TCP and Amazon's EFA are currently available in UCCL collectives). Since the optimizations run on the CPU, they can evolve as workloads and transports evolve.
