
Conversation

@praveingk

What?

Added UCCL backend support to NIXL.

Why?

UCCL is an efficient communication library for GPUs, supporting P2P transfers, collectives, and EP (expert parallelism). UCCL focuses on flexibility for fast-evolving ML workloads and on portability for connecting heterogeneous GPU/NIC vendors. It provides a software transport stack that runs on the CPU and is easily extensible to support different communication optimization techniques such as congestion control, multipathing, and efficient loss recovery.
UCCL supports collectives, P2P communication, and GPU-driven communication for expert parallelism.

How?

  1. Added the basic UCCL plugin for inter-node transfers over RDMA, with further enhancements on the roadmap.
  2. Added a test in gtest/plugins to exercise a basic transfer.
  3. Provided references for using the UCCL backend.

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@praveingk praveingk requested a review from a team as a code owner October 13, 2025 08:47
@copy-pr-bot

copy-pr-bot bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi praveingk! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

## Usage Guide

### Additional Parameters
1. `device_idx`: Specifies which GPU the UCCL engine will be pinned to.
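
For reference, a minimal sketch of selecting the UCCL backend through the NIXL agent API might look like the following; the exact C++ signatures (`nixlAgent`, `createBackend`, `nixl_b_params_t`) are assumed from the public NIXL headers and may differ slightly:

```cpp
// Minimal sketch (assumed NIXL API surface; consult nixl.h for the
// authoritative signatures): create an agent and attach the UCCL backend,
// passing the GPU index documented above via "device_idx".
#include <iostream>
#include "nixl.h"

int main() {
    nixlAgentConfig cfg(/*useProgThread=*/false);
    nixlAgent agent("uccl_demo_agent", cfg);

    nixl_b_params_t params;          // string key/value backend parameters
    params["device_idx"] = "0";      // pin the UCCL engine to GPU 0

    nixlBackendH *backend = nullptr;
    nixl_status_t status = agent.createBackend("UCCL", params, backend);
    if (status != NIXL_SUCCESS) {
        std::cerr << "Failed to create UCCL backend, status "
                  << static_cast<int>(status) << "\n";
        return 1;
    }
    std::cout << "UCCL backend initialized\n";
    return 0;
}
```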
Contributor

Does this mean that UCCL does not support multiple GPUs within a single engine or process?

Author

Currently, the UCCL backend does not support multiple GPUs within a single engine; it can be enabled if a use case requires it. With the current vLLM NIXL connector, each worker has its own NIXL connector, so this was not enabled.
In this configuration, UCCL uses the device_idx (GPU ID) to determine the right NIC on which the UCCL engine will be launched, based on the PCIe topology.
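
To make the GPU-to-NIC mapping concrete, here is an illustrative sketch of the general approach (not UCCL's actual code): read the GPU's PCIe address from the CUDA runtime and prefer the RDMA NIC whose PCIe address is closest to it. The sysfs paths and the prefix-matching heuristic are simplifications.

```cpp
// Illustrative sketch only (not UCCL's implementation): map device_idx to the
// RDMA NIC that sits closest to that GPU on the PCIe topology. Real code
// would compute the PCIe distance properly; here we just compare address
// prefixes as a rough proxy.
#include <cuda_runtime.h>
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>

// PCIe address of a CUDA device, lower-cased, e.g. "0000:3b:00.0".
static std::string gpuPcieAddr(int device_idx) {
    char bus_id[32] = {0};
    cudaDeviceGetPCIBusId(bus_id, sizeof(bus_id), device_idx);
    std::string s(bus_id);
    std::transform(s.begin(), s.end(), s.begin(), ::tolower);
    return s;
}

// Pick the RDMA NIC whose PCIe address shares the longest prefix with the GPU.
static std::string pickNicForGpu(int device_idx) {
    namespace fs = std::filesystem;
    const std::string gpu = gpuPcieAddr(device_idx);
    std::string best_nic;
    std::size_t best = 0;
    for (const auto &entry : fs::directory_iterator("/sys/class/infiniband")) {
        // Each NIC's sysfs "device" symlink resolves to its PCIe address.
        std::string nic = fs::read_symlink(entry.path() / "device").filename();
        std::size_t match = 0;
        while (match < gpu.size() && match < nic.size() && gpu[match] == nic[match])
            ++match;
        if (match > best) {
            best = match;
            best_nic = entry.path().filename();   // e.g. "mlx5_2"
        }
    }
    return best_nic;
}
```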

Contributor

Can the device ID be found implicitly, without introducing a new parameter (e.g., by querying CUDA, NVML, etc.)?

Author

Sure 👍🏽 I have just raised an issue in UCCL uccl-project/uccl#487 and will start working on that. I can remove the device_idx requirement for now.
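
For context, the implicit lookup suggested above could look roughly like this (an assumption, not the merged code): honor an explicit device_idx if given, otherwise fall back to the GPU bound to the caller's CUDA context.

```cpp
// Sketch of the suggested fallback (assumed helper, not part of this PR):
// infer the GPU index from the current CUDA context when no explicit
// "device_idx" backend parameter was provided.
#include <cuda_runtime.h>
#include <string>

static int resolveDeviceIdx(const std::string &explicit_idx) {
    if (!explicit_idx.empty()) {
        return std::stoi(explicit_idx);   // explicit parameter wins
    }
    int dev = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) {
        return 0;                         // conservative fallback
    }
    return dev;                           // GPU bound to this worker's context
}
```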

@@ -0,0 +1,676 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor

do you need to mention your copyright? (also relevant for other files)

Author

I added the copyright based on other plugins. I can change it accordingly if that is ok.

* limitations under the License.
*/

#include <algorithm>
Contributor

general tests in test/nixl/test_plugin.cpp are also worth enabling

Author

Sure 👍🏽 I will look into that.

Author

I have added UCCL to test_plugin.cpp.

@brminich
Contributor

General question: What is the added value of the UCCL plugin? I notice quite a few missing features (intra-node support, progress thread, multi-GPU).

@praveingk
Author

@brminich Thanks a lot for your review. UCCL's added value is its extensibility: it provides a way to program the control logic of the GPU networking stack to enable newer congestion control protocols, loss-recovery features, and multi-path transport, which are usually not easily programmable since they require NIC hardware/configuration changes. This is possible because UCCL runs the control logic on the CPU. We have seen tremendous performance improvements when running UCCL compared to NCCL, and I am currently seeing a 10% TTFT improvement with the UCCL backend in my preliminary evaluation between two cross-rack nodes.

Additionally, UCCL provides a unified stack for collectives, P2P, and EP, which makes it possible to customize the control logic based on the communication type. UCCL also brings heterogeneous GPU and NIC vendor support. Additional transport types such as TCP and TCP-X are being added to UCCL P2P.

I will be adding intra-node support (already available in UCCL) and progress-thread support to the UCCL backend in a separate PR in the coming days.

Signed-off-by: Pravein Govindan Kannan <[email protected]>
@brminich
Contributor

Collectives are not utilized by NIXL. Could you please elaborate a bit more on the aspects of extensibility and congestion control? How would these be leveraged in NIXL use cases, particularly in terms of performance?

@praveingk
Author

UCCL currently provides various congestion control algorithms, both sender-driven (TIMELY, SWIFT) and receiver-driven (EQDS), which can be selected based on workload characteristics. Consider a use case specific to NIXL: PD disaggregation. Receiver-driven congestion control could help in scenarios where specific decode pods see bursty incast traffic from different prefill pods. In addition, UCCL emulates packet spraying to leverage the available network paths and avoid a single path of congestion, while implementing congestion control on each path separately. At scale, this could spread traffic evenly in the network core, alleviating the congestion, packet loss, and tail latency that directly contribute to TTFT in PD disaggregation. In my preliminary experiments, the UCCL backend achieved 10% lower TTFT than UCX when prefill/decode pods are allocated cross-rack, since it performs flow-splitting of a single 3 GB KV-cache message into smaller chunks that are spread across the available paths.
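
As a conceptual illustration of the flow-splitting described above (a simplified sketch, not UCCL's implementation), a large transfer can be cut into fixed-size chunks that are sprayed round-robin across the available paths, with per-path congestion control deciding when each chunk is actually sent:

```cpp
// Conceptual sketch only: plan how one large message (e.g. a 3 GB KV-cache
// transfer) is split into chunks and distributed across multiple paths, so
// no single path becomes the bottleneck.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk {
    std::size_t offset;   // byte offset into the source buffer
    std::size_t length;   // chunk length in bytes
    int         path_id;  // path (e.g. QP / source-port) carrying this chunk
};

static std::vector<Chunk> sprayAcrossPaths(std::size_t total_bytes,
                                           std::size_t chunk_bytes,
                                           int num_paths) {
    std::vector<Chunk> plan;
    int path = 0;
    for (std::size_t off = 0; off < total_bytes; off += chunk_bytes) {
        std::size_t len = std::min(chunk_bytes, total_bytes - off);
        plan.push_back({off, len, path});
        // Per-path congestion control (e.g. TIMELY/SWIFT/EQDS state per path)
        // would gate the actual send time of each chunk.
        path = (path + 1) % num_paths;
    }
    return plan;
}
```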

Finally, UCCL separates the optimization logic from the heterogeneous GPU/NIC/transport hardware logic. Hence, the same optimizations can be applied to different NICs (currently tested with NVIDIA and Broadcom NICs) and different transport types (AF_XDP-based user-space TCP, RDMA, EFA, GPU Direct TCP-X, etc.). In this first version, the UCCL backend supports RDMA, and adding more transport types is on the UCCL P2P agenda (AF_XDP-based user-space TCP and Amazon's EFA are currently available in UCCL collectives). Since the optimizations run on the CPU, they can evolve as workloads and transports evolve.
