112 changes: 60 additions & 52 deletions README.md
@@ -22,15 +22,15 @@ pip install nixl
## Prerequisites for source build
### Ubuntu:

`$ sudo apt install build-essential cmake pkg-config`
`sudo apt install build-essential cmake pkg-config`

### Fedora:

`$ sudo dnf install gcc-c++ cmake pkg-config`
`sudo dnf install gcc-c++ cmake pkg-config`

### Python

`$ pip3 install meson ninja pybind11`
`pip3 install meson ninja pybind11`

### UCX

@@ -39,47 +39,51 @@ NIXL was tested with UCX version 1.19.0.
[GDRCopy](https://github.com/NVIDIA/gdrcopy) is available on GitHub and is needed for maximum performance, but UCX and NIXL work without it.

```
$ wget https://github.com/openucx/ucx/releases/download/v1.19.0/ucx-1.19.0.tar.gz
$ tar xzf ucx-1.19.0.tar.gz
$ cd ucx-1.19.0
$ ./configure \
--enable-shared \
--disable-static \
--disable-doxygen-doc \
--enable-optimizations \
--enable-cma \
--enable-devel-headers \
--with-cuda=<cuda install> \
--with-verbs \
--with-dm \
--with-gdrcopy=<gdrcopy install> \
--enable-mt
$ make -j
$ make -j install-strip
$ ldconfig
wget https://github.com/openucx/ucx/releases/download/v1.19.0/ucx-1.19.0.tar.gz
tar xzf ucx-1.19.0.tar.gz
cd ucx-1.19.0
./configure \
--enable-shared \
--disable-static \
--disable-doxygen-doc \
--enable-optimizations \
--enable-cma \
--enable-devel-headers \
--with-cuda=<cuda install> \
--with-verbs \
--with-dm \
--with-gdrcopy=<gdrcopy install> \
--enable-mt
make -j
make -j install-strip
ldconfig
```

### ETCD (Optional)
NIXL can use ETCD for metadata distribution and coordination between nodes in distributed environments. To use ETCD with NIXL:
#### ETCD Server and Client
```
$ sudo apt install etcd etcd-server etcd-client
sudo apt install etcd etcd-server etcd-client

# Or use Docker
$ docker run -d -p 2379:2379 quay.io/coreos/etcd:v3.5.1
docker run -d -p 2379:2379 -p 2380:2380 \
--name etcd-server \
-e ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379 \
-e ETCD_ADVERTISE_CLIENT_URLS=http://0.0.0.0:2379 \
quay.io/coreos/etcd:v3.5.1
```
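
Before starting NIXL agents, it can help to confirm that the etcd client port published above is reachable. The following is a minimal Python sketch, not part of NIXL: it assumes the endpoints are listed in the `NIXL_ETCD_ENDPOINTS` environment variable as comma-separated URLs (the comma-separated form is an assumption), and the helper names `parse_endpoints` and `is_reachable` are hypothetical.

```python
import os
import socket
from urllib.parse import urlsplit

# Same default endpoint as the Python example in this change.
DEFAULT_ENDPOINT = "http://127.0.0.1:2379"


def parse_endpoints(raw):
    """Split a comma-separated endpoint string into (host, port) pairs."""
    pairs = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        # Accept bare "host" or "host:port" by adding a scheme for urlsplit.
        url = urlsplit(item if "://" in item else "http://" + item)
        pairs.append((url.hostname, url.port or 2379))
    return pairs


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    raw = os.getenv("NIXL_ETCD_ENDPOINTS", DEFAULT_ENDPOINT)
    for host, port in parse_endpoints(raw):
        status = "reachable" if is_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the listener is a healthy etcd server.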

#### ETCD CPP API
Install from https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3:

```
$ sudo apt install libgrpc-dev libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
$ sudo apt install libcpprest-dev
$ git clone https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3.git
$ cd etcd-cpp-apiv3
$ mkdir build && cd build
$ cmake ..
$ make -j$(nproc) && make install
sudo apt install libgrpc-dev libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
sudo apt install libcpprest-dev
git clone https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3.git
cd etcd-cpp-apiv3
mkdir build && cd build
cmake ..
make -j$(nproc) && make install
```

### Additional plugins
@@ -94,35 +98,35 @@ Some plugins may have additional build requirements, see them here:
### Build & install

```
$ meson setup <name_of_build_dir>
$ cd <name_of_build_dir>
$ ninja
$ ninja install
meson setup <name_of_build_dir>
cd <name_of_build_dir>
ninja
ninja install
```

### Build Options

#### Release build

```bash
$ meson setup <name_of_build_dir> --buildtype=release
meson setup <name_of_build_dir> --buildtype=release
```

#### Debug build (default)

```bash
$ meson setup <name_of_build_dir>
meson setup <name_of_build_dir>
```

#### NIXL-specific build options

```bash
# Example with custom options:
#   build_docs=true             build Doxygen documentation
#   ucx_path=/path/to/ucx       custom UCX installation path
#   install_headers=true        install development headers
#   disable_gds_backend=false   enable GDS backend
# (A comment placed after a trailing backslash would break the line continuation.)
$ meson setup <name_of_build_dir> \
    -Dbuild_docs=true \
    -Ducx_path=/path/to/ucx \
    -Dinstall_headers=true \
    -Ddisable_gds_backend=false
meson setup <name_of_build_dir> \
    -Dbuild_docs=true \
    -Ducx_path=/path/to/ucx \
    -Dinstall_headers=true \
    -Ddisable_gds_backend=false
```
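
When the setup command is generated from a script, the options can live in a dictionary, which also leaves room for a comment next to each one. The sketch below is illustrative only: the helper name `meson_setup_command` and the build directory name `build` are assumptions, while the option names are the ones used in the example above.

```python
import shlex


def meson_setup_command(build_dir, options, buildtype=None):
    """Return the argv list for `meson setup` with the given -D options."""
    argv = ["meson", "setup", build_dir]
    if buildtype is not None:
        argv.append(f"--buildtype={buildtype}")
    for name, value in options.items():
        # Meson expects lowercase true/false for boolean options.
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv.append(f"-D{name}={value}")
    return argv


options = {
    "build_docs": True,            # build Doxygen documentation
    "ucx_path": "/path/to/ucx",    # custom UCX installation path
    "install_headers": True,       # install development headers
    "disable_gds_backend": False,  # keep the GDS backend enabled
}

cmd = meson_setup_command("build", options, buildtype="release")
print(shlex.join(cmd))
# To actually configure the build: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting entirely; `shlex.join` is used here only to print a copy-pasteable command line.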

Common build options:
@@ -139,9 +143,9 @@ If you have Doxygen installed, you can build the documentation:

```bash
# Configure with documentation enabled
$ meson setup <name_of_build_dir> -Dbuild_docs=true
$ cd <name_of_build_dir>
$ ninja
meson setup <name_of_build_dir> -Dbuild_docs=true
cd <name_of_build_dir>
ninja

# Documentation will be generated in <name_of_build_dir>/html
# After installation (ninja install), documentation will be available in <prefix>/share/doc/nixl/
@@ -172,19 +176,19 @@ For Python examples, see [examples/python/](examples/python/).
- Use `-Ddebug=false` for a release build.
- Or build manually:
```bash
$ cargo build --release
cargo build --release
```
#### Install
The bindings are installed under `nixl-sys` in the configured installation prefix.
Install them with ninja from the project build directory:
```bash
$ ninja install
ninja install
```

#### Test
```
# Rust bindings tests
$ cargo test
cargo test
```

Use in your project by adding to `Cargo.toml`:
@@ -201,25 +205,25 @@ To build the docker container, first clone the current repository. Also make sur

Run the following from the root folder of the cloned NIXL repository:
```
$ ./contrib/build-container.sh
./contrib/build-container.sh
```

By default, the container is built with Ubuntu 24.04. To build a container for Ubuntu 22.04 use the --os option as follows:
```
$ ./contrib/build-container.sh --os ubuntu22
./contrib/build-container.sh --os ubuntu22
```

To see all the options supported by the build script, use:
```
$ ./contrib/build-container.sh -h
./contrib/build-container.sh -h
```

The container also includes a prebuilt Python wheel in `/workspace/dist`, in case it is needed for installation or distribution. The wheel can also be built with a separate script (see below).

### Building the python wheel
The contrib folder also includes a script to build the Python wheel with the UCX dependencies. Note that UCX and the other NIXL dependencies must already be installed.
```
$ ./contrib/build-wheel.sh
./contrib/build-wheel.sh
```

## Running with ETCD
@@ -242,7 +246,11 @@ NIXL includes an example demonstrating metadata exchange and data transfer using
```bash
# Start an ETCD server if not already running
# For example:
# docker run -d -p 2379:2379 quay.io/coreos/etcd:v3.5.1
# docker run -d -p 2379:2379 -p 2380:2380 \
# --name etcd-server \
# -e ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379 \
# -e ETCD_ADVERTISE_CLIENT_URLS=http://0.0.0.0:2379 \
# quay.io/coreos/etcd:v3.5.1

# Set the ETCD env variables as above

158 changes: 158 additions & 0 deletions examples/python/nixl_etcd_example.py
@@ -0,0 +1,158 @@
#!/usr/bin/env python3

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import time

import nixl._utils as nixl_utils
from nixl._api import nixl_agent, nixl_agent_config
from nixl.logging import get_logger

logger = get_logger(__name__)

# Configuration - Change these values to match your etcd setup
ETCD_ENDPOINT = "http://127.0.0.1:2379"
AGENT1_NAME = "EtcdAgent1"
AGENT2_NAME = "EtcdAgent2"


def register_memory(agent, backend_name, pattern):
    buffer_size = 1024
    addr = nixl_utils.malloc_passthru(buffer_size)

    # Build a pattern string; it is passed as the fourth element of the
    # registration tuple below (the buffer itself is not written here)
    data = pattern * buffer_size
    reg_descs = agent.get_reg_descs([(addr, buffer_size, 0, data)], "DRAM")

    # Register memory
    agent.register_memory(reg_descs, backends=[backend_name])

    logger.info(f"Registered memory {hex(addr)} with agent {agent.name}")
    return addr, reg_descs


def main():
    # Set etcd endpoint if not already set
    if os.getenv("NIXL_ETCD_ENDPOINTS"):
        logger.info("NIXL_ETCD_ENDPOINTS is set")
    else:
        logger.info(f"NIXL_ETCD_ENDPOINTS is not set, setting to {ETCD_ENDPOINT}")
        os.environ["NIXL_ETCD_ENDPOINTS"] = ETCD_ENDPOINT

    logger.info("NIXL Etcd Metadata Example")

    # ===== 1. Create two agents (normally these would be in separate processes or machines) =====
    agent1_config = nixl_agent_config(backends=["UCX"])
    agent1 = nixl_agent(AGENT1_NAME, agent1_config)

    agent2_config = nixl_agent_config(backends=["UCX"])
    agent2 = nixl_agent(AGENT2_NAME, agent2_config)

    logger.info(f"Available plugins: {agent1.plugin_list}")

    # Get plugin parameters
    logger.info(
        "Plugin parameters:\n%s\n%s",
        agent1.get_plugin_mem_types("UCX"),
        agent1.get_plugin_params("UCX"),
    )

    logger.info(
        "Backend parameters:\n%s\n%s",
        agent1.get_backend_mem_types("UCX"),
        agent1.get_backend_params("UCX"),
    )

    # ===== 2. Register memory with both agents =====
    addr1, reg_descs1 = register_memory(agent1, "UCX", "a")
    addr2, reg_descs2 = register_memory(agent2, "UCX", "b")

    # ===== 3. Send Local Metadata to etcd =====
    logger.info("Sending local metadata to etcd...")

    # Both agents send their metadata to etcd
    agent1.send_local_metadata()
    agent2.send_local_metadata()

    # Give etcd time to process
    time.sleep(1)

    # ===== 4. Fetch Remote Metadata from etcd =====
    logger.info("Fetching remote metadata from etcd...")

    # Agent1 fetches metadata for Agent2
    agent1.fetch_remote_metadata(AGENT2_NAME)

    # Agent2 fetches metadata for Agent1
    agent2.fetch_remote_metadata(AGENT1_NAME)

    # Wait for metadata to be available (fetch_remote_metadata is asynchronous)
    while not (
        agent1.check_remote_metadata(AGENT2_NAME)
        and agent2.check_remote_metadata(AGENT1_NAME)
    ):
        time.sleep(0.5)

    logger.info("Metadata exchange successful!")

    # ===== 5. Do transfer from Agent 1 to Agent 2 =====
    req_size = 8
    dst_offset = 8

    logger.info(f"Agent1's address: {hex(addr1)}")
    logger.info(f"Agent2's address: {hex(addr2)}")

    # Create transfer descriptors
    req_src_descs = agent1.get_xfer_descs(
        [(addr1 + 16, req_size, 0)], "DRAM"
    )  # random offset
    req_dst_descs = agent2.get_xfer_descs(
        [(addr2 + dst_offset, req_size, 0)], "DRAM"
    )  # random offset

    logger.info(f"Transfer request from {hex(addr1 + 16)} to {hex(addr2 + dst_offset)}")

    # Create and post transfer request with notification
    xfer_handle = agent1.initialize_xfer(
        "WRITE", req_src_descs, req_dst_descs, AGENT2_NAME, b"notification"
    )
    logger.info("Transfer request created")
    state = agent1.transfer(xfer_handle)
    logger.info(f"Transfer was posted, initial state: {state}")

    # Wait for transfer completion and notification
    notifs = {}
    while state != "DONE" or len(notifs) == 0:
        if state != "DONE":
            state = agent1.check_xfer_state(xfer_handle)
        if len(notifs) == 0:
            notifs = agent2.get_new_notifs()
        time.sleep(0.5)

    logger.info(f"Received notifications: {notifs}")
    logger.info("Transfer verified")

    # Release transfer handle
    agent1.release_xfer_handle(xfer_handle)

    # Deregister memory
    agent1.deregister_memory(reg_descs1, backends=["UCX"])
    agent2.deregister_memory(reg_descs2, backends=["UCX"])


if __name__ == "__main__":
    main()
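
Both polling loops in `main()` (waiting for remote metadata, and waiting for the transfer plus notification) keep spinning if etcd is unreachable or the notification never arrives. One possible hardening, sketched here as a suggestion and not part of this change, is a small polling helper with a deadline. The name `wait_until` and its default values are hypothetical.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it is truthy or `timeout` seconds have passed.

    Returns True if the predicate succeeded, False if the deadline expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible use in the example (agent names as defined in the file above):
#
# ok = wait_until(
#     lambda: agent1.check_remote_metadata(AGENT2_NAME)
#     and agent2.check_remote_metadata(AGENT1_NAME)
# )
# if not ok:
#     raise TimeoutError("remote metadata did not become available")
```

`time.monotonic()` is used for the deadline so that system clock adjustments do not shorten or extend the wait.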