
Conversation

@tstamler (Contributor):

What?

Update the remote storage example with pipelining code and more details in README.

copy-pr-bot bot commented Oct 14, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions (bot) commented:
👋 Hi tstamler! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

tstamler force-pushed the remote_storage_pipeline branch from 1111e18 to c3f3cdc on October 14, 2025 15:13
tstamler marked this pull request as ready for review on October 17, 2025 14:20
Signed-off-by: Timothy Stamler <[email protected]>
tstamler force-pushed the remote_storage_pipeline branch from b2c6c33 to 805df1e on October 17, 2025 14:20
tstamler force-pushed the remote_storage_pipeline branch from cb625ac to 93ab2a6 on October 17, 2025 14:41
A reviewer (Contributor) commented:

Maybe it would be better to have some background for this image or change the color of the black titles, as they become almost invisible in dark mode

[image]

A reviewer (Contributor) commented:

+1 to @roiedanino comment

Comment on lines +290 to +292
end_time = time.time()

elapsed = end_time - start_time
A reviewer (Contributor) commented:

Suggested change:
- end_time = time.time()
- elapsed = end_time - start_time
+ elapsed = time.time() - start_time

Comment on lines +308 to +310
end_time = time.time()

elapsed = end_time - start_time
A reviewer (Contributor) commented:

Suggested change:
- end_time = time.time()
- elapsed = end_time - start_time
+ elapsed = time.time() - start_time

Comment on lines +369 to +372
mem = "DRAM"

if args.role == "client":
mem = "VRAM"
A reviewer (Contributor) commented:

Don't we have some constants for "DRAM", "VRAM", "GDS_MT" and so on?

nixl_mem_reg_list = []
nixl_file_reg_list = []

if mem == "VRAM":
A reviewer (Contributor) commented:

Maybe it would be better to use constants for memory/storage types
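
For illustration, a minimal sketch of the kind of constants being suggested, assuming none already exist in the example (the names below are hypothetical):

    # Hypothetical module-level constants instead of bare strings.
    MEM_DRAM = "DRAM"
    MEM_VRAM = "VRAM"
    BACKEND_GDS_MT = "GDS_MT"

    # e.g. in the example's setup code:
    mem = MEM_VRAM if args.role == "client" else MEM_DRAM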


elapsed = end_time - start_time

logger.info(f"Time for {iterations} READ iterations: {elapsed} seconds")
A reviewer (Contributor) commented:

Suggested change:
- logger.info(f"Time for {iterations} READ iterations: {elapsed} seconds")
+ logger.info("Time for %s READ iterations: %s seconds", iterations, elapsed)

We shouldn't use f-strings in logger calls; the string is formatted eagerly even when the message is filtered out, and pylint warns about it.
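
For context, a small self-contained illustration of the difference (the logger name and values below are made up):

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("remote_storage_example")

    iterations, elapsed = 10, 1.23

    # The f-string is formatted before logger.debug is even called, so the work
    # is wasted whenever DEBUG is filtered out.
    logger.debug(f"elapsed={elapsed}")

    # With %s formatting, interpolation is deferred until the record is emitted.
    logger.info("Time for %s READ iterations: %s seconds", iterations, elapsed)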

s += 1
continue

if s == iterations:
A reviewer (Contributor) commented:

How can this flow happen with the s < iterations while condition?

Comment on lines +106 to +118
if s == 0:
futures.append(
executor.submit(
execute_transfer,
my_agent,
my_mem_descs,
my_file_descs,
my_agent.name,
"READ",
)
)
s += 1
continue
A reviewer (Contributor) commented:

Maybe we can move this before the loop, initializing s to 1?

I think it would simplify the loop and help avoid a branch
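
Roughly, the restructuring being suggested, as a sketch only (assuming the surrounding loop matches the quoted snippet):

    # Submit the first READ before entering the loop, so the loop body no longer
    # needs the s == 0 special case.
    futures.append(
        executor.submit(
            execute_transfer,
            my_agent,
            my_mem_descs,
            my_file_descs,
            my_agent.name,
            "READ",
        )
    )
    s = 1
    while s < iterations:
        # ... remaining pipeline stages, unchanged ...
        s += 1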


end_time = time.time()

logger.info(f"Time for {iterations} iterations: {end_time - start_time} seconds")
A reviewer (Contributor) commented:

f-string -> %s logger formatting

### Pipelining

To improve the performance of the remote storage server, we can pipeline operations to the network and to storage. This pipelining allows multiple threads to handle each request. However, to maintain correctness, the network and storage steps of each individual remote storage operation must still execute in order. To achieve this, we implemented a simple pipelining scheme.
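As a minimal illustration of this kind of scheme (not the example's actual code; handle_request, do_network_transfer, do_storage_io, and requests are hypothetical names), each request is handled by a single task that runs its network step and then its storage step in order, while the thread pool lets different requests overlap:

    from concurrent.futures import ThreadPoolExecutor

    def do_network_transfer(req):
        # Stand-in for the request's network step (e.g. receiving the payload).
        return f"payload-{req}"

    def do_storage_io(req, data):
        # Stand-in for the request's storage step (e.g. writing the payload).
        print(f"request {req}: stored {data}")

    def handle_request(req):
        # The two steps of one request stay ordered inside this task.
        data = do_network_transfer(req)
        do_storage_io(req, data)

    requests = range(8)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Different requests overlap across the worker threads.
        for f in [pool.submit(handle_request, r) for r in requests]:
            f.result()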

A reviewer (Contributor) commented:

@tstamler can you please add some more description of how this is implemented with NIXL in this example?


### Performance Tips

For high-speed storage and network hardware, you may need to tweak performance with a couple of environment variables.
A reviewer (Contributor) commented:

Please provide a suggested list of env vars (for example, a sample configuration for CX-7 and SPX, and the UCX/GDS tuning that has been seen to be beneficial).


For high-speed storage and network hardware, you may need to tweak performance with a couple of environment variables.

First, for optimal GDS performance, ensure you are using the GDS_MT backend with default concurrency. Additionally, you can use the cufile options described in the [GDS README](https://github.com/ai-dynamo/nixl/blob/main/src/plugins/cuda_gds/README.md).
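
Purely as an illustration of the kind of tuning involved (UCX_TLS, UCX_NET_DEVICES, and CUFILE_ENV_PATH_JSON are standard UCX/cuFile variables, but the values below are assumptions, and what actually helps depends on the hardware), these would need to be set before the NIXL agent and UCX are initialized:

    import os

    # Restrict UCX to the transports of interest (assumed: RDMA plus CUDA copy/IPC).
    os.environ.setdefault("UCX_TLS", "rc,cuda_copy,cuda_ipc")
    # Pin UCX to a specific HCA port, e.g. a CX-7 device (assumed device name).
    os.environ.setdefault("UCX_NET_DEVICES", "mlx5_0:1")
    # Point cuFile/GDS at a tuned cufile.json (see the GDS README for the options).
    os.environ.setdefault("CUFILE_ENV_PATH_JSON", "/etc/cufile.json")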
A reviewer (Contributor) commented:

We should also highlight that the GDS plugins work in compat mode by default, how that differs from true GDS, and what is required for true GDS support.

A reviewer (Contributor) commented:

+1 to @roiedanino comment

