- 
                Notifications
    You must be signed in to change notification settings 
- Fork 174
Remote storage pipeline #899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| 👋 Hi tstamler! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 | 
1111e18    to
    c3f3cdc      
    Compare
  
    Signed-off-by: Timothy Stamler <[email protected]>
b2c6c33    to
    805df1e      
    Compare
  
    Signed-off-by: Timothy Stamler <[email protected]>
cb625ac    to
    93ab2a6      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @roiedanino comment
| end_time = time.time() | ||
|  | ||
| elapsed = end_time - start_time | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| end_time = time.time() | |
| elapsed = end_time - start_time | |
| elapsed = time.time() - start_time | 
| end_time = time.time() | ||
|  | ||
| elapsed = end_time - start_time | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| end_time = time.time() | |
| elapsed = end_time - start_time | |
| elapsed = time.time() - start_time | 
| mem = "DRAM" | ||
|  | ||
| if args.role == "client": | ||
| mem = "VRAM" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have some constants for "DRAM", "VRAM", "GDS_MT" and so on?
| nixl_mem_reg_list = [] | ||
| nixl_file_reg_list = [] | ||
|  | ||
| if mem == "VRAM": | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be better to use constants for memory/storage types
|  | ||
| elapsed = end_time - start_time | ||
|  | ||
| logger.info(f"Time for {iterations} READ iterations: {elapsed} seconds") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| logger.info(f"Time for {iterations} READ iterations: {elapsed} seconds") | |
| logger.info("Time for %s READ iterations: %s seconds", iterations, elapsed) | 
We shouldn't use f-strings in loggers, it's not optimized, pylint would warn about it
| s += 1 | ||
| continue | ||
|  | ||
| if s == iterations: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can this flow happen with the s < iterations while condition?
| if s == 0: | ||
| futures.append( | ||
| executor.submit( | ||
| execute_transfer, | ||
| my_agent, | ||
| my_mem_descs, | ||
| my_file_descs, | ||
| my_agent.name, | ||
| "READ", | ||
| ) | ||
| ) | ||
| s += 1 | ||
| continue | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can move this line before the loop, initiating s to 1?
I think it would simplify the loop and help avoid a branch
|  | ||
| end_time = time.time() | ||
|  | ||
| logger.info(f"Time for {iterations} iterations: {end_time - start_time} seconds") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f-string -> %s logger formatting
| ### Pipelining | ||
|  | ||
| To improve performance of the remote storage server, we can pipeline operations to network and storage. This pipelining allows multiple threads to handle each request. However, in order to maintain correctness, the order of network and storage must happen in order for each individual remote storage operation. To do this, we implemented a simple pipelining scheme. | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tstamler please can you add some more description on how this is implemented with NIXL in this example?
|  | ||
| ### Performance Tips | ||
|  | ||
| For high-speed storage and network hardware, you may need to tweak performance with a couple of environment variables. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please provide suggested list of env - vars, (like providing example configuration with CX-7, SPX and providing what UCX tuning/GDS tuning that is seen to be beneficial?
|  | ||
| For high-speed storage and network hardware, you may need to tweak performance with a couple of environment variables. | ||
|  | ||
| First, for optimal GDS performance, ensure you are using the GDS_MT backend with default concurrency. Additionally, you can use the cufile options described in the [GDS README](https://github.com/ai-dynamo/nixl/blob/main/src/plugins/cuda_gds/README.md). | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also highlight the difference between the GDS backend and that GDS plugins can work in compat mode by default, and what is required for true GDS support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @roiedanino comment

What?
Update the remote storage example with pipelining code and more details in README.