[Core] Integrate fastsafetensors loader for loading model weights
#10647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @manish-sethi, this PR is very useful. Thank you for creating it.
njhill
left a comment
Thanks @manish-sethi, this looks great!
It would be awesome to see how this compares to tensorizer and runai-model-streamer performance-wise. Maybe you could add a couple more columns to the table?
For measuring the performance (of all of these) it would probably be best to have a test that loads the model and then immediately performs a generate request of a single token (i.e. time to first token). That makes sure there isn't "hidden" latent loading time (while of course also ensuring the file system cache is cold, as usual).
Could you also add a CI test - you can look at what there is for the other model loaders as an example.
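The measurement described above can be sketched as a small helper that times model load plus a single-token generate as one interval. This is an illustrative sketch, not code from the PR: the `load_model` and `generate_one_token` callables are hypothetical placeholders that a real benchmark would wrap around the vLLM engine, after dropping the OS page cache so the load is cold.

```python
import time

def time_to_first_token(load_model, generate_one_token):
    """Time model load plus one single-token generate as a single interval,
    so any lazy or deferred weight loading is included in the measurement.

    Both arguments are hypothetical callables standing in for the real
    vLLM load and generate steps."""
    start = time.perf_counter()
    model = load_model()
    generate_one_token(model)
    return time.perf_counter() - start
```

In a real run, `load_model` would construct the engine with the loader under test and `generate_one_token` would issue a `max_tokens=1` request.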
njhill
left a comment
Thanks @manish-sethi these updates look really good.
Just one minor inline comment, and also I think you will need to add the dependency to this file for the tests.
If you take the PR out of draft after the above changes we can start to run the tests in the CI.
Also, this is mainly a curiosity, but for completeness, any chance of adding a column for tensorizer to the new table?
Signed-off-by: Manish Sethi <[email protected]>
Added the dependency in test.ini. Can it be assumed that the test machine will have GPUs?
Took it out of draft.
I skipped the
In my tests, integrating this PR into the vLLM code and running it, it is much slower than the safetensors loader when GDS is not installed.
njhill
left a comment
Thanks @manish-sethi for the great work!
WDYT about @manoelmarques's comment? Would it be worth incorporating a check for GDS?
I am not sure there is a straightforward way to test that programmatically. I guess it would be better to open an issue against the fastsafetensors repo so it can be exposed as an API. It may require some investigation, and also, the fastsafetensors repo is a better place to maintain that code than the vllm repo, for reusability and maintenance purposes.
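Pending an official API in fastsafetensors, a rough heuristic for the GDS check discussed above might look like the following. This is an illustrative sketch, not code from the PR or from fastsafetensors: it only probes for the nvidia-fs kernel driver via `/proc`, which is a Linux-specific assumption and not a definitive test of a working GDS setup.

```python
from pathlib import Path

def gds_likely_available() -> bool:
    """Heuristic (not an official fastsafetensors API): report whether the
    nvidia-fs kernel driver, which backs GPUDirect Storage, appears loaded."""
    # The driver typically exposes a /proc entry when present.
    if Path("/proc/driver/nvidia-fs").exists():
        return True
    # Fall back to scanning the loaded-modules list.
    try:
        with open("/proc/modules") as f:
            return any(line.startswith("nvidia_fs") for line in f)
    except OSError:
        return False
```

A loader integration could use such a check to fall back to the standard safetensors path (or emit a warning) when GDS is absent, given the slowdown reported above.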
…llm-project#10647) Signed-off-by: Manish Sethi <[email protected]>
@manish-sethi / @njhill This PR needs to skip tests on AMD; fixing in #27612
This PR integrates the fastsafetensors library for loading model weights. fastsafetensors uses GPU Direct Storage to load the weights directly from storage into GPU memory, making it faster than reading parameters iteratively from the file. See https://github.com/foundation-model-stack/fastsafetensors for more details.
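The difference between the two loading paths described above can be illustrated with a small, self-contained sketch. This is conceptual code, not the real fastsafetensors implementation: the standard path issues one host-side read per tensor (each followed by a device copy), while a GDS-style path moves the checkpoint in one large direct transfer and then carves out per-tensor views without extra copies.

```python
import io

def iterative_load(f, sizes):
    """Standard path: one host-side read per tensor (then a device copy each).
    Conceptual stand-in for the existing safetensors loading loop."""
    return [f.read(n) for n in sizes]

def bulk_load(f, sizes):
    """GDS-style path: one large transfer, then zero-copy views per tensor.
    Conceptual stand-in for a direct storage-to-GPU load."""
    blob = memoryview(f.read(sum(sizes)))
    tensors, offset = [], 0
    for n in sizes:
        tensors.append(bytes(blob[offset:offset + n]))
        offset += n
    return tensors
```

Both paths recover the same tensors; the win comes from replacing many small read-then-copy round trips with one direct transfer.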
Following are the experimental results for the time taken to load the weights from NVMe using the existing safetensors loader versus the fastsafetensors loader.
| safetensors loader (sec) | fastsafetensors loader (sec) |
| --- | --- |
Additional performance numbers comparing the three loaders: