Conversation

@manish-sethi (Contributor) commented Nov 26, 2024

This PR integrates the fastsafetensors library for loading model weights. fastsafetensors uses GPU Direct Storage (GDS) to load weights directly from storage into GPU memory, making it faster than reading parameters iteratively from file. See https://github.com/foundation-model-stack/fastsafetensors for more details.
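For context, standalone usage of fastsafetensors looks roughly like the sketch below, adapted from the project README; the specific names (`SingleGroup`, `add_filenames`, `copy_files_to_device`, `get_tensor`) are taken from that documentation rather than from this PR:

```python
# Minimal standalone sketch of fastsafetensors usage, adapted from its
# README; the exact API names here are assumptions from that documentation.
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

device = torch.device("cuda:0")
loader = SafeTensorsFileLoader(SingleGroup(), device)
# Map each rank to the files it should read; single-process here, so rank 0.
loader.add_filenames({0: ["model-00001-of-00002.safetensors",
                          "model-00002-of-00002.safetensors"]})
try:
    # Copies file contents straight into GPU memory (via GDS when available)
    # and returns a buffer that hands out torch tensors by name.
    fb = loader.copy_files_to_device()
    weight = fb.get_tensor("model.embed_tokens.weight")
    print(weight.shape, weight.device)
finally:
    loader.close()
```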

Below are experimental results for the time taken to load weights from NVMe using the existing safetensors loader versus the fastsafetensors loader.

| Model | Tensor parallel | GPU | Time with safetensors loader (sec) | Time with fastsafetensors loader (sec) |
|---|---|---|---|---|
| Llama-2-7b-hf | 1 | A100 | 7.29 | 3.67 |
| Llama-2-13b-hf | 1 | A100 | 16.04 | 6.88 |
| Llama-2-13b-hf | 2 | A100 | 14.35 | 6.02 |
| Llama-2-13b-hf | 4 | L40S | 12.39 | 4.74 |
  • The model files are loaded from storage, i.e., not present in the filesystem cache
  • The time measures only loading the model weights within the vLLM model-initialization path

Additional performance numbers comparing three different loaders:

| Tensor parallel | safetensors loader (sec) | runai_streamer loader (sec) | fastsafetensors loader (sec) |
|---|---|---|---|
| 1 | 50.05 | 36.63 | 42.40 |
| 2 | 55.98 | 45.86 | 42.19 |
| 4 | 60.08 | 59.98 | 53.50 |
  • The above numbers are for Llama-2-13b-hf on L40S GPUs
  • The timing is end to end, from loading and initializing the model through generation of the first token, in eager mode
  • As before, the model files are loaded from storage, i.e., not present in the filesystem cache
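For reference, selecting the new loader from vLLM should look roughly like the sketch below; the `load_format` value is an assumption about how this PR exposes the loader, so check the merged docs for the final spelling:

```python
from vllm import LLM

# Hypothetical usage sketch: the load_format value "fastsafetensors" is an
# assumption about how this PR exposes the new loader; the merged docs are
# the authoritative reference.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    load_format="fastsafetensors",
    tensor_parallel_size=2,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```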

@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify bot added the documentation (Improvements or additions to documentation) and ci/build labels Nov 26, 2024
@mergify (bot) commented Jan 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @manish-sethi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@manoelmarques (Contributor)

Hi @manish-sethi,

This PR is very useful; thank you for creating it. Do you intend to finish it? It seems to me it is basically done, right? I have used it in some tests.
Also, there are no reviewers added; adding reviewers would speed up the process.

@manish-sethi force-pushed the fastsafetensor_loader branch from 3c1f242 to b4adb91 on March 3, 2025 00:48
@mergify bot removed the needs-rebase label Mar 3, 2025
@manish-sethi force-pushed the fastsafetensor_loader branch 6 times, most recently from f68ae35 to 664a5ab, on March 3, 2025 06:02
@njhill (Member) left a comment

Thanks @manish-sethi, this looks great!

It would be awesome to see how this compares to tensorizer and runai-model-streamer performance-wise. Maybe you could add a couple more columns to the table?

For measuring the performance (of all of these) it would probably be best to have a test that loads the model and then immediately performs a generate request for a single token (i.e., time to first token), to make sure there isn't "hidden" latent loading time (while of course also ensuring the filesystem cache is cold, as usual).
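Concretely, a cold-cache, time-to-first-token measurement could look like the rough sketch below (dropping the page cache requires root, and the model path is a placeholder):

```python
import subprocess
import time

from vllm import LLM, SamplingParams

# Drop the page cache so the weights really come from storage (needs root).
subprocess.run(["sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"],
               check=True)

start = time.perf_counter()
llm = LLM(model="/path/to/Llama-2-13b-hf", enforce_eager=True)
# Generating a single token flushes any deferred/lazy loading work.
llm.generate(["Hello"], SamplingParams(max_tokens=1))
print(f"load + first token: {time.perf_counter() - start:.2f}s")
```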

Could you also add a CI test? You can look at the existing tests for the other model loaders as an example.

@manish-sethi force-pushed the fastsafetensor_loader branch from 664a5ab to ca0f33c on March 15, 2025 00:34
@mergify (bot) commented Mar 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @manish-sethi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@njhill (Member) left a comment

Thanks @manish-sethi, these updates look really good.

Just one minor inline comment; also, I think you will need to add the dependency to this file for the tests.

If you take the PR out of draft after the above changes, we can start to run the tests in the CI.

Also, this is mainly a curiosity, but for completeness: any chance of also adding a column for tensorizer to the new table?

@manish-sethi force-pushed the fastsafetensor_loader branch from e41f343 to af1fdac on March 18, 2025 03:36
@mergify (bot) commented Mar 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @manish-sethi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Mar 18, 2025
@manish-sethi force-pushed the fastsafetensor_loader branch from af1fdac to 8fadb0b on March 18, 2025 04:22
@mergify bot removed the needs-rebase label Mar 18, 2025
@manish-sethi marked this pull request as ready for review March 18, 2025 04:31
@manish-sethi (Contributor, Author)

> Thanks @manish-sethi, these updates look really good.
>
> Just one minor inline comment; also, I think you will need to add the dependency to this file for the tests.

Added the dependency in test.ini. Can it be assumed that the test machine will have GPUs?

> If you take the PR out of draft after the above changes, we can start to run the tests in the CI.

Took it out of draft.

> Also, this is mainly a curiosity, but for completeness: any chance of also adding a column for tensorizer to the new table?

I skipped tensorizer earlier because it was not an apples-to-apples comparison. Still, I gave it a try now, but the code seems to be broken; I ran into the following error when I used this example for serializing the model:

 File "/nvme/manish/repos/vllm/vllm/model_executor/model_loader/tensorizer.py", line 465, in tensorize_vllm_model
    engine.model_executor.collective_rpc(
    ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'LLMEngine' object has no attribute 'model_executor'

@manoelmarques (Contributor) commented Mar 18, 2025

In my tests integrating this PR into vLLM and running it, loading is much slower than with the safetensors loader when GDS is not installed.
Is there any testing for the case where GDS is not installed?
It would be good to expose the loader's nogds flag in a follow-up PR; the default is False:

loader = SafeTensorsFileLoader(pg, device, nogds=nogds)
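A follow-up PR could plumb that flag through along the lines of the sketch below; the environment variable and helper are hypothetical names invented for illustration, not part of this PR:

```python
import os

from fastsafetensors import SafeTensorsFileLoader

def make_loader(pg, device):
    # Hypothetical opt-out switch: the environment variable name is made up
    # for this sketch; fastsafetensors itself only takes the nogds argument.
    nogds = os.getenv("FASTSAFETENSORS_NOGDS", "0") == "1"
    return SafeTensorsFileLoader(pg, device, nogds=nogds)
```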

@njhill (Member) left a comment

Thanks @manish-sethi for the great work!

WDYT about @manoelmarques's comment? Would it be worth incorporating a check for GDS?

@njhill changed the title from "[Core] Integrate Fastsafetensor loader for loading model weights" to "[Core] Integrate fastsafetensors loader for loading model weights" Mar 19, 2025
@njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 19, 2025
@manish-sethi (Contributor, Author)

> Thanks @manish-sethi for the great work!
>
> WDYT about @manoelmarques's comment? Would it be worth incorporating a check for GDS?

I am not sure there is a straightforward way to test that programmatically. A better option would be to open an issue against the fastsafetensors repo so that this can be exposed as an API. It may require some investigation, and the fastsafetensors repo is a better place than the vllm repo to maintain that code, for reusability and maintenance reasons.
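For what it's worth, a stopgap heuristic could look like the sketch below; the probed paths are assumptions about how the cuFile library and nvidia-fs kernel module typically appear on a system, not a documented interface:

```python
import ctypes.util
import os

def gds_likely_available() -> bool:
    # Heuristic only: GDS needs the cuFile userspace library plus the
    # nvidia-fs kernel module. Both probes below are assumptions about a
    # typical installation, not a guaranteed API.
    has_cufile = ctypes.util.find_library("cufile") is not None
    has_nvidia_fs = os.path.exists("/proc/driver/nvidia-fs")
    return has_cufile and has_nvidia_fs
```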

@njhill merged commit 761702f into vllm-project:main Mar 24, 2025
67 checks passed
erictang000 pushed a commit to erictang000/vllm that referenced this pull request Mar 25, 2025
wrmedford pushed a commit to wrmedford/vllm that referenced this pull request Mar 26, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@zhewenl (Collaborator) commented Oct 28, 2025

@manish-sethi / @njhill This PR's tests need to be skipped on AMD; fixing in #27612.
