Conversation

@manish-sethi (Contributor) commented Nov 26, 2024

This PR integrates the fastsafetensors library for loading model weights. fastsafetensors uses GPU Direct Storage (GDS) to load weights directly from storage into GPU memory, making it faster than reading parameters iteratively from file. See https://github.com/foundation-model-stack/fastsafetensors for more details.
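For context, standalone usage of fastsafetensors looks roughly like the sketch below, adapted from the project README; the specific names (`SingleGroup`, `add_filenames`, `copy_files_to_device`, `get_tensor`) are taken from that documentation rather than from this PR:

```python
# Minimal standalone sketch of fastsafetensors usage, adapted from its
# README; the exact API names here are assumptions from that documentation.
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

device = torch.device("cuda:0")
loader = SafeTensorsFileLoader(SingleGroup(), device)
# Map each rank to the files it should read; single-process here, so rank 0.
loader.add_filenames({0: ["model-00001-of-00002.safetensors",
                          "model-00002-of-00002.safetensors"]})
try:
    # Copies file contents straight into GPU memory (via GDS when available)
    # and returns a buffer that hands out torch tensors by name.
    fb = loader.copy_files_to_device()
    weight = fb.get_tensor("model.embed_tokens.weight")
    print(weight.shape, weight.device)
finally:
    loader.close()
```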

Below are experimental results for the time taken to load weights from NVMe using the existing safetensors loader versus the fastsafetensors loader.

| Model | Tensor parallel | GPU | Time with safetensors loader (sec) | Time with fastsafetensors loader (sec) |
|---|---|---|---|---|
| Llama-2-7b-hf | 1 | A100 | 7.29 | 3.67 |
| Llama-2-13b-hf | 1 | A100 | 16.04 | 6.88 |
| Llama-2-13b-hf | 2 | A100 | 14.35 | 6.02 |
| Llama-2-13b-hf | 4 | L40S | 12.39 | 4.74 |
  • The model files are loaded from storage, i.e., not present in the filesystem cache
  • The time measures only loading the model weights within the vLLM model-initialization path

Additional performance numbers comparing three different loaders:

| Tensor parallel | safetensors loader (sec) | runai_streamer loader (sec) | fastsafetensors loader (sec) |
|---|---|---|---|
| 1 | 50.05 | 36.63 | 42.40 |
| 2 | 55.98 | 45.86 | 42.19 |
| 4 | 60.08 | 59.98 | 53.50 |
  • The above numbers are for Llama-2-13b-hf on L40S GPUs
  • The timing is end to end, from loading and initializing the model through generation of the first token, in eager mode
  • As before, the model files are loaded from storage, i.e., not present in the filesystem cache
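For reference, selecting the new loader from vLLM should look roughly like the sketch below; the `load_format` value is an assumption about how this PR exposes the loader, so check the merged docs for the final spelling:

```python
from vllm import LLM

# Hypothetical usage sketch: the load_format value "fastsafetensors" is an
# assumption about how this PR exposes the new loader; the merged docs are
# the authoritative reference.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    load_format="fastsafetensors",
    tensor_parallel_size=2,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```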

@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify bot added the documentation (Improvements or additions to documentation) and ci/build labels Nov 26, 2024
@mergify (bot) commented Jan 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @manish-sethi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@manoelmarques (Contributor)

Hi @manish-sethi,

This PR is very useful; thank you for creating it. Do you intend to finish it? It seems to me it is basically done, right? I have used it in some tests.
Also, there are no reviewers added; adding reviewers would speed up the process.

@manish-sethi force-pushed the fastsafetensor_loader branch from 3c1f242 to b4adb91 on March 3, 2025 00:48
@mergify bot removed the needs-rebase label Mar 3, 2025
@manish-sethi force-pushed the fastsafetensor_loader branch 6 times, most recently from f68ae35 to 664a5ab, on March 3, 2025 06:02
@njhill (Member) left a comment

Thanks @manish-sethi, this looks great!

It would be awesome to see how this compares to tensorizer and runai-model-streamer performance-wise. Maybe you could add a couple more columns to the table?

For measuring the performance (of all of these) it would probably be best to have a test that loads the model and then immediately performs a generate request for a single token (i.e., time to first token), to make sure there isn't "hidden" latent loading time (while of course also ensuring the filesystem cache is cold, as usual).
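Concretely, a cold-cache, time-to-first-token measurement could look like the rough sketch below (dropping the page cache requires root, and the model path is a placeholder):

```python
import subprocess
import time

from vllm import LLM, SamplingParams

# Drop the page cache so the weights really come from storage (needs root).
subprocess.run(["sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"],
               check=True)

start = time.perf_counter()
llm = LLM(model="/path/to/Llama-2-13b-hf", enforce_eager=True)
# Generating a single token flushes any deferred/lazy loading work.
llm.generate(["Hello"], SamplingParams(max_tokens=1))
print(f"load + first token: {time.perf_counter() - start:.2f}s")
```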

Could you also add a CI test? You can look at the existing tests for the other model loaders as an example.

@manish-sethi force-pushed the fastsafetensor_loader branch from 664a5ab to ca0f33c on March 15, 2025 00:34
@mergify (bot) commented Mar 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @manish-sethi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@njhill (Member) left a comment

Thanks @manish-sethi, these updates look really good.

Just one minor inline comment; also, I think you will need to add the dependency to this file for the tests.

If you take the PR out of draft after the above changes, we can start to run the tests in the CI.

Also, this is mainly a curiosity, but for completeness: any chance of also adding a column for tensorizer to the new table?

@manish-sethi force-pushed the fastsafetensor_loader branch from e41f343 to af1fdac on March 18, 2025 03:36
@mergify (bot) commented Mar 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @manish-sethi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Mar 18, 2025
@manish-sethi force-pushed the fastsafetensor_loader branch from af1fdac to 8fadb0b on March 18, 2025 04:22
@mergify bot removed the needs-rebase label Mar 18, 2025
@manish-sethi marked this pull request as ready for review March 18, 2025 04:31
@manish-sethi (Contributor, Author)

> Thanks @manish-sethi, these updates look really good.
>
> Just one minor inline comment; also, I think you will need to add the dependency to this file for the tests.

Added the dependency in test.ini. Can it be assumed that the test machine will have GPUs?

> If you take the PR out of draft after the above changes, we can start to run the tests in the CI.

Took it out of draft.

> Also, this is mainly a curiosity, but for completeness: any chance of also adding a column for tensorizer to the new table?

I skipped tensorizer earlier because it was not an apples-to-apples comparison. Still, I gave it a try now, but the code seems to be broken; I ran into the following error when I used this example for serializing the model:

 File "/nvme/manish/repos/vllm/vllm/model_executor/model_loader/tensorizer.py", line 465, in tensorize_vllm_model
    engine.model_executor.collective_rpc(
    ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'LLMEngine' object has no attribute 'model_executor'

@manoelmarques (Contributor) commented Mar 18, 2025

In my tests integrating this PR into vLLM and running it, loading is much slower than with the safetensors loader when GDS is not installed.
Is there any testing for the case where GDS is not installed?
It would be good to expose the loader's nogds flag in a follow-up PR; the default is False:

loader = SafeTensorsFileLoader(pg, device, nogds=nogds)
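A follow-up PR could plumb that flag through along the lines of the sketch below; the environment variable and helper are hypothetical names invented for illustration, not part of this PR:

```python
import os

from fastsafetensors import SafeTensorsFileLoader

def make_loader(pg, device):
    # Hypothetical opt-out switch: the environment variable name is made up
    # for this sketch; fastsafetensors itself only takes the nogds argument.
    nogds = os.getenv("FASTSAFETENSORS_NOGDS", "0") == "1"
    return SafeTensorsFileLoader(pg, device, nogds=nogds)
```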

@njhill (Member) left a comment

Thanks @manish-sethi for the great work!

WDYT about @manoelmarques's comment? Would it be worth incorporating a check for GDS?

@njhill changed the title from "[Core] Integrate Fastsafetensor loader for loading model weights" to "[Core] Integrate fastsafetensors loader for loading model weights" Mar 19, 2025
@njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 19, 2025
@manish-sethi (Contributor, Author)

> Thanks @manish-sethi for the great work!
>
> WDYT about @manoelmarques's comment? Would it be worth incorporating a check for GDS?

I am not sure there is a straightforward way to test that programmatically. A better option would be to open an issue against the fastsafetensors repo so that this can be exposed as an API. It may require some investigation, and the fastsafetensors repo is a better place than the vllm repo to maintain that code, for reusability and maintenance reasons.
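For what it's worth, a stopgap heuristic could look like the sketch below; the probed paths are assumptions about how the cuFile library and nvidia-fs kernel module typically appear on a system, not a documented interface:

```python
import ctypes.util
import os

def gds_likely_available() -> bool:
    # Heuristic only: GDS needs the cuFile userspace library plus the
    # nvidia-fs kernel module. Both probes below are assumptions about a
    # typical installation, not a guaranteed API.
    has_cufile = ctypes.util.find_library("cufile") is not None
    has_nvidia_fs = os.path.exists("/proc/driver/nvidia-fs")
    return has_cufile and has_nvidia_fs
```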

@njhill merged commit 761702f into vllm-project:main Mar 24, 2025
67 checks passed
erictang000 pushed a commit to erictang000/vllm that referenced this pull request Mar 25, 2025
wrmedford pushed a commit to wrmedford/vllm that referenced this pull request Mar 26, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@zhewenl (Collaborator) commented Oct 28, 2025

@manish-sethi / @njhill This PR's tests need to be skipped on AMD; fixing in #27612.
