[Feature]: log download time and model weights load time separately #12916

@MikeSpreitzer

🚀 The feature, motivation and pitch

As someone trying to understand the components of latency in vLLM, I would like vLLM's logging to distinguish between (a) the time to download a model from HuggingFace (or another source) into the local filesystem and (b) the time to load the model weights from the local filesystem. As I understand it, these two steps are done in series; is that right?

The following is an example of the logging that I got from release 0.7.2 in V1 mode; I do not see how to distinguish these two components of latency in it.

INFO 02-07 19:23:56 gpu_model_runner.py:867] Starting to load model ibm-granite/granite-3.0-3b-a800m-instruct...
INFO 02-07 19:23:56 cuda.py:158] Using Flash Attention backend on V1 engine.
WARNING 02-07 19:23:56 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 02-07 19:23:57 weight_utils.py:252] Using model weights format ['*.safetensors']
model-00002-of-00002.safetensors: 100%|█████████████████████████████████| 1.75G/1.75G [00:56<00:00, 30.8MB/s]
model-00001-of-00002.safetensors: 100%|█████████████████████████████████| 5.00G/5.00G [02:47<00:00, 29.8MB/s]
model.safetensors.index.json: 100%|█████████████████████████████████████| 25.6k/25.6k [00:00<00:00, 1.87MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 119.74it/s]

INFO 02-07 19:26:48 gpu_model_runner.py:872] Loading model weights took 6.1506 GB
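To illustrate what can and cannot be recovered from this log today, here is a minimal sketch that parses it with the Python standard library. It is not part of vLLM; the function name `summarize` and the regular expressions are my own assumptions, the progress bars in the sample are shortened copies of the ones above, and the parser assumes all timestamps fall on the same day.

```python
import re

# Abbreviated copy of the log above (progress bars shortened for readability).
LOG = """\
INFO 02-07 19:23:56 gpu_model_runner.py:867] Starting to load model ibm-granite/granite-3.0-3b-a800m-instruct...
model-00002-of-00002.safetensors: 100%|████| 1.75G/1.75G [00:56<00:00, 30.8MB/s]
model-00001-of-00002.safetensors: 100%|████| 5.00G/5.00G [02:47<00:00, 29.8MB/s]
model.safetensors.index.json: 100%|████| 25.6k/25.6k [00:00<00:00, 1.87MB/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 119.74it/s]
INFO 02-07 19:26:48 gpu_model_runner.py:872] Loading model weights took 6.1506 GB
"""

# Timestamped log line: level, MM-DD, then HH:MM:SS.
STAMP = re.compile(r"^(?:INFO|WARNING) \d\d-\d\d (\d\d):(\d\d):(\d\d) ")
# Finished download bar: "<file>: 100%|...| size/size [MM:SS<..."
BAR = re.compile(r"^(\S+): 100%\|.*\[(\d+):(\d\d)<")


def summarize(log):
    """Return (seconds between the start and end load messages, per-file bar seconds)."""
    start = end = None
    downloads = {}
    for line in log.splitlines():
        stamp = STAMP.match(line)
        if stamp:
            h, m, s = (int(g) for g in stamp.groups())
            seconds = h * 3600 + m * 60 + s  # assumes no midnight rollover
            if "Starting to load model" in line:
                start = seconds
            elif "Loading model weights took" in line:
                end = seconds
        bar = BAR.match(line)
        if bar:
            downloads[bar.group(1)] = int(bar.group(2)) * 60 + int(bar.group(3))
    return end - start, downloads


total, downloads = summarize(LOG)
print(f"start-to-finish: {total} s")
for name, secs in downloads.items():
    print(f"  {name}: {secs} s")
```

On the log above this reports 172 s from start to finish, and the longest download bar ran for 167 s. If the shards were downloaded concurrently (the log does not say), that leaves roughly 5 s for everything else, but this is guesswork from one-second-resolution progress bars, which is exactly why a dedicated log line would help.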

Alternatives

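I suppose that I could get nearly the same thing by running vLLM twice: once to download and load the weights, and once, with the model already in the local filesystem, to just load the weights. The difference between the two runs would approximate the download time. But that seems like a lot more trouble than just getting a useful log line.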
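For comparison with that workaround, here is a minimal sketch of the kind of separate timing I have in mind, done in a single run. Everything in it is hypothetical: `timed_phase`, `download_model`, and `load_weights` are stand-ins I made up for illustration, not vLLM functions, and the log message format is only a suggestion.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("load_timing")


@contextmanager
def timed_phase(name, timings):
    """Time the enclosed block, record it in `timings`, and log one line."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[name] = elapsed
        logger.info("%s took %.2f seconds", name, elapsed)


def download_model(repo_id):
    """Hypothetical stand-in for fetching files into the local filesystem."""
    time.sleep(0.02)  # placeholder for network transfer
    return f"/tmp/models/{repo_id}"


def load_weights(local_path):
    """Hypothetical stand-in for reading weights from the local filesystem."""
    time.sleep(0.01)  # placeholder for disk reads
    return {"path": local_path}


def timed_startup(repo_id):
    """Run the two phases in series and report each duration separately."""
    timings = {}
    with timed_phase("download", timings):
        path = download_model(repo_id)
    with timed_phase("load_weights", timings):
        model = load_weights(path)
    return model, timings


model, timings = timed_startup("ibm-granite/granite-3.0-3b-a800m-instruct")
```

The point of the sketch is only that the two phases get one log line each; where such timers would actually belong in vLLM's loader is for the maintainers to decide.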

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
