Description
🚀 The feature, motivation and pitch
As someone trying to understand the components of latency in vLLM, I would like the logging from vLLM to distinguish between (a) the time to download a model from HuggingFace (or wherever) into the local filesystem and (b) the time to load the model weights from the local filesystem. These are two steps done in series, right?
Below is an example of the logging I got from release 0.7.2 in V1 mode; I do not see how to distinguish these two components of latency from it.
```
INFO 02-07 19:23:56 gpu_model_runner.py:867] Starting to load model ibm-granite/granite-3.0-3b-a800m-instruct...
INFO 02-07 19:23:56 cuda.py:158] Using Flash Attention backend on V1 engine.
WARNING 02-07 19:23:56 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 02-07 19:23:57 weight_utils.py:252] Using model weights format ['*.safetensors']
model-00002-of-00002.safetensors: 100%|█████████████████████████████████| 1.75G/1.75G [00:56<00:00, 30.8MB/s]
model-00001-of-00002.safetensors: 100%|█████████████████████████████████| 5.00G/5.00G [02:47<00:00, 29.8MB/s]
model.safetensors.index.json: 100%|█████████████████████████████████████| 25.6k/25.6k [00:00<00:00, 1.87MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 119.74it/s]
INFO 02-07 19:26:48 gpu_model_runner.py:872] Loading model weights took 6.1506 GB
```
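In the meantime, a rough way to separate the two phases myself is to pre-fetch the checkpoint with huggingface_hub and time each step. A minimal sketch, assuming huggingface_hub and vllm are installed; note that timing `LLM()` covers the whole engine start, not only the weight load, so the measurement of (b) here is only an upper bound:

```python
import time

from huggingface_hub import snapshot_download
from vllm import LLM

model_id = "ibm-granite/granite-3.0-3b-a800m-instruct"

# (a) Download from the Hub into the local cache (a no-op if already cached).
t0 = time.perf_counter()
local_path = snapshot_download(repo_id=model_id)
print(f"download: {time.perf_counter() - t0:.1f}s -> {local_path}")

# (b) Start the engine from the local copy. This times engine startup as a
# whole, which includes but is not limited to loading the weights.
t1 = time.perf_counter()
llm = LLM(model=local_path)
print(f"load (upper bound): {time.perf_counter() - t1:.1f}s")
```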
Alternatives
I suppose that I could get nearly the same thing by running vLLM twice: once on a cold cache, which downloads the model and then loads the weights, and once on a warm cache, which only loads the weights; the difference between the two runs approximates the download time (see the sketch below). But that seems like a lot more trouble than just getting a useful log line.
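For completeness, a minimal sketch of that two-run workaround; run the same script twice, assuming the Hugging Face cache is cold on the first run and warm on the second, and that the rest of the engine startup cost is identical across runs:

```python
import time

from vllm import LLM

model_id = "ibm-granite/granite-3.0-3b-a800m-instruct"

# First run (cold cache): download + load. Second run (warm cache): load only.
# The difference between the two wall-clock times approximates the download.
start = time.perf_counter()
llm = LLM(model=model_id)
print(f"engine start: {time.perf_counter() - start:.1f}s")
```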
Additional context
No response