How would you like to use vllm
Hi, I’ve encountered a couple of issues while trying the new dynamic LoRA loading feature, and I’m hoping to get clarification or assistance.
vLLM container: vllm/vllm-openai:latest
LoRA rank 8 adapter: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1726391523/blob/main/adapter_config.json
LoRA rank 16 adapter: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1725954636/blob/main/adapter_config.json
Server launch command:
python3 -m vllm.entrypoints.openai.api_server --port 8080 \
    --model /mnt/inference/models/Meta-Llama-3-8B-Instruct \
    --served-model-name base-model --enable-lora --max-lora-rank=64 --max-loras=60
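As a quick sanity check I list what the server is actually serving before sending any completions. The sketch below is mine (not part of the original report); it assumes the server launched above is reachable on localhost:8080 and that dynamically registered adapters show up in the OpenAI-compatible /v1/models listing.
# Sketch: list the models the running server exposes (base model plus any LoRA
# adapters that have been registered via /v1/load_lora_adapter).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    payload = json.load(resp)

for model in payload.get("data", []):
    print(model.get("id"))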
First inference request takes too long, and the delay differs between LoRA ranks:
- Loading/unloading LoRA adapters works fine.
curl -X POST http://localhost:8080/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{"lora_name": "lora8", "lora_path": "/mnt/test/test-lora8"}'
- The first forward pass takes too much time.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
The first pass with the rank-8 LoRA module completes quickly (under 5 seconds).
However, the first pass with the rank-16 LoRA module is significantly slower, taking around 3 minutes to complete (a timing sketch follows the example log below).
Example log
INFO 09-17 23:09:25 logger.py:36] Received request chat-07fb51bb258443939f26c3a8bc0b22a1: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 882, 128007, 271, 15339, 128009, 128006, 78191, 128007, 271], lora_request: LoRARequest(lora_name='lora64', lora_int_id=4, lora_path='/mnt/pvc/samples/lora64', lora_local_path=None, long_lora_max_len=None), prompt_adapter_request: None.
INFO 09-17 23:09:25 async_llm_engine.py:201] Added request chat-07fb51bb258443939f26c3a8bc0b22a1.
DEBUG 09-17 23:09:25 async_llm_engine.py:716] Got new requests!
INFO 09-17 23:09:30 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:44 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:56 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:10 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:22 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:35 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:46 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:59 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:11 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 09-17 23:11:15 models.py:634] Adding lora. Model id: 4, int id: 4, scaling factor: None
DEBUG 09-17 23:11:15 models.py:370] Activating LoRA. int id: 4, slot index: 3
INFO 09-17 23:11:16 metrics.py:351] Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:17 async_llm_engine.py:169] Finished request chat-07fb51bb258443939f26c3a8bc0b22a1.
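To make the rank-8 vs. rank-16 comparison easier to reproduce, here is a minimal timing sketch of my own (not from vLLM). The "lora16" name and its path are illustrative assumptions; the sketch also assumes the runtime /v1/load_lora_adapter endpoint is enabled on the server at localhost:8080.
# Sketch: time the first chat completion after dynamically loading each adapter.
# "lora16" and its path are assumptions; adjust to the adapters actually on disk.
import json
import time
import urllib.request

BASE = "http://localhost:8080"

def post(path, body):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # body is not needed, we only care about latency

for name, path in [("lora8", "/mnt/test/test-lora8"),
                   ("lora16", "/mnt/test/test-lora16")]:
    post("/v1/load_lora_adapter", {"lora_name": name, "lora_path": path})
    start = time.monotonic()
    post("/v1/chat/completions", {
        "model": name,
        "messages": [{"role": "user", "content": "hello"}],
        "max_tokens": 16,
    })
    print(f"{name}: first request took {time.monotonic() - start:.1f}s")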
Inability to run a LoRA module (first pass) and the base model simultaneously:
- First query: run the LoRA adapter (first pass).
## assume the LoRA adapter has already been loaded through the /v1/load_lora_adapter endpoint; now run the first inference pass
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
- At the same time, try to run the base model.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "base-model",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
- The base-model request does not execute until the first (LoRA) request has finished. The same thing happens with multiple first passes of newly loaded LoRA modules; see the reproduction sketch below.
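To reproduce the blocking behaviour in one go, the following sketch (mine, not from the report) fires the first-pass LoRA request and the base-model request almost simultaneously and prints when each returns. Model names match the curl examples above, and the server is assumed to be on localhost:8080 with "lora8" already loaded but not yet used.
# Sketch: send a first-pass LoRA request and a base-model request concurrently
# and observe that the base-model request only returns after the LoRA one.
import json
import threading
import time
import urllib.request

BASE = "http://localhost:8080"
START = time.monotonic()

def chat(model):
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "Write a short story about a magical forest."}],
        "max_tokens": 100,
    }
    req = urllib.request.Request(
        BASE + "/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    print(f"{model} finished after {time.monotonic() - START:.1f}s")

threads = [threading.Thread(target=chat, args=(m,)) for m in ("lora8", "base-model")]
for t in threads:
    t.start()
    time.sleep(0.5)  # make sure the LoRA request arrives first
for t in threads:
    t.join()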
Some findings
I noticed the slowdown occurs in two places:
# First hotspot: optimizing every loaded LoRA layer.
for lora in loras.values():
    lora.optimize()

# Second hotspot: moving lora_b to the device, padding it for embedding
# modules, and pinning it in host memory while building the LoRA model.
loras[module_name].lora_b = tensor.to(device=device,
                                      dtype=dtype).t()
assert embedding_padding_modules is not None
if any(name in module_name
       for name in embedding_padding_modules
       ) and target_embedding_padding is not None:
    lora_b = loras[module_name].lora_b
    assert target_embedding_padding >= lora_b.shape[1]
    addition = target_embedding_padding - lora_b.shape[1]
    loras[module_name].lora_b = torch.nn.functional.pad(
        lora_b, (0, addition))
if pin_memory:
    loras[module_name].lora_b = loras[
        module_name].lora_b.pin_memory()
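To narrow down which of these steps scales badly with rank, an isolated micro-benchmark along the following lines might help. It is only a sketch under assumptions: the tensor shape (4096 output features, fp16) and the padding amount are illustrative rather than vLLM internals, it does not cover lora.optimize(), and it assumes a CUDA-capable machine.
# Sketch: time the .to("cuda"), pad, and pin_memory() steps from the snippet
# above in isolation, for lora_b tensors of increasing rank.
# Shapes and padding are illustrative assumptions, not vLLM internals.
import time
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA-capable machine"

def bench(rank, out_features=4096, repeats=10):
    for step in ("to_device", "pad", "pin_memory"):
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            lora_b = torch.randn(out_features, rank, dtype=torch.float16)
            if step == "to_device":
                lora_b = lora_b.to(device="cuda", dtype=torch.float16).t()
            elif step == "pad":
                lora_b = torch.nn.functional.pad(lora_b, (0, 8))
            else:  # pin_memory
                lora_b = lora_b.pin_memory()
        torch.cuda.synchronize()
        print(f"rank={rank:<3} {step:<11} "
              f"{(time.perf_counter() - start) / repeats * 1e3:.2f} ms")

for rank in (8, 16, 64):
    bench(rank)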