
[Usage]: Behavior with LoRA Ranks dynamic loading #8559

@zhao-lun

Description


Your current environment

The output of `python collect_env.py`

How would you like to use vllm

Hi, I’ve encountered a couple of issues while trying the new dynamic LoRA loading feature, and I’m hoping to get clarification or assistance.

vLLM container: vllm/vllm-openai:latest
LoRA rank 8 adapter: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1726391523/blob/main/adapter_config.json
LoRA rank 16 adapter: https://huggingface.co/Akchacha/meta-llama-Meta-Llama-3-8B-Instruct-1725954636/blob/main/adapter_config.json
Server launch command:

python3 -m vllm.entrypoints.openai.api_server --port 8080 \
    --model /mnt/inference/models/Meta-Llama-3-8B-Instruct \
    --served-model-name base-model --enable-lora --max-lora-rank=64 --max-loras=60

The first inference request against a newly loaded adapter takes too long, and the delay grows with LoRA rank:

  1. Loading/unloading LoRA adapters works fine.
curl -X POST http://localhost:8080/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{"lora_name": "lora8", "lora_path": "/mnt/test/test-lora8"}'
  2. The first forward pass takes too much time.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'

When running the rank-8 LoRA module (first pass), the request completes quickly (under 5 seconds).
However, the first pass with the rank-16 LoRA module is significantly slower, taking around 3 minutes to complete.

Example log

INFO 09-17 23:09:25 logger.py:36] Received request chat-07fb51bb258443939f26c3a8bc0b22a1: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 882, 128007, 271, 15339, 128009, 128006, 78191, 128007, 271], lora_request: LoRARequest(lora_name='lora64', lora_int_id=4, lora_path='/mnt/pvc/samples/lora64', lora_local_path=None, long_lora_max_len=None), prompt_adapter_request: None.
INFO 09-17 23:09:25 async_llm_engine.py:201] Added request chat-07fb51bb258443939f26c3a8bc0b22a1.
DEBUG 09-17 23:09:25 async_llm_engine.py:716] Got new requests!
INFO 09-17 23:09:30 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:44 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:09:56 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:10 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:22 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:35 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:46 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:10:59 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:11 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 09-17 23:11:15 models.py:634] Adding lora. Model id: 4, int id: 4, scaling factor: None
DEBUG 09-17 23:11:15 models.py:370] Activating LoRA. int id: 4, slot index: 3
INFO 09-17 23:11:16 metrics.py:351] Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-17 23:11:17 async_llm_engine.py:169] Finished request chat-07fb51bb258443939f26c3a8bc0b22a1.
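
For reference, this is roughly how I timed the first request against a freshly loaded adapter (a minimal sketch using the requests library; it assumes the server from the launch command above and the lora8 adapter path shown earlier):

import time
import requests

BASE_URL = "http://localhost:8080"

def first_request_latency(model_name: str) -> float:
    # Time a single chat completion against the given served model / adapter name.
    payload = {
        "model": model_name,
        "messages": [
            {"role": "user", "content": "Write a short story about a magical forest."}
        ],
        "max_tokens": 100,
    }
    start = time.perf_counter()
    resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload)
    resp.raise_for_status()
    return time.perf_counter() - start

# Load the adapter first, then measure the very first inference pass.
requests.post(f"{BASE_URL}/v1/load_lora_adapter",
              json={"lora_name": "lora8", "lora_path": "/mnt/test/test-lora8"})
print("lora8 first pass:", round(first_request_latency("lora8"), 1), "s")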

Inability to run a LoRA module (first pass) and the base model simultaneously:

  1. First query, running the LoRA module (first pass).
# Assume the LoRA has already been loaded through the /v1/load_lora_adapter endpoint; this is the first inference pass.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lora8",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
  2. At the same time, try to run the base model.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "base-model",
    "messages": [
      {
        "role": "user",
        "content": "Write a short story about a magical forest."
      }
    ],
    "max_tokens": 100
  }'
  3. The base-model request does not execute until request 1 has finished. The same thing happens when several loaded LoRA modules are on their first pass. A reproduction sketch is shown below.
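
The easiest way I found to reproduce the blocking (same server and adapters as above; a minimal sketch) is to fire both requests concurrently and watch when each one returns:

import threading
import time
import requests

BASE_URL = "http://localhost:8080"

def send(model_name: str) -> None:
    # Send one chat completion and report how long it took end to end.
    payload = {
        "model": model_name,
        "messages": [
            {"role": "user", "content": "Write a short story about a magical forest."}
        ],
        "max_tokens": 100,
    }
    start = time.perf_counter()
    requests.post(f"{BASE_URL}/v1/chat/completions", json=payload)
    print(f"{model_name} finished after {time.perf_counter() - start:.1f} s")

# Kick off the first pass of the freshly loaded LoRA, then immediately query the base model.
lora_thread = threading.Thread(target=send, args=("lora8",))
base_thread = threading.Thread(target=send, args=("base-model",))
lora_thread.start()
time.sleep(1)  # give the LoRA request a small head start
base_thread.start()
lora_thread.join()
base_thread.join()

In my runs the base-model request only returns after the LoRA first pass has finished.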

Some findings

I noticed the slowdown occurs in two places:

# Loop that calls optimize() on every LoRA layer of the newly loaded adapter
for lora in loras.values():
    lora.optimize()

# Moving lora_b to the target device/dtype, padding embeddings, and pinning memory
loras[module_name].lora_b = tensor.to(device=device, dtype=dtype).t()
assert embedding_padding_modules is not None
if any(name in module_name
       for name in embedding_padding_modules
       ) and target_embedding_padding is not None:
    lora_b = loras[module_name].lora_b
    assert target_embedding_padding >= lora_b.shape[1]
    addition = target_embedding_padding - lora_b.shape[1]
    loras[module_name].lora_b = torch.nn.functional.pad(
        lora_b, (0, addition))
if pin_memory:
    loras[module_name].lora_b = loras[
        module_name].lora_b.pin_memory()
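
To get a feel for how these steps scale with rank, I used a small stand-alone timing sketch that only mimics the tensor shapes involved (hidden size 4096 is an assumption for Llama-3-8B; the ranks are the ones I tested). It does not call into vLLM itself:

import time
import torch

HIDDEN_SIZE = 4096  # assumed hidden size of Llama-3-8B
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for rank in (8, 16, 64):
    # Mimic a single lora_b tensor as it arrives from the checkpoint (CPU, fp16).
    tensor = torch.randn(rank, HIDDEN_SIZE, dtype=torch.float16)

    start = time.perf_counter()
    lora_b = tensor.to(device=device, dtype=torch.float16).t()
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure the async copy has finished before timing
    copy_ms = (time.perf_counter() - start) * 1e3
    print(f"rank={rank}: to(device).t() took {copy_ms:.2f} ms")

    if torch.cuda.is_available():
        start = time.perf_counter()
        tensor = tensor.pin_memory()  # pinning operates on the CPU copy
        pin_ms = (time.perf_counter() - start) * 1e3
        print(f"rank={rank}: pin_memory took {pin_ms:.2f} ms")

A single tensor like this is tiny, so the per-tensor cost is small; the open question is whether repeating this for every target module of every adapter accounts for the multi-minute first pass.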

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
