Your current environment
vllm 0.6.3
Model Input Dumps
The input is a long context with over 8k tokens.
🐛 Describe the bug
- vllm 0.6.2 does not have this bug.
- We are running vllm 0.6.3 with speculative decoding. When we input a long context (over 8k tokens) into the model, the output is truncated and the answers are incomplete. The command we are using is
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --speculative_model /home/downloaded_model/Llama-3.2-1B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5 --gpu_memory_utilization 0.95 --spec-decoding-acceptance-method typical_acceptance_sampler
- We then run vllm 0.6.3 without speculative decoding, but we still get incomplete or repeated answers. The command we use is
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --enable_chunked_prefill --disable-log-requests --seed 42 --gpu_memory_utilization 0.95
- We call the vLLM model through the OpenAI-compatible API as below
import openai

def call_vllm_api(message_log):
    # API_KEY and BASE_URL point at the vLLM OpenAI-compatible server started above
    vllm_client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)
    response = vllm_client.chat.completions.create(
        model="LLM",
        messages=message_log,
        max_tokens=4096,
        temperature=0.2,
        presence_penalty=0,
        frequency_penalty=0,
    )
    response_content = response.choices[0].message.content
    return response_content
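
To help narrow down whether the truncation comes from the max_tokens cap or from the model stopping early, here is a minimal diagnostic sketch. It assumes the same API_KEY / BASE_URL constants as the function above; the helper name call_vllm_api_debug and the printed fields are only illustrative, not part of the original report.

def call_vllm_api_debug(message_log):
    # Hypothetical helper, same setup as call_vllm_api above
    vllm_client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)
    response = vllm_client.chat.completions.create(
        model="LLM",
        messages=message_log,
        max_tokens=4096,
        temperature=0.2,
    )
    choice = response.choices[0]
    # finish_reason == "length" -> output hit the max_tokens=4096 limit;
    # finish_reason == "stop"   -> the model stopped on its own (early EOS),
    # which would point at the decoding path rather than the length cap.
    print(f"finish_reason={choice.finish_reason}, "
          f"completion_tokens={response.usage.completion_tokens}")
    return choice.message.content

If finish_reason is "stop" for the truncated answers, the cut-off is not caused by max_tokens, and comparing the same request against vllm 0.6.2 should help isolate the regression.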