Your current environment
vllm 0.6.3
Model Input Dumps
The input is a long context with over 8k tokens.
🐛 Describe the bug
- vllm 0.6.2 does not have this bug.
- We are running vllm 0.6.3 with speculative decoding. When we input a long context (over 8k tokens) into the model, the output is truncated and the answers are incomplete. The command we are using is
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --speculative_model /home/downloaded_model/Llama-3.2-1B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5 --gpu_memory_utilization 0.95 --spec-decoding-acceptance-method typical_acceptance_sampler
- We then run vllm 0.6.3 without speculative decoding, but we still get incomplete or repeated answers. The command we use is
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --enable_chunked_prefill --disable-log-requests --seed 42 --gpu_memory_utilization 0.95
- We call the vLLM model through the OpenAI-compatible API as below
import openai

def call_vllm_api(message_log):
    # API_KEY and BASE_URL point at the vLLM OpenAI-compatible server started above
    vllm_client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)
    response = vllm_client.chat.completions.create(
        model="LLM",
        messages=message_log,
        max_tokens=4096,
        temperature=0.2,
        presence_penalty=0,
        frequency_penalty=0,
    )
    response_content = response.choices[0].message.content
    return response_content
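
To help narrow down whether the truncation comes from the max_tokens cap or from the model stopping early, here is a minimal diagnostic sketch. It assumes the same API_KEY / BASE_URL constants as the function above; the helper name call_vllm_api_debug and the printed fields are only illustrative, not part of the original report.

def call_vllm_api_debug(message_log):
    # Hypothetical helper, same setup as call_vllm_api above
    vllm_client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)
    response = vllm_client.chat.completions.create(
        model="LLM",
        messages=message_log,
        max_tokens=4096,
        temperature=0.2,
    )
    choice = response.choices[0]
    # finish_reason == "length" -> output hit the max_tokens=4096 limit;
    # finish_reason == "stop"   -> the model stopped on its own (early EOS),
    # which would point at the decoding path rather than the length cap.
    print(f"finish_reason={choice.finish_reason}, "
          f"completion_tokens={response.usage.completion_tokens}")
    return choice.message.content

If finish_reason is "stop" for the truncated answers, the cut-off is not caused by max_tokens, and comparing the same request against vllm 0.6.2 should help isolate the regression.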