Eval bug: server API endpoint not respecting n_predict with -2 (until context filled) #12264

@henryclw

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4844 (d76a86d)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3090

Models

  • DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf
  • QwQ-32B-Q4_K_M.gguf

Problem description & steps to reproduce

Run the llama.cpp server, then send a chat completion request with n_predict = -2 (generate until the context is filled):
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer no-key' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": "Vancouver is a city located on the northwestern coast of Canada. It is the largest city in the province of British Columbia, flanked by the Pacific Ocean to the west and the Coast Mountains to the east. What else are special about Vancouver?"
        }
    ],
    "n_predict": -2
}'
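
For reference, the documented semantics of n_predict are: a positive value caps the number of generated tokens, -1 means no limit, and -2 means generate until the context is filled. The sketch below shows how the request value would map onto an effective token budget; the function and parameter names (effective_budget, n_ctx, n_prompt_tokens) are illustrative assumptions, not the actual server code.

    #include <climits>

    // Sketch only: resolve the client-supplied n_predict into a concrete
    // token budget, following the documented semantics of the parameter.
    //   n_predict  > 0  -> hard cap on generated tokens
    //   n_predict == -1 -> no cap (stop on EOS or another stop condition)
    //   n_predict == -2 -> fill the remaining context
    int effective_budget(int n_predict, int n_ctx, int n_prompt_tokens) {
        if (n_predict == -2) {
            return n_ctx - n_prompt_tokens;  // "until context filled"
        }
        if (n_predict < 0) {
            return INT_MAX;                  // -1: effectively unlimited
        }
        return n_predict;                    // positive: explicit cap
    }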

Expected behavior: keep sampling until the context is full.
Actual behavior: generation stops after a single token with finish_reason "length":

{
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "<think>"
            }
        }
    ],
    "created": 1741406445,
    "model": "gpt-3.5-turbo",
    "system_fingerprint": "b4844-d76a86d9",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 1,
        "prompt_tokens": 53,
        "total_tokens": 54
    },
    "id": "chatcmpl-bMQP1cnlEjjlHUSYjIwRiQZayJeHSQ1t",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 203.39,
        "prompt_per_token_ms": 203.39,
        "prompt_per_second": 4.916662569447859,
        "predicted_n": 1,
        "predicted_ms": 0.034,
        "predicted_per_token_ms": 0.034,
        "predicted_per_second": 29411.76470588235
    }
}

I confirmed that the same request with n_predict = 32 correctly limits the answer to 32 tokens.
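
A guess at the mechanism, based on the response above: predicted_n is 1 and finish_reason is "length", which is what you would expect if the raw value -2 is compared directly against the number of decoded tokens, since n_decoded >= -2 is already true after the first token. The sketch below shows the kind of guard that would be needed; the names (hit_length_limit, n_decoded, n_past) are assumptions for illustration, not the actual server.cpp logic.

    // Sketch only: a length check that resolves the negative sentinels
    // before comparing, instead of using the raw request value.
    bool hit_length_limit(int n_decoded, int n_predict, int n_ctx, int n_past) {
        if (n_predict == -2) {
            return n_past >= n_ctx;      // stop only once the context is full
        }
        if (n_predict < 0) {
            return false;                // -1: no length limit
        }
        return n_decoded >= n_predict;   // positive: explicit cap
    }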

First Bad Commit

No response

Relevant log output

main: server is listening on http://0.0.0.0:80 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 53
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 53, n_tokens = 53, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 53, n_tokens = 53
slot      release: id  0 | task 0 | stop processing: n_past = 53, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     971.71 ms /    53 tokens (   18.33 ms per token,    54.54 tokens per second)
       eval time =       0.04 ms /     1 tokens (    0.04 ms per token, 25000.00 tokens per second)
      total time =     971.75 ms /    54 tokens
srv  update_slots: all slots are idle

Labels

bug (Something isn't working), good first issue (Good for newcomers)
