Eval bug: server API endpoint not respecting n_predict with -2 (until context filled) #12264

@henryclw

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4844 (d76a86d)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3090

Models

  • DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf
  • QwQ-32B-Q4_K_M.gguf

Problem description & steps to reproduce

Run the llama.cpp server, then send a chat completion request with n_predict = -2 (generate until the context is filled):
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer no-key' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": "Vancouver is a city located on the northwestern coast of Canada. It is the largest city in the province of British Columbia, flanked by the Pacific Ocean to the west and the Coast Mountains to the east. What else are special about Vancouver?"
        }
    ],
    "n_predict": -2
}'
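
For reference, the documented semantics of n_predict are: a positive value caps the number of generated tokens, -1 means no limit, and -2 means generate until the context is filled. The sketch below shows how the request value would map onto an effective token budget; the function and parameter names (effective_budget, n_ctx, n_prompt_tokens) are illustrative assumptions, not the actual server code.

    #include <climits>

    // Sketch only: resolve the client-supplied n_predict into a concrete
    // token budget, following the documented semantics of the parameter.
    //   n_predict  > 0  -> hard cap on generated tokens
    //   n_predict == -1 -> no cap (stop on EOS or another stop condition)
    //   n_predict == -2 -> fill the remaining context
    int effective_budget(int n_predict, int n_ctx, int n_prompt_tokens) {
        if (n_predict == -2) {
            return n_ctx - n_prompt_tokens;  // "until context filled"
        }
        if (n_predict < 0) {
            return INT_MAX;                  // -1: effectively unlimited
        }
        return n_predict;                    // positive: explicit cap
    }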

Expected behavior: keep sampling until the context is full.
Actual behavior: generation stops after a single token with finish_reason "length":

{
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "<think>"
            }
        }
    ],
    "created": 1741406445,
    "model": "gpt-3.5-turbo",
    "system_fingerprint": "b4844-d76a86d9",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 1,
        "prompt_tokens": 53,
        "total_tokens": 54
    },
    "id": "chatcmpl-bMQP1cnlEjjlHUSYjIwRiQZayJeHSQ1t",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 203.39,
        "prompt_per_token_ms": 203.39,
        "prompt_per_second": 4.916662569447859,
        "predicted_n": 1,
        "predicted_ms": 0.034,
        "predicted_per_token_ms": 0.034,
        "predicted_per_second": 29411.76470588235
    }
}

I confirmed that the same request with n_predict = 32 correctly limits the answer to 32 tokens.
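
A guess at the mechanism, based on the response above: predicted_n is 1 and finish_reason is "length", which is what you would expect if the raw value -2 is compared directly against the number of decoded tokens, since n_decoded >= -2 is already true after the first token. The sketch below shows the kind of guard that would be needed; the names (hit_length_limit, n_decoded, n_past) are assumptions for illustration, not the actual server.cpp logic.

    // Sketch only: a length check that resolves the negative sentinels
    // before comparing, instead of using the raw request value.
    bool hit_length_limit(int n_decoded, int n_predict, int n_ctx, int n_past) {
        if (n_predict == -2) {
            return n_past >= n_ctx;      // stop only once the context is full
        }
        if (n_predict < 0) {
            return false;                // -1: no length limit
        }
        return n_decoded >= n_predict;   // positive: explicit cap
    }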

First Bad Commit

No response

Relevant log output

main: server is listening on http://0.0.0.0:80 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 53
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 53, n_tokens = 53, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 53, n_tokens = 53
slot      release: id  0 | task 0 | stop processing: n_past = 53, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     971.71 ms /    53 tokens (   18.33 ms per token,    54.54 tokens per second)
       eval time =       0.04 ms /     1 tokens (    0.04 ms per token, 25000.00 tokens per second)
      total time =     971.75 ms /    54 tokens
srv  update_slots: all slots are idle

Labels

bug (Something isn't working), good first issue (Good for newcomers)
