Description
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4844 (d76a86d)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 3090
Models
- DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf
- QwQ-32B-Q4_K_M.gguf
Problem description & steps to reproduce
Run the llama.cpp server, then send a chat completion request with n_predict = -2 (generate until the context is filled):
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer no-key' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": "Vancouver is a city located on the northwestern coast of Canada. It is the largest city in the province of British Columbia, flanked by the Pacific Ocean to the west and the Coast Mountains to the east. What else are special about Vancouver?"
        }
    ],
    "n_predict": -2
}'
Expected behavior: keep sampling until the context window is full.
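For reference, given the documented n_predict semantics (-1 = infinite, -2 = until context filled), the completion budget here should be roughly the context size minus the prompt length. A minimal sketch using the n_ctx_slot and n_prompt_tokens values from the log below (the exact server-side accounting may differ):

// A minimal sketch of the token budget that n_predict = -2 ("until context
// filled") should imply. Constants come from the server log below; the
// actual server-side accounting may differ.
#include <cstdio>

int main() {
    const int n_ctx_slot      = 32768; // context size of the slot (from the log)
    const int n_prompt_tokens = 53;    // prompt length (from the log)

    // Everything left in the context window after the prompt should be
    // available for the completion.
    const int expected_budget = n_ctx_slot - n_prompt_tokens;
    printf("expected completion budget: %d tokens\n", expected_budget); // 32715

    return 0;
}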
Actual behavior: generation stops after a single token with finish_reason = "length":
{
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "<think>"
            }
        }
    ],
    "created": 1741406445,
    "model": "gpt-3.5-turbo",
    "system_fingerprint": "b4844-d76a86d9",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 1,
        "prompt_tokens": 53,
        "total_tokens": 54
    },
    "id": "chatcmpl-bMQP1cnlEjjlHUSYjIwRiQZayJeHSQ1t",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 203.39,
        "prompt_per_token_ms": 203.39,
        "prompt_per_second": 4.916662569447859,
        "predicted_n": 1,
        "predicted_ms": 0.034,
        "predicted_per_token_ms": 0.034,
        "predicted_per_second": 29411.76470588235
    }
}

For comparison, it's confirmed that the same request with n_predict = 32 correctly limits the answer to 32 tokens.
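A plausible explanation, shown as a hypothetical sketch (this is not the actual llama.cpp server code), is a length-stop check that compares n_decoded against n_predict without first verifying that n_predict is a positive cap, so any negative sentinel value trips the limit on the very first token:

// Hypothetical illustration of the suspected off-by-sign stop check; NOT the
// actual llama.cpp source, just a sketch of behavior that matches the
// observed output (1 token, finish_reason = "length").
#include <cstdio>

// Buggy variant: a negative n_predict (-1 = infinite, -2 = until context
// filled) compares as "already over the limit" after the first token.
bool stop_for_length_buggy(int n_decoded, int n_predict) {
    return n_decoded >= n_predict; // 1 >= -2 is true -> immediate "length" stop
}

// Guarded variant: only enforce the cap when n_predict is an actual positive limit.
bool stop_for_length_guarded(int n_decoded, int n_predict) {
    return n_predict > 0 && n_decoded >= n_predict;
}

int main() {
    printf("buggy:   stop=%d\n", stop_for_length_buggy(1, -2));   // prints 1 (stops)
    printf("guarded: stop=%d\n", stop_for_length_guarded(1, -2)); // prints 0 (keeps going)
    return 0;
}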
First Bad Commit
No response
Relevant log output
main: server is listening on http://0.0.0.0:80 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 53
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 53, n_tokens = 53, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 53, n_tokens = 53
slot release: id 0 | task 0 | stop processing: n_past = 53, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 971.71 ms / 53 tokens ( 18.33 ms per token, 54.54 tokens per second)
eval time = 0.04 ms / 1 tokens ( 0.04 ms per token, 25000.00 tokens per second)
total time = 971.75 ms / 54 tokens
srv update_slots: all slots are idle