Description
Deploying DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf on a Jetson AGX Orin works without problems, but an error occurs during inference.
The problem appears to be here: the Jetson AGX Orin's CUDA compute capability is 8.7, but as the log below shows, the compiled architecture list "CUDA : ARCHS = 600,610,700,750,800,860,890,900" does not include 870.
[2025-06-14 23:38:50.930306] I system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
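As a quick sanity check (not part of the original report), a minimal CUDA runtime sketch like the one below can confirm the device's compute capability; on an AGX Orin it is expected to print 8.7 (sm_87), which is indeed missing from the ARCHS list above.

```cpp
// Minimal sketch: query each CUDA device's compute capability.
// Build with: nvcc -o check_cc check_cc.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA device: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // On Jetson AGX Orin this should report 8.7, i.e. sm_87,
        // an architecture not present in the prebuilt binary's ARCHS list.
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```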
The complete log is as follows:
[2025-06-14 23:20:28.828328] I
[2025-06-14 23:20:28.828328] I arguments : /home/sunxu/.local/share/pipx/venvs/gpustack/lib/python3.10/site-packages/gpustack/third_party/bin/llama-box/llama-box with arguments: --host 0.0.0.0 --embeddings --gpu-layers 37 --parallel 4 --ctx-size 8192 --port 40038 --model /media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf --alias deepseek-r1-0528-qwen3-8b --no-mmap --no-warmup --ctx-size 32768 --temp 0.6 --top-p 0.95
[2025-06-14 23:20:28.828328] I version : v0.0.154 (53fe21f)
[2025-06-14 23:20:28.828328] I compiler : cc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
[2025-06-14 23:20:28.828328] I target : aarch64-redhat-linux
[2025-06-14 23:20:28.828328] I vendor : llama.cpp 3ac67535 (5586), stable-diffusion.cpp 3eb18db (204), concurrentqueue 2f09da7 (295), readerwriterqueue 16b48ae (166)
[2025-06-14 23:20:28.828584] I ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[2025-06-14 23:20:28.828584] I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[2025-06-14 23:20:28.828584] I ggml_cuda_init: found 1 CUDA devices:
[2025-06-14 23:20:28.828584] I Device 0: Orin, compute capability 8.7, VMM: yes
[2025-06-14 23:20:28.828586] I system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
[2025-06-14 23:20:28.828586] I
[2025-06-14 23:20:28.828586] I srv load: loading model '/media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf'
[2025-06-14 23:20:28.828638] I llama_model_load_from_file_impl: using device CUDA0 (Orin) - 48433 MiB free
[2025-06-14 23:20:28.828683] I llama_model_loader: loaded meta data with 37 key-value pairs and 399 tensors from /media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf (version GGUF V3 (latest))
[2025-06-14 23:20:28.828683] I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 0: general.architecture str = qwen3
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 1: general.type str = model
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 2: general.name str = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 3: general.basename str = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 4: general.quantized_by str = Unsloth
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 5: general.size_label str = 8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 7: qwen3.block_count u32 = 36
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 8: qwen3.context_length u32 = 131072
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 9: qwen3.embedding_length u32 = 4096
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 12288
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 32
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 17: qwen3.rope.scaling.type str = yarn
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 18: qwen3.rope.scaling.factor f32 = 4.000000
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 19: qwen3.rope.scaling.original_context_length u32 = 32768
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
[2025-06-14 23:20:28.828705] I llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
[2025-06-14 23:20:28.828711] I llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 151643
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 31: general.quantization_version u32 = 2
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 32: general.file_type u32 = 18
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 33: quantize.imatrix.file str = DeepSeek-R1-0528-Qwen3-8B-GGUF/imatri...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 34: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-R1-0528-...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 35: quantize.imatrix.entries_count i32 = 252
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 36: quantize.imatrix.chunks_count i32 = 713
[2025-06-14 23:20:28.828732] I llama_model_loader: - type f32: 145 tensors
[2025-06-14 23:20:28.828732] I llama_model_loader: - type q8_0: 130 tensors
[2025-06-14 23:20:28.828732] I llama_model_loader: - type q6_K: 124 tensors
[2025-06-14 23:20:28.828732] I print_info: file format = GGUF V3 (latest)
[2025-06-14 23:20:28.828732] I print_info: file type = Q6_K
[2025-06-14 23:20:28.828732] I print_info: file size = 6.97 GiB (7.31 BPW)
[2025-06-14 23:20:28.828938] W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
[2025-06-14 23:20:28.828938] I load: special tokens cache size = 28
[2025-06-14 23:20:28.828995] I load: token to piece cache size = 0.9311 MB
[2025-06-14 23:20:28.828995] I print_info: arch = qwen3
[2025-06-14 23:20:28.828995] I print_info: vocab_only = 0
[2025-06-14 23:20:28.828995] I print_info: n_ctx_train = 131072
[2025-06-14 23:20:28.828995] I print_info: n_embd = 4096
[2025-06-14 23:20:28.828995] I print_info: n_layer = 36
[2025-06-14 23:20:28.828995] I print_info: n_head = 32
[2025-06-14 23:20:28.828995] I print_info: n_head_kv = 8
[2025-06-14 23:20:28.828995] I print_info: n_rot = 128
[2025-06-14 23:20:28.828995] I print_info: n_swa = 0
[2025-06-14 23:20:28.828995] I print_info: is_swa_any = 0
[2025-06-14 23:20:28.828995] I print_info: n_embd_head_k = 128
[2025-06-14 23:20:28.828995] I print_info: n_embd_head_v = 128
[2025-06-14 23:20:28.828995] I print_info: n_gqa = 4
[2025-06-14 23:20:28.828995] I print_info: n_embd_k_gqa = 1024
[2025-06-14 23:20:28.828995] I print_info: n_embd_v_gqa = 1024
[2025-06-14 23:20:28.828995] I print_info: f_norm_eps = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_norm_rms_eps = 1.0e-06
[2025-06-14 23:20:28.828995] I print_info: f_clamp_kqv = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_max_alibi_bias = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_logit_scale = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_attn_scale = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: n_ff = 12288
[2025-06-14 23:20:28.828995] I print_info: n_expert = 0
[2025-06-14 23:20:28.828995] I print_info: n_expert_used = 0
[2025-06-14 23:20:28.828995] I print_info: causal attn = 1
[2025-06-14 23:20:28.828995] I print_info: pooling type = 0
[2025-06-14 23:20:28.828995] I print_info: rope type = 2
[2025-06-14 23:20:28.828995] I print_info: rope scaling = yarn
[2025-06-14 23:20:28.828995] I print_info: freq_base_train = 1000000.0
[2025-06-14 23:20:28.828995] I print_info: freq_scale_train = 0.25
[2025-06-14 23:20:28.828995] I print_info: n_ctx_orig_yarn = 32768
[2025-06-14 23:20:28.828995] I print_info: rope_finetuned = unknown
[2025-06-14 23:20:28.828995] I print_info: ssm_d_conv = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_d_inner = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_d_state = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_dt_rank = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_dt_b_c_rms = 0
[2025-06-14 23:20:28.828995] I print_info: model type = 8B
[2025-06-14 23:20:28.828995] I print_info: model params = 8.19 B
[2025-06-14 23:20:28.828995] I print_info: general.name = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828995] I print_info: vocab type = BPE
[2025-06-14 23:20:28.828995] I print_info: n_vocab = 151936
[2025-06-14 23:20:28.828995] I print_info: n_merges = 151387
[2025-06-14 23:20:28.828995] I print_info: BOS token = 151643 '<|begin▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOS token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOT token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: PAD token = 151654 '<|vision_pad|>'
[2025-06-14 23:20:28.828995] I print_info: LF token = 198 'Ċ'
[2025-06-14 23:20:28.828995] I print_info: FIM PRE token = 151659 '<|fim_prefix|>'
[2025-06-14 23:20:28.828995] I print_info: FIM SUF token = 151661 '<|fim_suffix|>'
[2025-06-14 23:20:28.828995] I print_info: FIM MID token = 151660 '<|fim_middle|>'
[2025-06-14 23:20:28.828995] I print_info: FIM PAD token = 151662 '<|fim_pad|>'
[2025-06-14 23:20:28.828995] I print_info: FIM REP token = 151663 '<|repo_name|>'
[2025-06-14 23:20:28.828995] I print_info: FIM SEP token = 151664 '<|file_sep|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151662 '<|fim_pad|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151663 '<|repo_name|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151664 '<|file_sep|>'
[2025-06-14 23:20:28.828995] I print_info: max token length = 256
[2025-06-14 23:20:28.828995] I load_tensors: loading model tensors, this can take a while... (mmap = false)
[2025-06-14 23:20:29.829218] I load_tensors: offloading 36 repeating layers to GPU
[2025-06-14 23:20:29.829218] I load_tensors: offloading output layer to GPU
[2025-06-14 23:20:29.829218] I load_tensors: offloaded 37/37 layers to GPU
[2025-06-14 23:20:29.829218] I load_tensors: CUDA_Host model buffer size = 630.59 MiB
[2025-06-14 23:20:29.829218] I load_tensors: CUDA0 model buffer size = 6507.27 MiB
[2025-06-14 23:20:29.829429] I loaded 8% model tensors into buffer
[2025-06-14 23:20:29.829546] I loaded 17% model tensors into buffer
[2025-06-14 23:20:29.829550] I loaded 18% model tensors into buffer
[2025-06-14 23:20:29.829573] I loaded 19% model tensors into buffer
[2025-06-14 23:20:29.829582] I loaded 20% model tensors into buffer
[2025-06-14 23:20:29.829601] I loaded 21% model tensors into buffer
[2025-06-14 23:20:29.829615] I loaded 22% model tensors into buffer
[2025-06-14 23:20:29.829625] I loaded 23% model tensors into buffer
[2025-06-14 23:20:29.829645] I loaded 24% model tensors into buffer
[2025-06-14 23:20:29.829656] I loaded 25% model tensors into buffer
[2025-06-14 23:20:29.829673] I loaded 26% model tensors into buffer
[2025-06-14 23:20:29.829694] I loaded 27% model tensors into buffer
[2025-06-14 23:20:29.829704] I loaded 28% model tensors into buffer
[2025-06-14 23:20:29.829722] I loaded 29% model tensors into buffer
[2025-06-14 23:20:29.829739] I loaded 30% model tensors into buffer
[2025-06-14 23:20:29.829754] I loaded 31% model tensors into buffer
[2025-06-14 23:20:29.829770] I loaded 32% model tensors into buffer
[2025-06-14 23:20:29.829787] I loaded 33% model tensors into buffer
[2025-06-14 23:20:29.829797] I loaded 34% model tensors into buffer
[2025-06-14 23:20:29.829809] I loaded 35% model tensors into buffer
[2025-06-14 23:20:29.829820] I loaded 36% model tensors into buffer
[2025-06-14 23:20:29.829839] I loaded 37% model tensors into buffer
[2025-06-14 23:20:29.829853] I loaded 38% model tensors into buffer
[2025-06-14 23:20:29.829861] I loaded 39% model tensors into buffer
[2025-06-14 23:20:29.829872] I loaded 40% model tensors into buffer
[2025-06-14 23:20:29.829893] I loaded 41% model tensors into buffer
[2025-06-14 23:20:29.829900] I loaded 42% model tensors into buffer
[2025-06-14 23:20:29.829921] I loaded 43% model tensors into buffer
[2025-06-14 23:20:29.829937] I loaded 44% model tensors into buffer
[2025-06-14 23:20:29.829951] I loaded 45% model tensors into buffer
[2025-06-14 23:20:29.829960] I loaded 46% model tensors into buffer
[2025-06-14 23:20:29.829974] I loaded 47% model tensors into buffer
[2025-06-14 23:20:29.829989] I loaded 48% model tensors into buffer
[2025-06-14 23:20:29.829999] I loaded 49% model tensors into buffer
[2025-06-14 23:20:30.830018] I loaded 50% model tensors into buffer
[2025-06-14 23:20:30.830027] I loaded 51% model tensors into buffer
[2025-06-14 23:20:30.830044] I loaded 52% model tensors into buffer
[2025-06-14 23:20:30.830060] I loaded 53% model tensors into buffer
[2025-06-14 23:20:30.830081] I loaded 54% model tensors into buffer
[2025-06-14 23:20:30.830099] I loaded 55% model tensors into buffer
[2025-06-14 23:20:30.830116] I loaded 56% model tensors into buffer
[2025-06-14 23:20:30.830132] I loaded 57% model tensors into buffer
[2025-06-14 23:20:30.830152] I loaded 58% model tensors into buffer
[2025-06-14 23:20:30.830170] I loaded 59% model tensors into buffer
[2025-06-14 23:20:30.830181] I loaded 60% model tensors into buffer
[2025-06-14 23:20:30.830194] I loaded 61% model tensors into buffer
[2025-06-14 23:20:30.830213] I loaded 62% model tensors into buffer
[2025-06-14 23:20:30.830230] I loaded 63% model tensors into buffer
[2025-06-14 23:20:30.830247] I loaded 64% model tensors into buffer
[2025-06-14 23:20:30.830263] I loaded 65% model tensors into buffer
[2025-06-14 23:20:30.830271] I loaded 66% model tensors into buffer
[2025-06-14 23:20:30.830287] I loaded 67% model tensors into buffer
[2025-06-14 23:20:30.830302] I loaded 68% model tensors into buffer
[2025-06-14 23:20:30.830316] I loaded 69% model tensors into buffer
[2025-06-14 23:20:30.830335] I loaded 70% model tensors into buffer
[2025-06-14 23:20:30.830355] I loaded 71% model tensors into buffer
[2025-06-14 23:20:30.830368] I loaded 72% model tensors into buffer
[2025-06-14 23:20:30.830385] I loaded 73% model tensors into buffer
[2025-06-14 23:20:30.830410] I loaded 74% model tensors into buffer
[2025-06-14 23:20:30.830426] I loaded 75% model tensors into buffer
[2025-06-14 23:20:30.830448] I loaded 76% model tensors into buffer
[2025-06-14 23:20:30.830464] I loaded 77% model tensors into buffer
[2025-06-14 23:20:30.830472] I loaded 78% model tensors into buffer
[2025-06-14 23:20:30.830487] I loaded 79% model tensors into buffer
[2025-06-14 23:20:30.830499] I loaded 80% model tensors into buffer
[2025-06-14 23:20:30.830520] I loaded 81% model tensors into buffer
[2025-06-14 23:20:30.830539] I loaded 82% model tensors into buffer
[2025-06-14 23:20:30.830551] I loaded 83% model tensors into buffer
[2025-06-14 23:20:30.830567] I loaded 84% model tensors into buffer
[2025-06-14 23:20:30.830582] I loaded 85% model tensors into buffer
[2025-06-14 23:20:30.830597] I loaded 86% model tensors into buffer
[2025-06-14 23:20:30.830615] I loaded 87% model tensors into buffer
[2025-06-14 23:20:30.830634] I loaded 88% model tensors into buffer
[2025-06-14 23:20:30.830641] I loaded 89% model tensors into buffer
[2025-06-14 23:20:30.830661] I loaded 90% model tensors into buffer
[2025-06-14 23:20:30.830682] I loaded 91% model tensors into buffer
[2025-06-14 23:20:30.830699] I loaded 92% model tensors into buffer
[2025-06-14 23:20:30.830710] I loaded 93% model tensors into buffer
[2025-06-14 23:20:30.830730] I loaded 94% model tensors into buffer
[2025-06-14 23:20:30.830755] I loaded 95% model tensors into buffer
[2025-06-14 23:20:30.830774] I loaded 96% model tensors into buffer
[2025-06-14 23:20:30.830790] I loaded 97% model tensors into buffer
[2025-06-14 23:20:30.830807] I loaded 98% model tensors into buffer
[2025-06-14 23:20:30.830817] I loaded 99% model tensors into buffer
[2025-06-14 23:20:30.830828] I loaded 100% model tensors into buffer
[2025-06-14 23:20:30.830830] I llama_context: constructing llama_context
[2025-06-14 23:20:30.830830] I llama_context: n_seq_max = 4
[2025-06-14 23:20:30.830830] I llama_context: n_ctx = 32768
[2025-06-14 23:20:30.830830] I llama_context: n_batch = 2048
[2025-06-14 23:20:30.830830] I llama_context: n_ubatch = 512
[2025-06-14 23:20:30.830830] I llama_context: causal_attn = 1
[2025-06-14 23:20:30.830830] I llama_context: flash_attn = 0
[2025-06-14 23:20:30.830830] I llama_context: freq_base = 1000000.0
[2025-06-14 23:20:30.830830] I llama_context: freq_scale = 0.25
[2025-06-14 23:20:30.830830] W llama_context: n_ctx (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[2025-06-14 23:20:30.830830] W llama_context: requested n_seq_max (4) > 1, but swa_full is not enabled -- performance may be degraded: https://github.com/ggml-org/llama.cpp/pull/13845#issuecomment-2924800573
[2025-06-14 23:20:30.830830] I llama_context: CUDA_Host output buffer size = 0.06 MiB
[2025-06-14 23:20:31.831760] I llama_kv_cache_unified: CUDA0 KV buffer size = 4608.00 MiB
[2025-06-14 23:20:31.831837] I llama_kv_cache_unified: size = 4608.00 MiB ( 32768 cells, 36 layers, 4 seqs), K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
[2025-06-14 23:20:32.832272] I llama_context: CUDA0 compute buffer size = 2080.00 MiB
[2025-06-14 23:20:32.832272] I llama_context: CUDA_Host compute buffer size = 64.01 MiB
[2025-06-14 23:20:32.832272] I llama_context: graph nodes = 1446
[2025-06-14 23:20:32.832272] I llama_context: graph splits = 1
[2025-06-14 23:20:32.832272] I common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
[2025-06-14 23:20:32.832277] I srv load: prompt caching enabled
[2025-06-14 23:20:32.832277] I srv load: context shifting enabled
[2025-06-14 23:20:32.832337] I srv load: chat template, alias: deepseek3, built-in: true, jinja rendering: disabled, tool call: supported, reasoning: supported, example:
You are a helpful assistant.
You CAN call tools to assist with the user query. Do not make assumptions about what values to plug into tools.
You are provided with following tools:
get_weather
{
"name": "get_weather",
"description": "",
"parameters": {"type":"object","properties":{"location":{"type":"string"}}}
}get_temperature
{
"name": "get_temperature",
"description": "Return the temperature according to the location.",
"parameters": {"type":"object","properties":{"location":{"type":"string"}}}
}For each tool call, just generate an answer, no explanation before or after your answer, MUST return as below:
<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>The name of the tool
The input of the tool, must be an JSON object in compact format
```<|tool▁call▁end|><|tool▁calls▁end|><|end▁of▁sentence|>
<|User|>Hello.<|Assistant|>Hi! How can I help you today?<|end▁of▁sentence|><|User|>What is the weather like in Beijing?<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_weather
```json
{"location":"Beijing"}
```<|tool▁call▁end|><|tool▁calls▁end|><|end▁of▁sentence|><|tool▁outputs▁begin|><|tool▁output▁begin|>{"weather":"Sunny"}<|tool▁output▁end|><|tool▁outputs▁end|>The weather is Sunny.<|end▁of▁sentence|><|User|>What is the temperature in Beijing?<|Assistant|>
[2025-06-14 23:20:32.832337] I srv load: tool call trigger, start: words, end: words
[2025-06-14 23:20:32.832337] I srv start: starting
[2025-06-14 23:20:32.832338] I srv start: listening host = 0.0.0.0, port = 40038
[2025-06-14 23:20:32.832338] I srv start: server is ready
[2025-06-14 23:22:34.954811] I srv request: rid 526546878654 | GET /v1/models 127.0.0.1:57106
[2025-06-14 23:22:34.954811] I srv response: rid 526546878654 | GET /v1/models 127.0.0.1:57106 | status 200 | cost 0.43ms | opened
[2025-06-14 23:22:38.958788] I srv request: rid 526550856108 | GET /v1/models 127.0.0.1:42040
[2025-06-14 23:22:38.958789] I srv response: rid 526550856108 | GET /v1/models 127.0.0.1:42040 | status 200 | cost 0.31ms | opened
[2025-06-14 23:22:43.963405] I srv request: rid 526555473395 | GET /v1/models 127.0.0.1:42054
[2025-06-14 23:22:43.963406] I srv response: rid 526555473395 | GET /v1/models 127.0.0.1:42054 | status 200 | cost 0.51ms | opened
[2025-06-14 23:22:57.977901] I srv request: rid 526569969169 | GET /v1/models 127.0.0.1:47642
[2025-06-14 23:22:57.977902] I srv response: rid 526569969169 | GET /v1/models 127.0.0.1:47642 | status 200 | cost 0.39ms | opened
[2025-06-14 23:22:57.977906] I srv request: rid 526569973633 | POST /v1/chat/completions 127.0.0.1:47656
/home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
[2025-06-14 23:22:58.978137] E ggml_cuda_compute_forward: GET_ROWS failed
[2025-06-14 23:22:58.978137] E CUDA error: no kernel image is available for execution on the device
[2025-06-14 23:22:58.978137] E current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
[2025-06-14 23:22:58.978137] E err
[New LWP 1432801]
[New LWP 1432807]
[New LWP 1432808]
[New LWP 1432865]
[New LWP 1432866]
[New LWP 1432867]
[New LWP 1432868]
[New LWP 1432869]
[New LWP 1432870]
[New LWP 1432871]
[New LWP 1432872]
[New LWP 1432873]
[New LWP 1432874]
[New LWP 1432875]
[New LWP 1432876]
[New LWP 1432877]
[New LWP 1432878]
[New LWP 1432879]
[New LWP 1432880]
[New LWP 1432881]
[New LWP 1432882]
[New LWP 1432883]
[New LWP 1432884]
[New LWP 1432885]
[New LWP 1432886]
[New LWP 1432887]
[New LWP 1432888]
[New LWP 1432889]
[New LWP 1432890]
[New LWP 1432891]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffff72c3be38 in __GI___poll (fds=0xffffc1beea88, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0 0x0000ffff72c3be38 in __GI___poll (fds=0xffffc1beea88, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41 in ../sysdeps/unix/sysv/linux/poll.c
#1 0x00000000004a0e78 in httplib::Server::listen_internal() ()
#2 0x000000000054215c in httpserver::start() ()
#3 0x000000000040eb20 in main ()
[Inferior 1 (process 1432799) detached]
Aborted