
CUDA Arch not supported on Jetson AGX Orin #54


Description

@sonkyokukou

Deploying DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf on a Jetson AGX Orin works without problems, but an error occurs during inference.

The problem appears to be here: the Jetson AGX Orin's CUDA compute capability is 8.7, but as the log below shows, the build's architecture list [CUDA : ARCHS = 600,610,700,750,800,860,890,900] does not include 870.

[2025-06-14 23:38:50.930306] I system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
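As a sanity check, the device's compute capability can be confirmed with a few lines against the CUDA runtime API (a minimal standalone sketch, not part of llama-box or GPUStack; the file name is illustrative). On the AGX Orin it should print 8.7, matching the `Device 0: Orin, compute capability 8.7` line in the full log below:

```cuda
// query_cc.cu -- minimal sketch: print each CUDA device's compute capability.
// Build: nvcc query_cc.cu -o query_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // On the Jetson AGX Orin this is expected to print "compute capability 8.7".
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```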

The full log is as follows:

[2025-06-14 23:20:28.828328] I
[2025-06-14 23:20:28.828328] I arguments : /home/sunxu/.local/share/pipx/venvs/gpustack/lib/python3.10/site-packages/gpustack/third_party/bin/llama-box/llama-box with arguments: --host 0.0.0.0 --embeddings --gpu-layers 37 --parallel 4 --ctx-size 8192 --port 40038 --model /media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf --alias deepseek-r1-0528-qwen3-8b --no-mmap --no-warmup --ctx-size 32768 --temp 0.6 --top-p 0.95
[2025-06-14 23:20:28.828328] I version : v0.0.154 (53fe21f)
[2025-06-14 23:20:28.828328] I compiler : cc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
[2025-06-14 23:20:28.828328] I target : aarch64-redhat-linux
[2025-06-14 23:20:28.828328] I vendor : llama.cpp 3ac67535 (5586), stable-diffusion.cpp 3eb18db (204), concurrentqueue 2f09da7 (295), readerwriterqueue 16b48ae (166)
[2025-06-14 23:20:28.828584] I ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[2025-06-14 23:20:28.828584] I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[2025-06-14 23:20:28.828584] I ggml_cuda_init: found 1 CUDA devices:
[2025-06-14 23:20:28.828584] I Device 0: Orin, compute capability 8.7, VMM: yes
[2025-06-14 23:20:28.828586] I system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
[2025-06-14 23:20:28.828586] I
[2025-06-14 23:20:28.828586] I srv load: loading model '/media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf'
[2025-06-14 23:20:28.828638] I llama_model_load_from_file_impl: using device CUDA0 (Orin) - 48433 MiB free
[2025-06-14 23:20:28.828683] I llama_model_loader: loaded meta data with 37 key-value pairs and 399 tensors from /media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf (version GGUF V3 (latest))
[2025-06-14 23:20:28.828683] I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 0: general.architecture str = qwen3
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 1: general.type str = model
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 2: general.name str = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 3: general.basename str = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 4: general.quantized_by str = Unsloth
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 5: general.size_label str = 8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 7: qwen3.block_count u32 = 36
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 8: qwen3.context_length u32 = 131072
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 9: qwen3.embedding_length u32 = 4096
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 12288
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 32
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 17: qwen3.rope.scaling.type str = yarn
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 18: qwen3.rope.scaling.factor f32 = 4.000000
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 19: qwen3.rope.scaling.original_context_length u32 = 32768
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
[2025-06-14 23:20:28.828705] I llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
[2025-06-14 23:20:28.828711] I llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 151643
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 31: general.quantization_version u32 = 2
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 32: general.file_type u32 = 18
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 33: quantize.imatrix.file str = DeepSeek-R1-0528-Qwen3-8B-GGUF/imatri...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 34: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-R1-0528-...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 35: quantize.imatrix.entries_count i32 = 252
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 36: quantize.imatrix.chunks_count i32 = 713
[2025-06-14 23:20:28.828732] I llama_model_loader: - type f32: 145 tensors
[2025-06-14 23:20:28.828732] I llama_model_loader: - type q8_0: 130 tensors
[2025-06-14 23:20:28.828732] I llama_model_loader: - type q6_K: 124 tensors
[2025-06-14 23:20:28.828732] I print_info: file format = GGUF V3 (latest)
[2025-06-14 23:20:28.828732] I print_info: file type = Q6_K
[2025-06-14 23:20:28.828732] I print_info: file size = 6.97 GiB (7.31 BPW)
[2025-06-14 23:20:28.828938] W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
[2025-06-14 23:20:28.828938] I load: special tokens cache size = 28
[2025-06-14 23:20:28.828995] I load: token to piece cache size = 0.9311 MB
[2025-06-14 23:20:28.828995] I print_info: arch = qwen3
[2025-06-14 23:20:28.828995] I print_info: vocab_only = 0
[2025-06-14 23:20:28.828995] I print_info: n_ctx_train = 131072
[2025-06-14 23:20:28.828995] I print_info: n_embd = 4096
[2025-06-14 23:20:28.828995] I print_info: n_layer = 36
[2025-06-14 23:20:28.828995] I print_info: n_head = 32
[2025-06-14 23:20:28.828995] I print_info: n_head_kv = 8
[2025-06-14 23:20:28.828995] I print_info: n_rot = 128
[2025-06-14 23:20:28.828995] I print_info: n_swa = 0
[2025-06-14 23:20:28.828995] I print_info: is_swa_any = 0
[2025-06-14 23:20:28.828995] I print_info: n_embd_head_k = 128
[2025-06-14 23:20:28.828995] I print_info: n_embd_head_v = 128
[2025-06-14 23:20:28.828995] I print_info: n_gqa = 4
[2025-06-14 23:20:28.828995] I print_info: n_embd_k_gqa = 1024
[2025-06-14 23:20:28.828995] I print_info: n_embd_v_gqa = 1024
[2025-06-14 23:20:28.828995] I print_info: f_norm_eps = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_norm_rms_eps = 1.0e-06
[2025-06-14 23:20:28.828995] I print_info: f_clamp_kqv = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_max_alibi_bias = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_logit_scale = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_attn_scale = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: n_ff = 12288
[2025-06-14 23:20:28.828995] I print_info: n_expert = 0
[2025-06-14 23:20:28.828995] I print_info: n_expert_used = 0
[2025-06-14 23:20:28.828995] I print_info: causal attn = 1
[2025-06-14 23:20:28.828995] I print_info: pooling type = 0
[2025-06-14 23:20:28.828995] I print_info: rope type = 2
[2025-06-14 23:20:28.828995] I print_info: rope scaling = yarn
[2025-06-14 23:20:28.828995] I print_info: freq_base_train = 1000000.0
[2025-06-14 23:20:28.828995] I print_info: freq_scale_train = 0.25
[2025-06-14 23:20:28.828995] I print_info: n_ctx_orig_yarn = 32768
[2025-06-14 23:20:28.828995] I print_info: rope_finetuned = unknown
[2025-06-14 23:20:28.828995] I print_info: ssm_d_conv = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_d_inner = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_d_state = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_dt_rank = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_dt_b_c_rms = 0
[2025-06-14 23:20:28.828995] I print_info: model type = 8B
[2025-06-14 23:20:28.828995] I print_info: model params = 8.19 B
[2025-06-14 23:20:28.828995] I print_info: general.name = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828995] I print_info: vocab type = BPE
[2025-06-14 23:20:28.828995] I print_info: n_vocab = 151936
[2025-06-14 23:20:28.828995] I print_info: n_merges = 151387
[2025-06-14 23:20:28.828995] I print_info: BOS token = 151643 '<|begin▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOS token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOT token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: PAD token = 151654 '<|vision_pad|>'
[2025-06-14 23:20:28.828995] I print_info: LF token = 198 'Ċ'
[2025-06-14 23:20:28.828995] I print_info: FIM PRE token = 151659 '<|fim_prefix|>'
[2025-06-14 23:20:28.828995] I print_info: FIM SUF token = 151661 '<|fim_suffix|>'
[2025-06-14 23:20:28.828995] I print_info: FIM MID token = 151660 '<|fim_middle|>'
[2025-06-14 23:20:28.828995] I print_info: FIM PAD token = 151662 '<|fim_pad|>'
[2025-06-14 23:20:28.828995] I print_info: FIM REP token = 151663 '<|repo_name|>'
[2025-06-14 23:20:28.828995] I print_info: FIM SEP token = 151664 '<|file_sep|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151662 '<|fim_pad|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151663 '<|repo_name|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151664 '<|file_sep|>'
[2025-06-14 23:20:28.828995] I print_info: max token length = 256
[2025-06-14 23:20:28.828995] I load_tensors: loading model tensors, this can take a while... (mmap = false)
[2025-06-14 23:20:29.829218] I load_tensors: offloading 36 repeating layers to GPU
[2025-06-14 23:20:29.829218] I load_tensors: offloading output layer to GPU
[2025-06-14 23:20:29.829218] I load_tensors: offloaded 37/37 layers to GPU
[2025-06-14 23:20:29.829218] I load_tensors: CUDA_Host model buffer size = 630.59 MiB
[2025-06-14 23:20:29.829218] I load_tensors: CUDA0 model buffer size = 6507.27 MiB
[2025-06-14 23:20:29.829429] I loaded 8% model tensors into buffer
[2025-06-14 23:20:29.829546] I loaded 17% model tensors into buffer
[2025-06-14 23:20:29.829550] I loaded 18% model tensors into buffer
[2025-06-14 23:20:29.829573] I loaded 19% model tensors into buffer
[2025-06-14 23:20:29.829582] I loaded 20% model tensors into buffer
[2025-06-14 23:20:29.829601] I loaded 21% model tensors into buffer
[2025-06-14 23:20:29.829615] I loaded 22% model tensors into buffer
[2025-06-14 23:20:29.829625] I loaded 23% model tensors into buffer
[2025-06-14 23:20:29.829645] I loaded 24% model tensors into buffer
[2025-06-14 23:20:29.829656] I loaded 25% model tensors into buffer
[2025-06-14 23:20:29.829673] I loaded 26% model tensors into buffer
[2025-06-14 23:20:29.829694] I loaded 27% model tensors into buffer
[2025-06-14 23:20:29.829704] I loaded 28% model tensors into buffer
[2025-06-14 23:20:29.829722] I loaded 29% model tensors into buffer
[2025-06-14 23:20:29.829739] I loaded 30% model tensors into buffer
[2025-06-14 23:20:29.829754] I loaded 31% model tensors into buffer
[2025-06-14 23:20:29.829770] I loaded 32% model tensors into buffer
[2025-06-14 23:20:29.829787] I loaded 33% model tensors into buffer
[2025-06-14 23:20:29.829797] I loaded 34% model tensors into buffer
[2025-06-14 23:20:29.829809] I loaded 35% model tensors into buffer
[2025-06-14 23:20:29.829820] I loaded 36% model tensors into buffer
[2025-06-14 23:20:29.829839] I loaded 37% model tensors into buffer
[2025-06-14 23:20:29.829853] I loaded 38% model tensors into buffer
[2025-06-14 23:20:29.829861] I loaded 39% model tensors into buffer
[2025-06-14 23:20:29.829872] I loaded 40% model tensors into buffer
[2025-06-14 23:20:29.829893] I loaded 41% model tensors into buffer
[2025-06-14 23:20:29.829900] I loaded 42% model tensors into buffer
[2025-06-14 23:20:29.829921] I loaded 43% model tensors into buffer
[2025-06-14 23:20:29.829937] I loaded 44% model tensors into buffer
[2025-06-14 23:20:29.829951] I loaded 45% model tensors into buffer
[2025-06-14 23:20:29.829960] I loaded 46% model tensors into buffer
[2025-06-14 23:20:29.829974] I loaded 47% model tensors into buffer
[2025-06-14 23:20:29.829989] I loaded 48% model tensors into buffer
[2025-06-14 23:20:29.829999] I loaded 49% model tensors into buffer
[2025-06-14 23:20:30.830018] I loaded 50% model tensors into buffer
[2025-06-14 23:20:30.830027] I loaded 51% model tensors into buffer
[2025-06-14 23:20:30.830044] I loaded 52% model tensors into buffer
[2025-06-14 23:20:30.830060] I loaded 53% model tensors into buffer
[2025-06-14 23:20:30.830081] I loaded 54% model tensors into buffer
[2025-06-14 23:20:30.830099] I loaded 55% model tensors into buffer
[2025-06-14 23:20:30.830116] I loaded 56% model tensors into buffer
[2025-06-14 23:20:30.830132] I loaded 57% model tensors into buffer
[2025-06-14 23:20:30.830152] I loaded 58% model tensors into buffer
[2025-06-14 23:20:30.830170] I loaded 59% model tensors into buffer
[2025-06-14 23:20:30.830181] I loaded 60% model tensors into buffer
[2025-06-14 23:20:30.830194] I loaded 61% model tensors into buffer
[2025-06-14 23:20:30.830213] I loaded 62% model tensors into buffer
[2025-06-14 23:20:30.830230] I loaded 63% model tensors into buffer
[2025-06-14 23:20:30.830247] I loaded 64% model tensors into buffer
[2025-06-14 23:20:30.830263] I loaded 65% model tensors into buffer
[2025-06-14 23:20:30.830271] I loaded 66% model tensors into buffer
[2025-06-14 23:20:30.830287] I loaded 67% model tensors into buffer
[2025-06-14 23:20:30.830302] I loaded 68% model tensors into buffer
[2025-06-14 23:20:30.830316] I loaded 69% model tensors into buffer
[2025-06-14 23:20:30.830335] I loaded 70% model tensors into buffer
[2025-06-14 23:20:30.830355] I loaded 71% model tensors into buffer
[2025-06-14 23:20:30.830368] I loaded 72% model tensors into buffer
[2025-06-14 23:20:30.830385] I loaded 73% model tensors into buffer
[2025-06-14 23:20:30.830410] I loaded 74% model tensors into buffer
[2025-06-14 23:20:30.830426] I loaded 75% model tensors into buffer
[2025-06-14 23:20:30.830448] I loaded 76% model tensors into buffer
[2025-06-14 23:20:30.830464] I loaded 77% model tensors into buffer
[2025-06-14 23:20:30.830472] I loaded 78% model tensors into buffer
[2025-06-14 23:20:30.830487] I loaded 79% model tensors into buffer
[2025-06-14 23:20:30.830499] I loaded 80% model tensors into buffer
[2025-06-14 23:20:30.830520] I loaded 81% model tensors into buffer
[2025-06-14 23:20:30.830539] I loaded 82% model tensors into buffer
[2025-06-14 23:20:30.830551] I loaded 83% model tensors into buffer
[2025-06-14 23:20:30.830567] I loaded 84% model tensors into buffer
[2025-06-14 23:20:30.830582] I loaded 85% model tensors into buffer
[2025-06-14 23:20:30.830597] I loaded 86% model tensors into buffer
[2025-06-14 23:20:30.830615] I loaded 87% model tensors into buffer
[2025-06-14 23:20:30.830634] I loaded 88% model tensors into buffer
[2025-06-14 23:20:30.830641] I loaded 89% model tensors into buffer
[2025-06-14 23:20:30.830661] I loaded 90% model tensors into buffer
[2025-06-14 23:20:30.830682] I loaded 91% model tensors into buffer
[2025-06-14 23:20:30.830699] I loaded 92% model tensors into buffer
[2025-06-14 23:20:30.830710] I loaded 93% model tensors into buffer
[2025-06-14 23:20:30.830730] I loaded 94% model tensors into buffer
[2025-06-14 23:20:30.830755] I loaded 95% model tensors into buffer
[2025-06-14 23:20:30.830774] I loaded 96% model tensors into buffer
[2025-06-14 23:20:30.830790] I loaded 97% model tensors into buffer
[2025-06-14 23:20:30.830807] I loaded 98% model tensors into buffer
[2025-06-14 23:20:30.830817] I loaded 99% model tensors into buffer
[2025-06-14 23:20:30.830828] I loaded 100% model tensors into buffer
[2025-06-14 23:20:30.830830] I llama_context: constructing llama_context
[2025-06-14 23:20:30.830830] I llama_context: n_seq_max = 4
[2025-06-14 23:20:30.830830] I llama_context: n_ctx = 32768
[2025-06-14 23:20:30.830830] I llama_context: n_batch = 2048
[2025-06-14 23:20:30.830830] I llama_context: n_ubatch = 512
[2025-06-14 23:20:30.830830] I llama_context: causal_attn = 1
[2025-06-14 23:20:30.830830] I llama_context: flash_attn = 0
[2025-06-14 23:20:30.830830] I llama_context: freq_base = 1000000.0
[2025-06-14 23:20:30.830830] I llama_context: freq_scale = 0.25
[2025-06-14 23:20:30.830830] W llama_context: n_ctx (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[2025-06-14 23:20:30.830830] W llama_context: requested n_seq_max (4) > 1, but swa_full is not enabled -- performance may be degraded: https://github.com/ggml-org/llama.cpp/pull/13845#issuecomment-2924800573
[2025-06-14 23:20:30.830830] I llama_context: CUDA_Host output buffer size = 0.06 MiB
[2025-06-14 23:20:31.831760] I llama_kv_cache_unified: CUDA0 KV buffer size = 4608.00 MiB
[2025-06-14 23:20:31.831837] I llama_kv_cache_unified: size = 4608.00 MiB ( 32768 cells, 36 layers, 4 seqs), K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
[2025-06-14 23:20:32.832272] I llama_context: CUDA0 compute buffer size = 2080.00 MiB
[2025-06-14 23:20:32.832272] I llama_context: CUDA_Host compute buffer size = 64.01 MiB
[2025-06-14 23:20:32.832272] I llama_context: graph nodes = 1446
[2025-06-14 23:20:32.832272] I llama_context: graph splits = 1
[2025-06-14 23:20:32.832272] I common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
[2025-06-14 23:20:32.832277] I srv load: prompt caching enabled
[2025-06-14 23:20:32.832277] I srv load: context shifting enabled
[2025-06-14 23:20:32.832337] I srv load: chat template, alias: deepseek3, built-in: true, jinja rendering: disabled, tool call: supported, reasoning: supported, example:
You are a helpful assistant.

You CAN call tools to assist with the user query. Do not make assumptions about what values to plug into tools.

You are provided with following tools:

  • get_weather
{
  "name": "get_weather",
  "description": "",
  "parameters": {"type":"object","properties":{"location":{"type":"string"}}}
}
  • get_temperature
{
  "name": "get_temperature",
  "description": "Return the temperature according to the location.",
  "parameters": {"type":"object","properties":{"location":{"type":"string"}}}
}

For each tool call, just generate an answer, no explanation before or after your answer, MUST return as below:
<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>The name of the tool

The input of the tool, must be an JSON object in compact format
```<|tool▁call▁end|><|tool▁calls▁end|><|end▁of▁sentence|>

<|User|>Hello.<|Assistant|>Hi! How can I help you today?<|end▁of▁sentence|><|User|>What is the weather like in Beijing?<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_weather
```json
{"location":"Beijing"}
```<|tool▁call▁end|><|tool▁calls▁end|><|end▁of▁sentence|><|tool▁outputs▁begin|><|tool▁output▁begin|>{"weather":"Sunny"}<|tool▁output▁end|><|tool▁outputs▁end|>The weather is Sunny.<|end▁of▁sentence|><|User|>What is the temperature in Beijing?<|Assistant|>
[2025-06-14 23:20:32.832337] I srv                      load: tool call trigger, start: words, end: words
[2025-06-14 23:20:32.832337] I srv                     start: starting
[2025-06-14 23:20:32.832338] I srv                     start: listening host = 0.0.0.0, port = 40038
[2025-06-14 23:20:32.832338] I srv                     start: server is ready
[2025-06-14 23:22:34.954811] I srv                   request: rid 526546878654 |  GET /v1/models 127.0.0.1:57106
[2025-06-14 23:22:34.954811] I srv                  response: rid 526546878654 |  GET /v1/models 127.0.0.1:57106 | status 200 | cost 0.43ms | opened
[2025-06-14 23:22:38.958788] I srv                   request: rid 526550856108 |  GET /v1/models 127.0.0.1:42040
[2025-06-14 23:22:38.958789] I srv                  response: rid 526550856108 |  GET /v1/models 127.0.0.1:42040 | status 200 | cost 0.31ms | opened
[2025-06-14 23:22:43.963405] I srv                   request: rid 526555473395 |  GET /v1/models 127.0.0.1:42054
[2025-06-14 23:22:43.963406] I srv                  response: rid 526555473395 |  GET /v1/models 127.0.0.1:42054 | status 200 | cost 0.51ms | opened
[2025-06-14 23:22:57.977901] I srv                   request: rid 526569969169 |  GET /v1/models 127.0.0.1:47642
[2025-06-14 23:22:57.977902] I srv                  response: rid 526569969169 |  GET /v1/models 127.0.0.1:47642 | status 200 | cost 0.39ms | opened
[2025-06-14 23:22:57.977906] I srv                   request: rid 526569973633 | POST /v1/chat/completions 127.0.0.1:47656
/home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
[2025-06-14 23:22:58.978137] E ggml_cuda_compute_forward: GET_ROWS failed
[2025-06-14 23:22:58.978137] E CUDA error: no kernel image is available for execution on the device
[2025-06-14 23:22:58.978137] E   current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
[2025-06-14 23:22:58.978137] E   err
[New LWP 1432801]
[New LWP 1432807]
[New LWP 1432808]
[New LWP 1432865]
[New LWP 1432866]
[New LWP 1432867]
[New LWP 1432868]
[New LWP 1432869]
[New LWP 1432870]
[New LWP 1432871]
[New LWP 1432872]
[New LWP 1432873]
[New LWP 1432874]
[New LWP 1432875]
[New LWP 1432876]
[New LWP 1432877]
[New LWP 1432878]
[New LWP 1432879]
[New LWP 1432880]
[New LWP 1432881]
[New LWP 1432882]
[New LWP 1432883]
[New LWP 1432884]
[New LWP 1432885]
[New LWP 1432886]
[New LWP 1432887]
[New LWP 1432888]
[New LWP 1432889]
[New LWP 1432890]
[New LWP 1432891]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffff72c3be38 in __GI___poll (fds=0xffffc1beea88, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0  0x0000ffff72c3be38 in __GI___poll (fds=0xffffc1beea88, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41      in ../sysdeps/unix/sysv/linux/poll.c
#1  0x00000000004a0e78 in httplib::Server::listen_internal() ()
#2  0x000000000054215c in httpserver::start() ()
#3  0x000000000040eb20 in main ()
[Inferior 1 (process 1432799) detached]
Aborted
-----------------------------------------------------------------------------------------------
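For context, "no kernel image is available for execution on the device" is the CUDA runtime error `cudaErrorNoKernelImageForDevice`, reported when none of the kernel images embedded in the binary can run on the current device. The same error class can be reproduced with a minimal standalone sketch (unrelated to llama-box; file name and build line are illustrative) by compiling only for an architecture the Orin cannot execute:

```cuda
// no_kernel_image.cu -- sketch of how the error in the log above arises.
// Build for a mismatched architecture to reproduce, e.g.:
//   nvcc -gencode arch=compute_90,code=sm_90 no_kernel_image.cu -o demo
// Running the result on an sm_87 device (AGX Orin) reports
// cudaErrorNoKernelImageForDevice, the same error class as in the log.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();                      // launch a trivial kernel
    cudaError_t err = cudaGetLastError();  // pick up the launch error, if any
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceSynchronize();
    printf("kernel ran\n");
    return 0;
}
```

Conversely, a build whose CUDA architecture list includes 87 (for CMake-based llama.cpp builds this is typically controlled via CMAKE_CUDA_ARCHITECTURES) would carry a kernel image the Orin can execute, which is presumably what is missing from the shipped llama-box binary given the ARCHS list in the log.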
