
CUDA Arch not supported on Jetson AGX Orin #54


Description

@sonkyokukou

Deploying DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf on a Jetson AGX Orin works without problems, but an error occurs during inference.

The problem appears to be here: the Jetson AGX Orin's CUDA compute capability is 8.7, but as the log below shows, the build's architecture list [CUDA : ARCHS = 600,610,700,750,800,860,890,900] does not include 870.

[2025-06-14 23:38:50.930306] I system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
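As a sanity check, the device's compute capability can be confirmed with a few lines against the CUDA runtime API (a minimal standalone sketch, not part of llama-box or GPUStack; the file name is illustrative). On the AGX Orin it should print 8.7, matching the `Device 0: Orin, compute capability 8.7` line in the full log below:

```cuda
// query_cc.cu -- minimal sketch: print each CUDA device's compute capability.
// Build: nvcc query_cc.cu -o query_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // On the Jetson AGX Orin this is expected to print "compute capability 8.7".
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```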

The full log is as follows:

[2025-06-14 23:20:28.828328] I
[2025-06-14 23:20:28.828328] I arguments : /home/sunxu/.local/share/pipx/venvs/gpustack/lib/python3.10/site-packages/gpustack/third_party/bin/llama-box/llama-box with arguments: --host 0.0.0.0 --embeddings --gpu-layers 37 --parallel 4 --ctx-size 8192 --port 40038 --model /media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf --alias deepseek-r1-0528-qwen3-8b --no-mmap --no-warmup --ctx-size 32768 --temp 0.6 --top-p 0.95
[2025-06-14 23:20:28.828328] I version : v0.0.154 (53fe21f)
[2025-06-14 23:20:28.828328] I compiler : cc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
[2025-06-14 23:20:28.828328] I target : aarch64-redhat-linux
[2025-06-14 23:20:28.828328] I vendor : llama.cpp 3ac67535 (5586), stable-diffusion.cpp 3eb18db (204), concurrentqueue 2f09da7 (295), readerwriterqueue 16b48ae (166)
[2025-06-14 23:20:28.828584] I ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[2025-06-14 23:20:28.828584] I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[2025-06-14 23:20:28.828584] I ggml_cuda_init: found 1 CUDA devices:
[2025-06-14 23:20:28.828584] I Device 0: Orin, compute capability 8.7, VMM: yes
[2025-06-14 23:20:28.828586] I system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
[2025-06-14 23:20:28.828586] I
[2025-06-14 23:20:28.828586] I srv load: loading model '/media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf'
[2025-06-14 23:20:28.828638] I llama_model_load_from_file_impl: using device CUDA0 (Orin) - 48433 MiB free
[2025-06-14 23:20:28.828683] I llama_model_loader: loaded meta data with 37 key-value pairs and 399 tensors from /media/sunxu/llm/gpustack-home/cache/huggingface/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q6_K_XL.gguf (version GGUF V3 (latest))
[2025-06-14 23:20:28.828683] I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 0: general.architecture str = qwen3
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 1: general.type str = model
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 2: general.name str = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 3: general.basename str = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 4: general.quantized_by str = Unsloth
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 5: general.size_label str = 8B
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 7: qwen3.block_count u32 = 36
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 8: qwen3.context_length u32 = 131072
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 9: qwen3.embedding_length u32 = 4096
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 12288
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 32
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 17: qwen3.rope.scaling.type str = yarn
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 18: qwen3.rope.scaling.factor f32 = 4.000000
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 19: qwen3.rope.scaling.original_context_length u32 = 32768
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
[2025-06-14 23:20:28.828683] I llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
[2025-06-14 23:20:28.828705] I llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
[2025-06-14 23:20:28.828711] I llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 151643
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 31: general.quantization_version u32 = 2
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 32: general.file_type u32 = 18
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 33: quantize.imatrix.file str = DeepSeek-R1-0528-Qwen3-8B-GGUF/imatri...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 34: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-R1-0528-...
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 35: quantize.imatrix.entries_count i32 = 252
[2025-06-14 23:20:28.828732] I llama_model_loader: - kv 36: quantize.imatrix.chunks_count i32 = 713
[2025-06-14 23:20:28.828732] I llama_model_loader: - type f32: 145 tensors
[2025-06-14 23:20:28.828732] I llama_model_loader: - type q8_0: 130 tensors
[2025-06-14 23:20:28.828732] I llama_model_loader: - type q6_K: 124 tensors
[2025-06-14 23:20:28.828732] I print_info: file format = GGUF V3 (latest)
[2025-06-14 23:20:28.828732] I print_info: file type = Q6_K
[2025-06-14 23:20:28.828732] I print_info: file size = 6.97 GiB (7.31 BPW)
[2025-06-14 23:20:28.828938] W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
[2025-06-14 23:20:28.828938] I load: special tokens cache size = 28
[2025-06-14 23:20:28.828995] I load: token to piece cache size = 0.9311 MB
[2025-06-14 23:20:28.828995] I print_info: arch = qwen3
[2025-06-14 23:20:28.828995] I print_info: vocab_only = 0
[2025-06-14 23:20:28.828995] I print_info: n_ctx_train = 131072
[2025-06-14 23:20:28.828995] I print_info: n_embd = 4096
[2025-06-14 23:20:28.828995] I print_info: n_layer = 36
[2025-06-14 23:20:28.828995] I print_info: n_head = 32
[2025-06-14 23:20:28.828995] I print_info: n_head_kv = 8
[2025-06-14 23:20:28.828995] I print_info: n_rot = 128
[2025-06-14 23:20:28.828995] I print_info: n_swa = 0
[2025-06-14 23:20:28.828995] I print_info: is_swa_any = 0
[2025-06-14 23:20:28.828995] I print_info: n_embd_head_k = 128
[2025-06-14 23:20:28.828995] I print_info: n_embd_head_v = 128
[2025-06-14 23:20:28.828995] I print_info: n_gqa = 4
[2025-06-14 23:20:28.828995] I print_info: n_embd_k_gqa = 1024
[2025-06-14 23:20:28.828995] I print_info: n_embd_v_gqa = 1024
[2025-06-14 23:20:28.828995] I print_info: f_norm_eps = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_norm_rms_eps = 1.0e-06
[2025-06-14 23:20:28.828995] I print_info: f_clamp_kqv = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_max_alibi_bias = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_logit_scale = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: f_attn_scale = 0.0e+00
[2025-06-14 23:20:28.828995] I print_info: n_ff = 12288
[2025-06-14 23:20:28.828995] I print_info: n_expert = 0
[2025-06-14 23:20:28.828995] I print_info: n_expert_used = 0
[2025-06-14 23:20:28.828995] I print_info: causal attn = 1
[2025-06-14 23:20:28.828995] I print_info: pooling type = 0
[2025-06-14 23:20:28.828995] I print_info: rope type = 2
[2025-06-14 23:20:28.828995] I print_info: rope scaling = yarn
[2025-06-14 23:20:28.828995] I print_info: freq_base_train = 1000000.0
[2025-06-14 23:20:28.828995] I print_info: freq_scale_train = 0.25
[2025-06-14 23:20:28.828995] I print_info: n_ctx_orig_yarn = 32768
[2025-06-14 23:20:28.828995] I print_info: rope_finetuned = unknown
[2025-06-14 23:20:28.828995] I print_info: ssm_d_conv = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_d_inner = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_d_state = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_dt_rank = 0
[2025-06-14 23:20:28.828995] I print_info: ssm_dt_b_c_rms = 0
[2025-06-14 23:20:28.828995] I print_info: model type = 8B
[2025-06-14 23:20:28.828995] I print_info: model params = 8.19 B
[2025-06-14 23:20:28.828995] I print_info: general.name = Deepseek-R1-0528-Qwen3-8B
[2025-06-14 23:20:28.828995] I print_info: vocab type = BPE
[2025-06-14 23:20:28.828995] I print_info: n_vocab = 151936
[2025-06-14 23:20:28.828995] I print_info: n_merges = 151387
[2025-06-14 23:20:28.828995] I print_info: BOS token = 151643 '<|begin▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOS token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOT token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: PAD token = 151654 '<|vision_pad|>'
[2025-06-14 23:20:28.828995] I print_info: LF token = 198 'Ċ'
[2025-06-14 23:20:28.828995] I print_info: FIM PRE token = 151659 '<|fim_prefix|>'
[2025-06-14 23:20:28.828995] I print_info: FIM SUF token = 151661 '<|fim_suffix|>'
[2025-06-14 23:20:28.828995] I print_info: FIM MID token = 151660 '<|fim_middle|>'
[2025-06-14 23:20:28.828995] I print_info: FIM PAD token = 151662 '<|fim_pad|>'
[2025-06-14 23:20:28.828995] I print_info: FIM REP token = 151663 '<|repo_name|>'
[2025-06-14 23:20:28.828995] I print_info: FIM SEP token = 151664 '<|file_sep|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151645 '<|end▁of▁sentence|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151662 '<|fim_pad|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151663 '<|repo_name|>'
[2025-06-14 23:20:28.828995] I print_info: EOG token = 151664 '<|file_sep|>'
[2025-06-14 23:20:28.828995] I print_info: max token length = 256
[2025-06-14 23:20:28.828995] I load_tensors: loading model tensors, this can take a while... (mmap = false)
[2025-06-14 23:20:29.829218] I load_tensors: offloading 36 repeating layers to GPU
[2025-06-14 23:20:29.829218] I load_tensors: offloading output layer to GPU
[2025-06-14 23:20:29.829218] I load_tensors: offloaded 37/37 layers to GPU
[2025-06-14 23:20:29.829218] I load_tensors: CUDA_Host model buffer size = 630.59 MiB
[2025-06-14 23:20:29.829218] I load_tensors: CUDA0 model buffer size = 6507.27 MiB
[2025-06-14 23:20:29.829429] I loaded 8% model tensors into buffer
[2025-06-14 23:20:29.829546] I loaded 17% model tensors into buffer
[2025-06-14 23:20:29.829550] I loaded 18% model tensors into buffer
[2025-06-14 23:20:29.829573] I loaded 19% model tensors into buffer
[2025-06-14 23:20:29.829582] I loaded 20% model tensors into buffer
[2025-06-14 23:20:29.829601] I loaded 21% model tensors into buffer
[2025-06-14 23:20:29.829615] I loaded 22% model tensors into buffer
[2025-06-14 23:20:29.829625] I loaded 23% model tensors into buffer
[2025-06-14 23:20:29.829645] I loaded 24% model tensors into buffer
[2025-06-14 23:20:29.829656] I loaded 25% model tensors into buffer
[2025-06-14 23:20:29.829673] I loaded 26% model tensors into buffer
[2025-06-14 23:20:29.829694] I loaded 27% model tensors into buffer
[2025-06-14 23:20:29.829704] I loaded 28% model tensors into buffer
[2025-06-14 23:20:29.829722] I loaded 29% model tensors into buffer
[2025-06-14 23:20:29.829739] I loaded 30% model tensors into buffer
[2025-06-14 23:20:29.829754] I loaded 31% model tensors into buffer
[2025-06-14 23:20:29.829770] I loaded 32% model tensors into buffer
[2025-06-14 23:20:29.829787] I loaded 33% model tensors into buffer
[2025-06-14 23:20:29.829797] I loaded 34% model tensors into buffer
[2025-06-14 23:20:29.829809] I loaded 35% model tensors into buffer
[2025-06-14 23:20:29.829820] I loaded 36% model tensors into buffer
[2025-06-14 23:20:29.829839] I loaded 37% model tensors into buffer
[2025-06-14 23:20:29.829853] I loaded 38% model tensors into buffer
[2025-06-14 23:20:29.829861] I loaded 39% model tensors into buffer
[2025-06-14 23:20:29.829872] I loaded 40% model tensors into buffer
[2025-06-14 23:20:29.829893] I loaded 41% model tensors into buffer
[2025-06-14 23:20:29.829900] I loaded 42% model tensors into buffer
[2025-06-14 23:20:29.829921] I loaded 43% model tensors into buffer
[2025-06-14 23:20:29.829937] I loaded 44% model tensors into buffer
[2025-06-14 23:20:29.829951] I loaded 45% model tensors into buffer
[2025-06-14 23:20:29.829960] I loaded 46% model tensors into buffer
[2025-06-14 23:20:29.829974] I loaded 47% model tensors into buffer
[2025-06-14 23:20:29.829989] I loaded 48% model tensors into buffer
[2025-06-14 23:20:29.829999] I loaded 49% model tensors into buffer
[2025-06-14 23:20:30.830018] I loaded 50% model tensors into buffer
[2025-06-14 23:20:30.830027] I loaded 51% model tensors into buffer
[2025-06-14 23:20:30.830044] I loaded 52% model tensors into buffer
[2025-06-14 23:20:30.830060] I loaded 53% model tensors into buffer
[2025-06-14 23:20:30.830081] I loaded 54% model tensors into buffer
[2025-06-14 23:20:30.830099] I loaded 55% model tensors into buffer
[2025-06-14 23:20:30.830116] I loaded 56% model tensors into buffer
[2025-06-14 23:20:30.830132] I loaded 57% model tensors into buffer
[2025-06-14 23:20:30.830152] I loaded 58% model tensors into buffer
[2025-06-14 23:20:30.830170] I loaded 59% model tensors into buffer
[2025-06-14 23:20:30.830181] I loaded 60% model tensors into buffer
[2025-06-14 23:20:30.830194] I loaded 61% model tensors into buffer
[2025-06-14 23:20:30.830213] I loaded 62% model tensors into buffer
[2025-06-14 23:20:30.830230] I loaded 63% model tensors into buffer
[2025-06-14 23:20:30.830247] I loaded 64% model tensors into buffer
[2025-06-14 23:20:30.830263] I loaded 65% model tensors into buffer
[2025-06-14 23:20:30.830271] I loaded 66% model tensors into buffer
[2025-06-14 23:20:30.830287] I loaded 67% model tensors into buffer
[2025-06-14 23:20:30.830302] I loaded 68% model tensors into buffer
[2025-06-14 23:20:30.830316] I loaded 69% model tensors into buffer
[2025-06-14 23:20:30.830335] I loaded 70% model tensors into buffer
[2025-06-14 23:20:30.830355] I loaded 71% model tensors into buffer
[2025-06-14 23:20:30.830368] I loaded 72% model tensors into buffer
[2025-06-14 23:20:30.830385] I loaded 73% model tensors into buffer
[2025-06-14 23:20:30.830410] I loaded 74% model tensors into buffer
[2025-06-14 23:20:30.830426] I loaded 75% model tensors into buffer
[2025-06-14 23:20:30.830448] I loaded 76% model tensors into buffer
[2025-06-14 23:20:30.830464] I loaded 77% model tensors into buffer
[2025-06-14 23:20:30.830472] I loaded 78% model tensors into buffer
[2025-06-14 23:20:30.830487] I loaded 79% model tensors into buffer
[2025-06-14 23:20:30.830499] I loaded 80% model tensors into buffer
[2025-06-14 23:20:30.830520] I loaded 81% model tensors into buffer
[2025-06-14 23:20:30.830539] I loaded 82% model tensors into buffer
[2025-06-14 23:20:30.830551] I loaded 83% model tensors into buffer
[2025-06-14 23:20:30.830567] I loaded 84% model tensors into buffer
[2025-06-14 23:20:30.830582] I loaded 85% model tensors into buffer
[2025-06-14 23:20:30.830597] I loaded 86% model tensors into buffer
[2025-06-14 23:20:30.830615] I loaded 87% model tensors into buffer
[2025-06-14 23:20:30.830634] I loaded 88% model tensors into buffer
[2025-06-14 23:20:30.830641] I loaded 89% model tensors into buffer
[2025-06-14 23:20:30.830661] I loaded 90% model tensors into buffer
[2025-06-14 23:20:30.830682] I loaded 91% model tensors into buffer
[2025-06-14 23:20:30.830699] I loaded 92% model tensors into buffer
[2025-06-14 23:20:30.830710] I loaded 93% model tensors into buffer
[2025-06-14 23:20:30.830730] I loaded 94% model tensors into buffer
[2025-06-14 23:20:30.830755] I loaded 95% model tensors into buffer
[2025-06-14 23:20:30.830774] I loaded 96% model tensors into buffer
[2025-06-14 23:20:30.830790] I loaded 97% model tensors into buffer
[2025-06-14 23:20:30.830807] I loaded 98% model tensors into buffer
[2025-06-14 23:20:30.830817] I loaded 99% model tensors into buffer
[2025-06-14 23:20:30.830828] I loaded 100% model tensors into buffer
[2025-06-14 23:20:30.830830] I llama_context: constructing llama_context
[2025-06-14 23:20:30.830830] I llama_context: n_seq_max = 4
[2025-06-14 23:20:30.830830] I llama_context: n_ctx = 32768
[2025-06-14 23:20:30.830830] I llama_context: n_batch = 2048
[2025-06-14 23:20:30.830830] I llama_context: n_ubatch = 512
[2025-06-14 23:20:30.830830] I llama_context: causal_attn = 1
[2025-06-14 23:20:30.830830] I llama_context: flash_attn = 0
[2025-06-14 23:20:30.830830] I llama_context: freq_base = 1000000.0
[2025-06-14 23:20:30.830830] I llama_context: freq_scale = 0.25
[2025-06-14 23:20:30.830830] W llama_context: n_ctx (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[2025-06-14 23:20:30.830830] W llama_context: requested n_seq_max (4) > 1, but swa_full is not enabled -- performance may be degraded: https://github.com/ggml-org/llama.cpp/pull/13845#issuecomment-2924800573
[2025-06-14 23:20:30.830830] I llama_context: CUDA_Host output buffer size = 0.06 MiB
[2025-06-14 23:20:31.831760] I llama_kv_cache_unified: CUDA0 KV buffer size = 4608.00 MiB
[2025-06-14 23:20:31.831837] I llama_kv_cache_unified: size = 4608.00 MiB ( 32768 cells, 36 layers, 4 seqs), K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
[2025-06-14 23:20:32.832272] I llama_context: CUDA0 compute buffer size = 2080.00 MiB
[2025-06-14 23:20:32.832272] I llama_context: CUDA_Host compute buffer size = 64.01 MiB
[2025-06-14 23:20:32.832272] I llama_context: graph nodes = 1446
[2025-06-14 23:20:32.832272] I llama_context: graph splits = 1
[2025-06-14 23:20:32.832272] I common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
[2025-06-14 23:20:32.832277] I srv load: prompt caching enabled
[2025-06-14 23:20:32.832277] I srv load: context shifting enabled
[2025-06-14 23:20:32.832337] I srv load: chat template, alias: deepseek3, built-in: true, jinja rendering: disabled, tool call: supported, reasoning: supported, example:
You are a helpful assistant.

You CAN call tools to assist with the user query. Do not make assumptions about what values to plug into tools.

You are provided with following tools:

  • get_weather
{
  "name": "get_weather",
  "description": "",
  "parameters": {"type":"object","properties":{"location":{"type":"string"}}}
}
  • get_temperature
{
  "name": "get_temperature",
  "description": "Return the temperature according to the location.",
  "parameters": {"type":"object","properties":{"location":{"type":"string"}}}
}

For each tool call, just generate an answer, no explanation before or after your answer, MUST return as below:
<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>The name of the tool

The input of the tool, must be an JSON object in compact format
```<|tool▁call▁end|><|tool▁calls▁end|><|end▁of▁sentence|>

<|User|>Hello.<|Assistant|>Hi! How can I help you today?<|end▁of▁sentence|><|User|>What is the weather like in Beijing?<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>get_weather
```json
{"location":"Beijing"}
```<|tool▁call▁end|><|tool▁calls▁end|><|end▁of▁sentence|><|tool▁outputs▁begin|><|tool▁output▁begin|>{"weather":"Sunny"}<|tool▁output▁end|><|tool▁outputs▁end|>The weather is Sunny.<|end▁of▁sentence|><|User|>What is the temperature in Beijing?<|Assistant|>
[2025-06-14 23:20:32.832337] I srv                      load: tool call trigger, start: words, end: words
[2025-06-14 23:20:32.832337] I srv                     start: starting
[2025-06-14 23:20:32.832338] I srv                     start: listening host = 0.0.0.0, port = 40038
[2025-06-14 23:20:32.832338] I srv                     start: server is ready
[2025-06-14 23:22:34.954811] I srv                   request: rid 526546878654 |  GET /v1/models 127.0.0.1:57106
[2025-06-14 23:22:34.954811] I srv                  response: rid 526546878654 |  GET /v1/models 127.0.0.1:57106 | status 200 | cost 0.43ms | opened
[2025-06-14 23:22:38.958788] I srv                   request: rid 526550856108 |  GET /v1/models 127.0.0.1:42040
[2025-06-14 23:22:38.958789] I srv                  response: rid 526550856108 |  GET /v1/models 127.0.0.1:42040 | status 200 | cost 0.31ms | opened
[2025-06-14 23:22:43.963405] I srv                   request: rid 526555473395 |  GET /v1/models 127.0.0.1:42054
[2025-06-14 23:22:43.963406] I srv                  response: rid 526555473395 |  GET /v1/models 127.0.0.1:42054 | status 200 | cost 0.51ms | opened
[2025-06-14 23:22:57.977901] I srv                   request: rid 526569969169 |  GET /v1/models 127.0.0.1:47642
[2025-06-14 23:22:57.977902] I srv                  response: rid 526569969169 |  GET /v1/models 127.0.0.1:47642 | status 200 | cost 0.39ms | opened
[2025-06-14 23:22:57.977906] I srv                   request: rid 526569973633 | POST /v1/chat/completions 127.0.0.1:47656
/home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
[2025-06-14 23:22:58.978137] E ggml_cuda_compute_forward: GET_ROWS failed
[2025-06-14 23:22:58.978137] E CUDA error: no kernel image is available for execution on the device
[2025-06-14 23:22:58.978137] E   current device: 0, in function ggml_cuda_compute_forward at /home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2366
[2025-06-14 23:22:58.978137] E   err
[New LWP 1432801]
[New LWP 1432807]
[New LWP 1432808]
[New LWP 1432865]
[New LWP 1432866]
[New LWP 1432867]
[New LWP 1432868]
[New LWP 1432869]
[New LWP 1432870]
[New LWP 1432871]
[New LWP 1432872]
[New LWP 1432873]
[New LWP 1432874]
[New LWP 1432875]
[New LWP 1432876]
[New LWP 1432877]
[New LWP 1432878]
[New LWP 1432879]
[New LWP 1432880]
[New LWP 1432881]
[New LWP 1432882]
[New LWP 1432883]
[New LWP 1432884]
[New LWP 1432885]
[New LWP 1432886]
[New LWP 1432887]
[New LWP 1432888]
[New LWP 1432889]
[New LWP 1432890]
[New LWP 1432891]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffff72c3be38 in __GI___poll (fds=0xffffc1beea88, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
#0  0x0000ffff72c3be38 in __GI___poll (fds=0xffffc1beea88, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
41      in ../sysdeps/unix/sysv/linux/poll.c
#1  0x00000000004a0e78 in httplib::Server::listen_internal() ()
#2  0x000000000054215c in httpserver::start() ()
#3  0x000000000040eb20 in main ()
[Inferior 1 (process 1432799) detached]
Aborted
-----------------------------------------------------------------------------------------------
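For context, "no kernel image is available for execution on the device" is the CUDA runtime error `cudaErrorNoKernelImageForDevice`, reported when none of the kernel images embedded in the binary can run on the current device. The same error class can be reproduced with a minimal standalone sketch (unrelated to llama-box; file name and build line are illustrative) by compiling only for an architecture the Orin cannot execute:

```cuda
// no_kernel_image.cu -- sketch of how the error in the log above arises.
// Build for a mismatched architecture to reproduce, e.g.:
//   nvcc -gencode arch=compute_90,code=sm_90 no_kernel_image.cu -o demo
// Running the result on an sm_87 device (AGX Orin) reports
// cudaErrorNoKernelImageForDevice, the same error class as in the log.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();                      // launch a trivial kernel
    cudaError_t err = cudaGetLastError();  // pick up the launch error, if any
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceSynchronize();
    printf("kernel ran\n");
    return 0;
}
```

Conversely, a build whose CUDA architecture list includes 87 (for CMake-based llama.cpp builds this is typically controlled via CMAKE_CUDA_ARCHITECTURES) would carry a kernel image the Orin can execute, which is presumably what is missing from the shipped llama-box binary given the ARCHS list in the log.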
