Description
I encountered a CUDA error while running a script that uses the Llama model. The error message is "CUDA error 801 at ggml-cuda.cu:6799: operation not supported", reported on device 0.
Code Snippet:
from llama_cpp import Llama

def question(message):
    # LLM setup: load the GGUF model and offload 32 layers to the GPU
    llm = Llama(model_path="./japanese-stablelm-instruct-gamma-7b-q8_0.gguf",
                n_gpu_layers=32)
    # Build the prompt from the user message (the original script defines
    # `prompt` elsewhere; this instruction-style template is an assumption
    # based on the stop tokens below)
    prompt = f"指示: {message}\n応答: "
    # Run inference
    output = llm(
        prompt,
        temperature=1,
        top_p=0.95,
        stop=["指示:", "入力:", "応答:"],
        echo=False,
        max_tokens=1024
    )
    return output
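For reference, the function is invoked roughly as below. This is a minimal sketch; the input text and the way the result is unpacked are assumptions, since only the snippet above comes from the original script (llama-cpp-python returns a completion dict whose generated text is in choices[0]["text"]).

# Hypothetical usage -- the real calling code is not shown above
result = question("日本で一番高い山は何ですか？")  # "What is the highest mountain in Japan?"
# Extract the generated text from the completion dict
print(result["choices"][0]["text"])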
Error Message:
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 132.92 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.83 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size = 64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7342.83 MB (model: 7205.83 MB, context: 137.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
CUDA error 801 at ggml-cuda.cu:6799: operation not supported
current device: 0
Environment:
NVIDIA-SMI 545.23.06
Driver Version: 545.23.06
CUDA Version: 12.3
GPU: NVIDIA Quadro M4000 (8 GB)
Any help in resolving this issue would be greatly appreciated.