
Conversation

slaren (Member) commented Sep 5, 2024

There were several issues with KV defragmentation when using a quantized KV cache:

  • Defragmentation requires ggml_cpy from a quantized type to the same quantized type, which was not supported in the CUDA backend
  • ggml_backend_sched cannot fall back to the CPU backend when the destination tensor is pre-allocated, and this case was not correctly detected
  • Attempting the fallback anyway caused a buffer overflow in the graph leafs, which resulted in a crash

This fixes the detection in ggml_backend_sched and adds support to the CUDA backend for ggml_cpy when the source and destination types are the same and both tensors are contiguous (using cudaMemcpyAsync).
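The condition that gates the new fast path can be sketched as follows. This is a simplified, hypothetical stand-in (the struct, field names, and `can_memcpy` helper are illustrative, not ggml's actual API): a raw byte copy such as cudaMemcpyAsync is only valid when both tensors share the same type and are contiguous; otherwise a type-aware copy kernel is required.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for ggml tensor metadata. */
enum tensor_type { TYPE_F32, TYPE_F16, TYPE_Q4_0, TYPE_Q8_0 };

struct tensor {
    enum tensor_type type; /* element/block type */
    bool contiguous;       /* laid out as one dense block of memory */
    size_t nbytes;         /* total size in bytes */
};

/* A plain byte copy (memcpy / cudaMemcpyAsync) is only correct when the
 * types match, both tensors are contiguous, and the sizes agree. For a
 * quant-to-quant copy of the same type this holds, which is what makes
 * the defragmentation copies possible without a dequantize/requantize. */
static bool can_memcpy(const struct tensor *src, const struct tensor *dst) {
    return src->type == dst->type
        && src->contiguous
        && dst->contiguous
        && src->nbytes == dst->nbytes;
}
```

In the CUDA backend the copy itself would then be a single cudaMemcpyAsync on the device buffers; when the check fails, a per-type copy kernel (or a CPU fallback, where one is allowed) has to be used instead.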

Other backends may also be affected.

Fixes #9314

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 5, 2024
@slaren force-pushed the sl/fix-cuda-defrag branch from 290a6e5 to e462919 on September 5, 2024 01:50
@slaren merged commit 4db0478 into master Sep 5, 2024
@slaren deleted the sl/fix-cuda-defrag branch September 5, 2024 09:13
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 5, 2024
cuda : fix defrag with quantized KV (ggml-org#9319)
MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Sep 11, 2024
* Important: this guards an assert in ggml-backend.c introduced in ggml-org/llama.cpp#9319, be aware
* Merged recent Seed commit
* Added a small .txt guide on code that needs to be added to make clblast work on current llama.cpp
* minor display styling
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Feb 25, 2025

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)


Development

Successfully merging this pull request may close these issues.

Bug: llama-server crash when defragmenting (llama_kv_cache_defrag_internal)

2 participants