Avoid unnecessarily disabling CUDA graphs #119

Nexesenex · 2024-05-15T13:27:17Z

As discussed in PR ggml-org#6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive update counter from incrementing unnecessarily for tokens in which CUDA graphs are disabled due to batch size > 1.
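The counter behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `GraphState`, `step`, the field names, and the threshold of 4 are all hypothetical, and the sketch assumes that steps with batch size > 1 simply leave the counter untouched.

```cpp
// Hypothetical bookkeeping for CUDA graph reuse; names are illustrative only.
struct GraphState {
    int  consecutive_updates = 0;            // graph updates in a row
    bool disabled_too_many_updates = false;  // permanent opt-out flag
};

// Assumed threshold; the real value is not stated in this PR description.
constexpr int kMaxConsecutiveUpdates = 4;

// Returns true if a CUDA graph would be used for this evaluation step.
bool step(GraphState & st, int batch_size, bool update_required) {
    if (st.disabled_too_many_updates) {
        return false;
    }
    // Prompt processing (batch size > 1) does not use CUDA graphs here.
    // Such steps must not touch the counter, otherwise a long prompt
    // would push it past the threshold and disable graphs for good.
    if (batch_size > 1) {
        return false;
    }
    if (update_required) {
        st.consecutive_updates++;
    } else {
        st.consecutive_updates = 0;
    }
    if (st.consecutive_updates >= kMaxConsecutiveUpdates) {
        st.disabled_too_many_updates = true;
        return false;
    }
    return true;
}
```

With this shape, any number of prompt batches leaves `consecutive_updates` at its previous value, so graphs remain available once single-token generation starts.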

* Adding iq4_0_r4 - q4_0 repacked

  We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU, so ~5-6% faster than iq4_nl_x4.

* q4_0_r4: NEON

  Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>

Nexesenex merged commit 8d3619c into Nexesenex:sidestream May 15, 2024
