Closed
Description
I'm working on a GBNF generation library, and the performance started bugging me. Once I started generating larger JSON outputs, the performance gap became significant.
For example, when I run this command against the llama.cpp CLI, I see total time = 3374.20 ms / 462 tokens:
.\llama-cli -m B:\models\Qwen2.5-Coder-3B-Instruct-Q8_0.gguf --grammar-file B:\llama-src\LLamaSharp\LLama.Examples\Assets\json.gbnf -ngl 48 -no-cnv --prompt "give me a list of all nfl teams in the afc. include their team, city and state. group by division in json format"
Running the grammar example with the same prompt and model, with the parameters tweaked to match the CLI (context size, GPU layers), takes about 16 s for the same prompt.
I'm not 100% sure whether it's related to grammar sampling or to the StatelessExecutor, though, and I'll sheepishly admit I'm just pulling levers here trying to track down the culprit. A bit of guidance toward the culprit and I can get after it.
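To help narrow it down, here's a rough sketch of the benchmark I'm describing: run the same prompt through the StatelessExecutor once with the grammar attached and once without, and compare timings. If only the grammar run is slow, the cost is in grammar sampling; if both are slow, the executor itself is the suspect. The model path, GPU layer count, and prompt mirror the CLI command above; the grammar wiring (`Grammar`, `DefaultSamplingPipeline.Grammar`) is an assumption based on current LLamaSharp examples and may differ between versions.

```csharp
using System.Diagnostics;
using LLama;
using LLama.Common;
using LLama.Sampling;

var parameters = new ModelParams(@"B:\models\Qwen2.5-Coder-3B-Instruct-Q8_0.gguf")
{
    ContextSize = 4096,   // tweaked to match the CLI run
    GpuLayerCount = 48
};
using var model = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(model, parameters);

var prompt = "give me a list of all nfl teams in the afc. include their team, city and state. group by division in json format";
var gbnf = File.ReadAllText(@"B:\llama-src\LLamaSharp\LLama.Examples\Assets\json.gbnf");

// Same prompt twice: with grammar-constrained sampling, then without.
foreach (var useGrammar in new[] { true, false })
{
    var pipeline = new DefaultSamplingPipeline();
    if (useGrammar)
        pipeline.Grammar = new Grammar(gbnf, "root"); // assumed API; version-dependent

    var inferenceParams = new InferenceParams
    {
        SamplingPipeline = pipeline,
        MaxTokens = 512
    };

    var sw = Stopwatch.StartNew();
    var chunks = 0;
    await foreach (var _ in executor.InferAsync(prompt, inferenceParams))
        chunks++;
    sw.Stop();

    Console.WriteLine($"grammar={useGrammar}: {sw.ElapsedMilliseconds} ms, {chunks} chunks");
}
```

If the grammar-enabled pass dominates the total, that would point at the grammar sampling path rather than the executor's prompt handling.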