Grammar Resampling #1109

Conversation
Grammar optimization modes
The default value for the optimisation mode is currently …
Below are some benchmark results for the different grammar optimization modes. The benchmark uses a complex grammar, 8 runs, and ignores the first run (warmup).
I expect the performance gains to be bigger on "simpler" grammars, or when the LLM is fine-tuned on the grammar, because there will be a higher chance that it correctly guesses which token it is allowed to generate next. As for the default mode, …
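For reference, the three modes discussed later in this thread (None, Basic, and Extended) could be declared roughly as below. This is a sketch for orientation only: the property name `GrammarOptimization` comes from the PR description, but the enum type name and the behaviour described in the comments are assumptions, not the actual implementation.

```csharp
// Hypothetical sketch; GrammarOptimizationMode is an assumed type name.
public enum GrammarOptimizationMode
{
    /// <summary>Apply the grammar to every candidate token on every step (no optimization).</summary>
    None,

    /// <summary>Sample without the grammar first, and only fall back to full grammar
    /// evaluation when the sampled token is rejected (assumed behaviour).</summary>
    Basic,

    /// <summary>Like Basic, but with extra early-out checks before paying for the full
    /// grammar evaluation (assumed behaviour).</summary>
    Extended,
}
```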
Here are the results from my rather naive tests (I'm not currently doing a warmup; single run). The "failed" tests are just it not returning the proper value, which for some of these models is to be expected. The test was originally to help me dial in prompt generation, so there are some that fail because of a crummy prompt. But at a glance, I'm seeing some pretty decent gains across the board!
Did a quick test against a DeepSeek model where the difference was even more pronounced. The GBNF allows for a …
I've run this through its paces on 30+ models with a variety of GBNF grammars on different executors without a hiccup. Perf still lags a bit behind the command line, it seems, but it's still a huge improvement.
I've also done some additional testing to make sure …

Fixed Grammar Benchmark (16 different short stories in fixed order)

Fixed Grammar Benchmark (8 repeated lorem ipsum sample texts)
OK, I was on a mission to break things. I've found that if I push larger models pretty hard on my hardware (4080 Super), Extended will periodically break. I do not see this in Basic or None. This happens consistently with Mistral-Small-Instruct-2409-Q5_K_S, which is just big enough to not fit into my GPU memory.

The line it breaks on is `_grammarChain.Apply(ref nativeTopK);`. BUT, to make things more obnoxious, this only happens when I'm running things in a big batch, loading and unloading other models. Running it by itself seems fine. There is nothing in the logs either, just a crash. Whether it is an actual issue with Extended mode, or just a combination of things such that it goes kablooey with that specific model, is the question...
…before this there was always some junk data at the end.
@phil-scott-78 I've just pushed up a potential fix. Due to the way memory pooling works, we were always passing some invalid data at the end of the array. To be honest I don't really see why this bug would have caused a crash, but I hope it was the issue! Would you mind testing again to see if you can still reproduce it?
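For anyone unfamiliar with the pitfall described above: a pooled array can be larger than the size requested and can contain stale data from a previous use, so passing the whole array instead of a slice exposes junk past the logical end. A minimal illustration of the pattern (the names here are illustrative, not the actual LLamaSharp code):

```csharp
using System;
using System.Buffers;

class PooledBufferPitfall
{
    static void Main()
    {
        // ArrayPool<T>.Shared.Rent returns an array that is *at least* the
        // requested size, and it may contain stale data from a previous use.
        float[] buffer = ArrayPool<float>.Shared.Rent(100);
        int logicalLength = 100;

        try
        {
            // BUG: buffer.Length can be > 100, so anything consuming the whole
            // array sees junk data past the logical end.

            // FIX: only hand out the portion that was actually written.
            Span<float> valid = buffer.AsSpan(0, logicalLength);
            Console.WriteLine($"Rented {buffer.Length} elements for a request of {logicalLength}.");
            Console.WriteLine($"Valid slice length: {valid.Length}");
        }
        finally
        {
            ArrayPool<float>.Shared.Return(buffer);
        }
    }
}
```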
Gave it a few test runs under some different parameters and they are all running successfully. Not gonna lie, I was embarrassed about the quality of that bug report - I'm impressed you figured it out from that lol. Nicely done.
Thanks for testing that! I think this is now ready to merge?
Yes, I think it's ready now; everything still looks good after the fix on my end. Thanks for the extensive testing @phil-scott-78, it is really appreciated.
This is based on discussions in #1099 (comment) and includes work by @martindevans and @m0nsky.
In llama.cpp, sampling with a grammar is optimised by:

1. Sampling a token without applying the grammar.
2. Checking whether the sampled token is valid under the grammar.
3. Accepting it if it is valid; only when it is invalid, applying the grammar to all candidate tokens and resampling.

This is an optimisation because grammar costs scale with the number of tokens the grammar has to check.
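In code, that flow amounts to a fast path and a slow path. The sketch below is illustrative only; every name in it is hypothetical and merely stands in for whatever the real pipeline does.

```csharp
using System;

// Illustrative sketch of grammar resampling; all names are hypothetical.
public static class GrammarResamplingSketch
{
    public static int Sample(
        float[] logits,
        Func<float[], int> sampleToken,       // cheap sampler, grammar ignored
        Func<int, bool> grammarAccepts,       // checks a single token against the grammar
        Action<float[]> applyGrammarMask)     // masks every grammar-invalid token (expensive)
    {
        // 1. Fast path: sample without consulting the grammar at all.
        int candidate = sampleToken(logits);

        // 2. Validate just the one sampled token. If the grammar accepts it,
        //    we never had to evaluate the grammar over the whole vocabulary.
        if (grammarAccepts(candidate))
            return candidate;

        // 3. Slow path: the sampled token was invalid, so pay the full cost of
        //    masking every invalid candidate, then resample from what remains.
        applyGrammarMask(logits);
        return sampleToken(logits);
    }
}
```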
This optimisation has been built into the `DefaultSamplingPipeline`, controlled with a new `GrammarOptimization` property.
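A rough usage sketch, with the caveat that only `DefaultSamplingPipeline` and the `GrammarOptimization` property name come from this PR; the namespace, the enum type, and the chosen value are assumptions:

```csharp
using LLama.Sampling; // assumed namespace

// Sketch only: the enum type name and value are assumptions.
var pipeline = new DefaultSamplingPipeline
{
    // Pick one of the modes benchmarked above (None / Basic / Extended).
    GrammarOptimization = GrammarOptimizationMode.Extended,
};
// The grammar itself is configured however the library normally expects;
// the pipeline then applies the chosen optimization mode during sampling.
```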