
Conversation

martindevans
Member

This is based on discussions in #1099 (comment) and includes work by @martindevans and @m0nsky.

In llama.cpp, sampling with a grammar is optimised by:

  1. Running the sampling pipeline without the grammar
  2. Checking if the chosen token is acceptable to the grammar
  3. If not, running the entire pipeline again with the grammar first

This is an optimisation because the grammar's cost scales with the number of tokens it has to check.
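In other words, the grammar only has to do the expensive scan over the candidate list when the unconstrained sample turns out to be invalid. A rough sketch of the idea (the interface and helper names here are hypothetical, not the actual LLamaSharp/llama.cpp internals):

```csharp
using System;

// Illustrative sketch only; IGrammar and runPipeline stand in for the real grammar
// sampler and the rest of the sampling chain.
public interface IGrammar
{
    bool Accepts(int token);        // would the grammar allow this token next?
    void Apply(float[] logits);     // mask out tokens the grammar forbids
}

public static class GrammarOptimisedSampling
{
    public static int Sample(float[] logits, IGrammar grammar, Func<float[], int> runPipeline)
    {
        // 1. Run the normal sampling pipeline without the grammar (cheap path).
        var token = runPipeline(logits);

        // 2. If the grammar accepts the chosen token, skip the expensive grammar pass entirely.
        if (grammar.Accepts(token))
            return token;

        // 3. Slow path: constrain the candidates with the grammar, then run the whole pipeline again.
        grammar.Apply(logits);
        return runPipeline(logits);
    }
}
```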

This optimisation has been built into the DefaultSamplingPipeline, controlled with a new GrammarOptimization property.
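Usage looks roughly like this (a minimal sketch; the property and enum names follow this PR and the grammar setup is an assumption, so check the merged API before copying):

```csharp
using LLama.Sampling;

// 'gbnfText' is assumed to hold a GBNF grammar string with "root" as the start rule.
var pipeline = new DefaultSamplingPipeline
{
    Grammar = new Grammar(gbnfText, "root"),
    GrammarOptimization = GrammarOptimizationMode.Extended, // None / Basic / Extended
};
```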

@martindevans
Member Author

The default value for the optimisation mode is currently None. This means 99% of users will probably not use this feature! Should it be set to Extended by default?

@m0nsky
Contributor

m0nsky commented Feb 23, 2025

Below are some benchmark results for the different grammar optimization modes. The benchmark uses a complex grammar, 8 runs, and ignores the first run (warmup).

Temperature = 0 (greedy), verified that all modes converge to the same result:

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 116724 | 16674 | 49.15 | 0.00% |
| Basic | 101867 | 14552 | 56.38 | +14.71% |
| Extended | 102261 | 14608 | 56.14 | +14.22% |

Temperature = 0.7
TopK = 40

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 118295 | 16899 | 49.33 | 0.00% |
| Basic | 103682 | 14811 | 51.53 | +4.45% |
| Extended | 107862 | 15408 | 54.76 | +11.01% |

I expect the performance gains to be bigger on "simpler" grammars, or when the LLM is fine-tuned on the grammar, because there will be a higher chance that it correctly guesses which token it is allowed to generate next.

As for the default mode: None will theoretically generate the best possible output, but for that use case greedy sampling is the best option anyway, and with greedy sampling all modes converge to the same result. If there are no issues, then I think Extended will be the best default option.

@phil-scott-78
Contributor

Here are the results from my rather naive tests (I'm not currently doing a warmup; single run). The "Failed" tests are just cases where the model didn't return the proper value, which for some of these models is to be expected. The test was originally meant to help me dial in prompt generation, so some fail because of a crummy prompt.

But at a glance, I'm seeing some pretty decent gains across the board!

| Model | None | Basic | Extended |
|---|---|---|---|
| Qwen2.5-0.5B-Instruct-Q3_K_L | ParseComplexRestaurantOrder: Failed 6522ms<br>ParseNameAndEmail: Ok 755ms<br>ParseOrderStatus: Ok 1372ms<br>ParseNestedAddress: Ok 2586ms<br>ParseBookCollection: Ok 3363ms<br>ParseTags: Ok 1112ms<br>ParseAndInferColor: Failed 1353ms | ParseComplexRestaurantOrder: Failed 1776ms<br>ParseNameAndEmail: Ok 176ms<br>ParseOrderStatus: Failed 313ms<br>ParseNestedAddress: Ok 477ms<br>ParseBookCollection: Ok 450ms<br>ParseTags: Ok 184ms<br>ParseAndInferColor: Failed 184ms | ParseComplexRestaurantOrder: Failed 1737ms<br>ParseNameAndEmail: Ok 183ms<br>ParseOrderStatus: Failed 312ms<br>ParseNestedAddress: Ok 474ms<br>ParseBookCollection: Ok 444ms<br>ParseTags: Ok 168ms<br>ParseAndInferColor: Failed 180ms |
| Qwen2.5-14B-Instruct-Q6_K_L | ParseComplexRestaurantOrder: Ok 12592ms<br>ParseNameAndEmail: Ok 1301ms<br>ParseOrderStatus: Ok 2445ms<br>ParseNestedAddress: Ok 4332ms<br>ParseBookCollection: Ok 4982ms<br>ParseTags: Ok 1312ms<br>ParseAndInferColor: Ok 1569ms | ParseComplexRestaurantOrder: Ok 7631ms<br>ParseNameAndEmail: Ok 768ms<br>ParseOrderStatus: Ok 1487ms<br>ParseNestedAddress: Ok 2515ms<br>ParseBookCollection: Ok 2664ms<br>ParseTags: Ok 758ms<br>ParseAndInferColor: Ok 908ms | ParseComplexRestaurantOrder: Ok 7528ms<br>ParseNameAndEmail: Ok 784ms<br>ParseOrderStatus: Ok 1485ms<br>ParseNestedAddress: Ok 2490ms<br>ParseBookCollection: Ok 2647ms<br>ParseTags: Ok 760ms<br>ParseAndInferColor: Ok 920ms |
| Qwen2.5-Coder-3B-Instruct-Q8_0 | ParseComplexRestaurantOrder: Ok 7834ms<br>ParseNameAndEmail: Ok 813ms<br>ParseOrderStatus: Ok 1531ms<br>ParseNestedAddress: Ok 2691ms<br>ParseBookCollection: Ok 2666ms<br>ParseTags: Ok 838ms<br>ParseAndInferColor: Ok 1021ms | ParseComplexRestaurantOrder: Ok 3084ms<br>ParseNameAndEmail: Ok 320ms<br>ParseOrderStatus: Ok 627ms<br>ParseNestedAddress: Ok 1018ms<br>ParseBookCollection: Ok 825ms<br>ParseTags: Ok 311ms<br>ParseAndInferColor: Ok 383ms | ParseComplexRestaurantOrder: Ok 3015ms<br>ParseNameAndEmail: Ok 316ms<br>ParseOrderStatus: Ok 623ms<br>ParseNestedAddress: Ok 1023ms<br>ParseBookCollection: Ok 834ms<br>ParseTags: Ok 316ms<br>ParseAndInferColor: Ok 385ms |
| Qwen2.5.1-Coder-1.5B-Instruct-Q8_0 | ParseComplexRestaurantOrder: Ok 8223ms<br>ParseNameAndEmail: Ok 873ms<br>ParseOrderStatus: Ok 1590ms<br>ParseNestedAddress: Ok 2827ms<br>ParseBookCollection: Ok 2578ms<br>ParseTags: Ok 862ms<br>ParseAndInferColor: Ok 1013ms | ParseComplexRestaurantOrder: Failed 2111ms<br>ParseNameAndEmail: Ok 220ms<br>ParseOrderStatus: Ok 416ms<br>ParseNestedAddress: Ok 687ms<br>ParseBookCollection: Ok 792ms<br>ParseTags: Ok 221ms<br>ParseAndInferColor: Ok 260ms | ParseComplexRestaurantOrder: Failed 2117ms<br>ParseNameAndEmail: Ok 224ms<br>ParseOrderStatus: Ok 416ms<br>ParseNestedAddress: Ok 685ms<br>ParseBookCollection: Ok 780ms<br>ParseTags: Ok 224ms<br>ParseAndInferColor: Ok 258ms |
| Qwen2.5.1-Coder-7B-Instruct-Q6_K_L | ParseComplexRestaurantOrder: Ok 8828ms<br>ParseNameAndEmail: Ok 905ms<br>ParseOrderStatus: Ok 1685ms<br>ParseNestedAddress: Ok 3027ms<br>ParseBookCollection: Ok 3509ms<br>ParseTags: Ok 924ms<br>ParseAndInferColor: Ok 1099ms | ParseComplexRestaurantOrder: Ok 4064ms<br>ParseNameAndEmail: Ok 431ms<br>ParseOrderStatus: Ok 844ms<br>ParseNestedAddress: Ok 1407ms<br>ParseBookCollection: Ok 1259ms<br>ParseTags: Ok 421ms<br>ParseAndInferColor: Ok 517ms | ParseComplexRestaurantOrder: Ok 4153ms<br>ParseNameAndEmail: Ok 430ms<br>ParseOrderStatus: Ok 846ms<br>ParseNestedAddress: Ok 1402ms<br>ParseBookCollection: Ok 1255ms<br>ParseTags: Ok 421ms<br>ParseAndInferColor: Ok 519ms |

@phil-scott-78
Contributor

Did a quick test against a DeepSeek model where the difference was even more pronounced. The GBNF allows a <think> tag containing any characters, followed by the JSON.
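For context, the grammar has roughly this shape (a hypothetical sketch written as a C# string, not the actual GBNF from the test; the `json` rule here is just a stand-in):

```csharp
// Hypothetical GBNF of the same general shape described above.
const string ThinkThenJsonGbnf = """
    root ::= "<think>" [^<]* "</think>" json
    json ::= "{" [^}]* "}"   # stand-in; the real grammar spells out full JSON
    """;
```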

None
=====
DeepSeek-R1-Distill-Qwen-7B-Q6_K_L
  ParseComplexRestaurantOrder: Failed 33773ms
  ParseNameAndEmail: Ok 16863ms
  ParseOrderStatus: Ok 18968ms
  ParseNestedAddress: Ok 19952ms
  ParseBookCollection: Ok 32257ms
  ParseTags: Ok 9200ms
  ParseAndInferColor: Ok 14856ms


Extended
========
DeepSeek-R1-Distill-Qwen-7B-Q6_K_L
  ParseComplexRestaurantOrder: Failed 13082ms
  ParseNameAndEmail: Ok 6188ms
  ParseOrderStatus: Ok 6871ms
  ParseNestedAddress: Ok 7361ms
  ParseBookCollection: Ok 8503ms
  ParseTags: Ok 3659ms
  ParseAndInferColor: Ok 5916ms

@phil-scott-78
Contributor

I've run this through its paces on 30+ models with a variety of GBNF grammars and different executors without a hiccup. Perf still lags a bit behind the command line, it seems, but it's still a huge improvement.

@m0nsky
Contributor

m0nsky commented Feb 24, 2025

I've also done some additional testing to make sure Extended is really the best choice as the default mode. My previous benchmark wasn't deterministic when TopK > 1.

Fixed Grammar Benchmark (16 different short stories in fixed order)

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 65455 | 4363 | 23.39 | 0.00% |
| Basic | 57497 | 3833 | 26.53 | +13.90% |
| Extended | 56800 | 3786 | 26.80 | +15.20% |

Fixed Grammar Benchmark (8 repeated lorem ipsum sample texts)

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 40682 | 13560 | 63.74 | 0.00% |
| Basic | 34544 | 11514 | 77.70 | +15.09% |
| Extended | 32626 | 10875 | 81.43 | +19.78% |

@phil-scott-78
Contributor

phil-scott-78 commented Feb 25, 2025

OK, I was on a mission to break things. I've found that if I push larger models pretty hard on my hardware (4080 Super), Extended will periodically break. I do not see this in Basic or None.

This happens consistently with Mistral-Small-Instruct-2409-Q5_K_S which is just big enough to not fit into my GPU memory.

System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at LLama.Native.SafeLLamaSamplerChainHandle.<Apply>g__llama_sampler_apply|3_0(SafeLLamaSamplerChainHandle smpl, LLamaTokenDataArrayNative& cur_p)
   at LLama.Native.SafeLLamaSamplerChainHandle.Apply(LLamaTokenDataArrayNative& candidates) in B:\llama-src\LLamaSharp\LLama\Native\SafeLLamaSamplerHandle.cs:line 42
   at LLama.Sampling.DefaultSamplingPipeline.Sample(SafeLLamaContextHandle ctx, Int32 index) in B:\llama-src\LLamaSharp\LLama\Sampling\DefaultSamplingPipeline.cs:line 271
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext() in B:\llama-src\LLamaSharp\LLama\LLamaStatelessExecutor.cs:line 119
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()

This line is

_grammarChain.Apply(ref nativeTopK);

BUT, to make things more obnoxious, this only happens when I'm running things in a big batch, loading and unloading other models. Running it by itself seems fine. There is nothing in the logs either. Just a crash.

Whether it is an actual issue with Extended mode, or just a combination of things that makes that specific model go kablooey, is the question.....

@martindevans
Member Author

@phil-scott-78 I've just pushed up a potential fix. Due to the way memory pooling works, we were always passing some invalid data at the end of the array. To be honest I don't really see why this bug would have caused a crash, but I hope it was the issue! Would you mind testing again to see if you can still reproduce it?
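For anyone following along, this is the classic pooling pitfall: a rented array can be longer than requested, so the tail holds stale junk unless you slice to the valid length. An illustrative sketch of the pattern (not the actual LLamaSharp code):

```csharp
using System;
using System.Buffers;

int count = 5;
float[] buffer = ArrayPool<float>.Shared.Rent(count);   // may actually be e.g. 16 elements long
try
{
    for (var i = 0; i < count; i++)
        buffer[i] = i;                                   // only the first 'count' slots are valid

    // Bug pattern: handing the consumer the whole array includes junk past index count - 1.
    // Fix pattern: hand over only the valid region.
    ReadOnlySpan<float> valid = buffer.AsSpan(0, count);
    Console.WriteLine(valid.Length);                     // 5, regardless of buffer.Length
}
finally
{
    ArrayPool<float>.Shared.Return(buffer);
}
```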

@phil-scott-78
Contributor

Gave it a few test runs under some different parameters and they are all running successfully. Not gonna lie, I was embarrassed about the quality of that bug report - I'm impressed you figured it out from that lol. Nicely done.

@martindevans
Member Author

Thanks for testing that! I think this is now ready to merge?

@m0nsky
Contributor

m0nsky commented Feb 26, 2025

Yes, I think it's ready now; everything still looks good after the fix on my end. Thanks for the extensive testing @phil-scott-78, it is really appreciated.

@martindevans martindevans merged commit e2cfed6 into SciSharp:master Feb 26, 2025
6 checks passed
@martindevans martindevans deleted the grammar_resampling branch February 26, 2025 15:01