
Conversation

martindevans
Member

This is based on discussions in #1099 (comment) and includes work by @martindevans and @m0nsky.

In llama.cpp, sampling with a grammar is optimised by:

  1. Running the sampling pipeline without the grammar
  2. Checking if the chosen token is acceptable to the grammar
  3. If not, running the entire pipeline again with the grammar first

This is an optimisation because the grammar's cost scales with the number of tokens it has to check.
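In other words, the grammar only has to do the expensive scan over the candidate list when the unconstrained sample turns out to be invalid. A rough sketch of the idea (the interface and helper names here are hypothetical, not the actual LLamaSharp/llama.cpp internals):

```csharp
using System;

// Illustrative sketch only; IGrammar and runPipeline stand in for the real grammar
// sampler and the rest of the sampling chain.
public interface IGrammar
{
    bool Accepts(int token);        // would the grammar allow this token next?
    void Apply(float[] logits);     // mask out tokens the grammar forbids
}

public static class GrammarOptimisedSampling
{
    public static int Sample(float[] logits, IGrammar grammar, Func<float[], int> runPipeline)
    {
        // 1. Run the normal sampling pipeline without the grammar (cheap path).
        var token = runPipeline(logits);

        // 2. If the grammar accepts the chosen token, skip the expensive grammar pass entirely.
        if (grammar.Accepts(token))
            return token;

        // 3. Slow path: constrain the candidates with the grammar, then run the whole pipeline again.
        grammar.Apply(logits);
        return runPipeline(logits);
    }
}
```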

This optimisation has been built into the DefaultSamplingPipeline, controlled with a new GrammarOptimization property.
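Usage looks roughly like this (a minimal sketch; the property and enum names follow this PR and the grammar setup is an assumption, so check the merged API before copying):

```csharp
using LLama.Sampling;

// 'gbnfText' is assumed to hold a GBNF grammar string with "root" as the start rule.
var pipeline = new DefaultSamplingPipeline
{
    Grammar = new Grammar(gbnfText, "root"),
    GrammarOptimization = GrammarOptimizationMode.Extended, // None / Basic / Extended
};
```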

@martindevans
Member Author

The default value for the optimisation mode is currently None. This means 99% of users will probably not use this feature! Should it be set to Extended by default?

@m0nsky
Contributor

m0nsky commented Feb 23, 2025

Below are some benchmark results for the different grammar optimization modes. The benchmark uses a complex grammar, 8 runs, and ignores the first run (warmup).

Temperature = 0 (greedy), verified that all modes converge to the same result:

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 116724 | 16674 | 49.15 | 0.00% |
| Basic | 101867 | 14552 | 56.38 | +14.71% |
| Extended | 102261 | 14608 | 56.14 | +14.22% |

Temperature = 0.7
TopK = 40

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 118295 | 16899 | 49.33 | 0.00% |
| Basic | 103682 | 14811 | 51.53 | +4.45% |
| Extended | 107862 | 15408 | 54.76 | +11.01% |

I expect the performance gains to be bigger on "simpler" grammars, or when the LLM is fine-tuned on the grammar, because there will be a higher chance that it correctly guesses which token it is allowed to generate next.

As for the default mode: None will theoretically generate the best possible output, but for that use case greedy sampling is the best option anyway, and with greedy sampling all modes converge to the same result. If there are no issues, then I think Extended will be the best default option.

@phil-scott-78
Contributor

Here are the results from my rather naive tests (I'm not currently doing a warmup; single run). The "Failed" tests are just cases where the model didn't return the proper value, which for some of these models is to be expected. The test was originally meant to help me dial in prompt generation, so some fail because of a crummy prompt.

But at a glance, I'm seeing some pretty decent gains across the board!

| Model | None | Basic | Extended |
|---|---|---|---|
| Qwen2.5-0.5B-Instruct-Q3_K_L | ParseComplexRestaurantOrder: Failed 6522ms<br>ParseNameAndEmail: Ok 755ms<br>ParseOrderStatus: Ok 1372ms<br>ParseNestedAddress: Ok 2586ms<br>ParseBookCollection: Ok 3363ms<br>ParseTags: Ok 1112ms<br>ParseAndInferColor: Failed 1353ms | ParseComplexRestaurantOrder: Failed 1776ms<br>ParseNameAndEmail: Ok 176ms<br>ParseOrderStatus: Failed 313ms<br>ParseNestedAddress: Ok 477ms<br>ParseBookCollection: Ok 450ms<br>ParseTags: Ok 184ms<br>ParseAndInferColor: Failed 184ms | ParseComplexRestaurantOrder: Failed 1737ms<br>ParseNameAndEmail: Ok 183ms<br>ParseOrderStatus: Failed 312ms<br>ParseNestedAddress: Ok 474ms<br>ParseBookCollection: Ok 444ms<br>ParseTags: Ok 168ms<br>ParseAndInferColor: Failed 180ms |
| Qwen2.5-14B-Instruct-Q6_K_L | ParseComplexRestaurantOrder: Ok 12592ms<br>ParseNameAndEmail: Ok 1301ms<br>ParseOrderStatus: Ok 2445ms<br>ParseNestedAddress: Ok 4332ms<br>ParseBookCollection: Ok 4982ms<br>ParseTags: Ok 1312ms<br>ParseAndInferColor: Ok 1569ms | ParseComplexRestaurantOrder: Ok 7631ms<br>ParseNameAndEmail: Ok 768ms<br>ParseOrderStatus: Ok 1487ms<br>ParseNestedAddress: Ok 2515ms<br>ParseBookCollection: Ok 2664ms<br>ParseTags: Ok 758ms<br>ParseAndInferColor: Ok 908ms | ParseComplexRestaurantOrder: Ok 7528ms<br>ParseNameAndEmail: Ok 784ms<br>ParseOrderStatus: Ok 1485ms<br>ParseNestedAddress: Ok 2490ms<br>ParseBookCollection: Ok 2647ms<br>ParseTags: Ok 760ms<br>ParseAndInferColor: Ok 920ms |
| Qwen2.5-Coder-3B-Instruct-Q8_0 | ParseComplexRestaurantOrder: Ok 7834ms<br>ParseNameAndEmail: Ok 813ms<br>ParseOrderStatus: Ok 1531ms<br>ParseNestedAddress: Ok 2691ms<br>ParseBookCollection: Ok 2666ms<br>ParseTags: Ok 838ms<br>ParseAndInferColor: Ok 1021ms | ParseComplexRestaurantOrder: Ok 3084ms<br>ParseNameAndEmail: Ok 320ms<br>ParseOrderStatus: Ok 627ms<br>ParseNestedAddress: Ok 1018ms<br>ParseBookCollection: Ok 825ms<br>ParseTags: Ok 311ms<br>ParseAndInferColor: Ok 383ms | ParseComplexRestaurantOrder: Ok 3015ms<br>ParseNameAndEmail: Ok 316ms<br>ParseOrderStatus: Ok 623ms<br>ParseNestedAddress: Ok 1023ms<br>ParseBookCollection: Ok 834ms<br>ParseTags: Ok 316ms<br>ParseAndInferColor: Ok 385ms |
| Qwen2.5.1-Coder-1.5B-Instruct-Q8_0 | ParseComplexRestaurantOrder: Ok 8223ms<br>ParseNameAndEmail: Ok 873ms<br>ParseOrderStatus: Ok 1590ms<br>ParseNestedAddress: Ok 2827ms<br>ParseBookCollection: Ok 2578ms<br>ParseTags: Ok 862ms<br>ParseAndInferColor: Ok 1013ms | ParseComplexRestaurantOrder: Failed 2111ms<br>ParseNameAndEmail: Ok 220ms<br>ParseOrderStatus: Ok 416ms<br>ParseNestedAddress: Ok 687ms<br>ParseBookCollection: Ok 792ms<br>ParseTags: Ok 221ms<br>ParseAndInferColor: Ok 260ms | ParseComplexRestaurantOrder: Failed 2117ms<br>ParseNameAndEmail: Ok 224ms<br>ParseOrderStatus: Ok 416ms<br>ParseNestedAddress: Ok 685ms<br>ParseBookCollection: Ok 780ms<br>ParseTags: Ok 224ms<br>ParseAndInferColor: Ok 258ms |
| Qwen2.5.1-Coder-7B-Instruct-Q6_K_L | ParseComplexRestaurantOrder: Ok 8828ms<br>ParseNameAndEmail: Ok 905ms<br>ParseOrderStatus: Ok 1685ms<br>ParseNestedAddress: Ok 3027ms<br>ParseBookCollection: Ok 3509ms<br>ParseTags: Ok 924ms<br>ParseAndInferColor: Ok 1099ms | ParseComplexRestaurantOrder: Ok 4064ms<br>ParseNameAndEmail: Ok 431ms<br>ParseOrderStatus: Ok 844ms<br>ParseNestedAddress: Ok 1407ms<br>ParseBookCollection: Ok 1259ms<br>ParseTags: Ok 421ms<br>ParseAndInferColor: Ok 517ms | ParseComplexRestaurantOrder: Ok 4153ms<br>ParseNameAndEmail: Ok 430ms<br>ParseOrderStatus: Ok 846ms<br>ParseNestedAddress: Ok 1402ms<br>ParseBookCollection: Ok 1255ms<br>ParseTags: Ok 421ms<br>ParseAndInferColor: Ok 519ms |

@phil-scott-78
Contributor

Did a quick test against a DeepSeek model where the difference was even more pronounced. The GBNF allows a <think> tag containing any characters, followed by the JSON.
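For context, the grammar has roughly this shape (a hypothetical sketch written as a C# string, not the actual GBNF from the test; the `json` rule here is just a stand-in):

```csharp
// Hypothetical GBNF of the same general shape described above.
const string ThinkThenJsonGbnf = """
    root ::= "<think>" [^<]* "</think>" json
    json ::= "{" [^}]* "}"   # stand-in; the real grammar spells out full JSON
    """;
```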

None
=====
DeepSeek-R1-Distill-Qwen-7B-Q6_K_L
  ParseComplexRestaurantOrder: Failed 33773ms
  ParseNameAndEmail: Ok 16863ms
  ParseOrderStatus: Ok 18968ms
  ParseNestedAddress: Ok 19952ms
  ParseBookCollection: Ok 32257ms
  ParseTags: Ok 9200ms
  ParseAndInferColor: Ok 14856ms


Extended
========
DeepSeek-R1-Distill-Qwen-7B-Q6_K_L
  ParseComplexRestaurantOrder: Failed 13082ms
  ParseNameAndEmail: Ok 6188ms
  ParseOrderStatus: Ok 6871ms
  ParseNestedAddress: Ok 7361ms
  ParseBookCollection: Ok 8503ms
  ParseTags: Ok 3659ms
  ParseAndInferColor: Ok 5916ms

@phil-scott-78
Contributor

I've run this through its paces on 30+ models with a variety of GBNF grammars and different executors without a hiccup. Perf still lags a bit behind the command line, it seems, but it's still a huge improvement.

@m0nsky
Contributor

m0nsky commented Feb 24, 2025

I've also done some additional testing to make sure Extended is really the best choice as the default mode. My previous benchmark wasn't deterministic when TopK > 1.

Fixed Grammar Benchmark (16 different short stories in fixed order)

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 65455 | 4363 | 23.39 | 0.00% |
| Basic | 57497 | 3833 | 26.53 | +13.90% |
| Extended | 56800 | 3786 | 26.80 | +15.20% |

Fixed Grammar Benchmark (8 repeated lorem ipsum sample texts)

| Mode | Total Benchmark Time (ms) | Average Benchmark Time (ms) | Average t/s | Performance Gain (%) |
|---|---|---|---|---|
| None | 40682 | 13560 | 63.74 | 0.00% |
| Basic | 34544 | 11514 | 77.70 | +15.09% |
| Extended | 32626 | 10875 | 81.43 | +19.78% |

@phil-scott-78
Contributor

phil-scott-78 commented Feb 25, 2025

OK, I was on a mission to break things. I've found that if I push larger models pretty hard on my hardware (4080 Super), Extended will periodically break. I do not see this in Basic or None.

This happens consistently with Mistral-Small-Instruct-2409-Q5_K_S which is just big enough to not fit into my GPU memory.

System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at LLama.Native.SafeLLamaSamplerChainHandle.<Apply>g__llama_sampler_apply|3_0(SafeLLamaSamplerChainHandle smpl, LLamaTokenDataArrayNative& cur_p)
   at LLama.Native.SafeLLamaSamplerChainHandle.Apply(LLamaTokenDataArrayNative& candidates) in B:\llama-src\LLamaSharp\LLama\Native\SafeLLamaSamplerHandle.cs:line 42
   at LLama.Sampling.DefaultSamplingPipeline.Sample(SafeLLamaContextHandle ctx, Int32 index) in B:\llama-src\LLamaSharp\LLama\Sampling\DefaultSamplingPipeline.cs:line 271
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext() in B:\llama-src\LLamaSharp\LLama\LLamaStatelessExecutor.cs:line 119
   at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()

This line is

_grammarChain.Apply(ref nativeTopK);

BUT, to make things more obnoxious, this only happens when I'm running things in a big batch, loading and unloading other models. Running it by itself seems fine. There is nothing in the logs either. Just a crash.

Whether it is an actual issue with Extended mode, or just a combination of things that makes that specific model go kablooey, is the question.....

@martindevans
Member Author

@phil-scott-78 I've just pushed up a potential fix. Due to the way memory pooling works, we were always passing some invalid data at the end of the array. To be honest I don't really see why this bug would have caused a crash, but I hope it was the issue! Would you mind testing again to see if you can still reproduce it?
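For anyone following along, this is the classic pooling pitfall: a rented array can be longer than requested, so the tail holds stale junk unless you slice to the valid length. An illustrative sketch of the pattern (not the actual LLamaSharp code):

```csharp
using System;
using System.Buffers;

int count = 5;
float[] buffer = ArrayPool<float>.Shared.Rent(count);   // may actually be e.g. 16 elements long
try
{
    for (var i = 0; i < count; i++)
        buffer[i] = i;                                   // only the first 'count' slots are valid

    // Bug pattern: handing the consumer the whole array includes junk past index count - 1.
    // Fix pattern: hand over only the valid region.
    ReadOnlySpan<float> valid = buffer.AsSpan(0, count);
    Console.WriteLine(valid.Length);                     // 5, regardless of buffer.Length
}
finally
{
    ArrayPool<float>.Shared.Return(buffer);
}
```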

@phil-scott-78
Contributor

Gave it a few test runs under some different parameters and they are all running successfully. Not gonna lie, I was embarrassed about the quality of that bug report - I'm impressed you figured it out from that lol. Nicely done.

@martindevans
Member Author

Thanks for testing that! I think this is now ready to merge?

@m0nsky
Contributor

m0nsky commented Feb 26, 2025

Yes, I think it's ready now; everything still looks good after the fix on my end. Thanks for the extensive testing @phil-scott-78, it is really appreciated.

@martindevans martindevans merged commit e2cfed6 into SciSharp:master Feb 26, 2025
6 checks passed
@martindevans martindevans deleted the grammar_resampling branch February 26, 2025 15:01