Description
When tokenizing a text and then decoding the tokens, one can see that tokenization now (as of version 0.14.0) prepends one additional space to the text on every call of `Context.Tokenize(text, addBos, special)`. This is especially bad if a text is tokenized across more than one call.
Version 0.13.0 did not exhibit this behavior; at least, it did not add spaces at the start of words, which changes their token ids.
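To illustrate the effect, here is a minimal simulation of the behavior described above. This is not the LLamaSharp API: `tokenize` and `detokenize` are toy stand-ins (a char-level split plays the role of real subword pieces), showing only how a per-call dummy prefix space corrupts text that is tokenized in chunks.

```python
def tokenize(text):
    # Simulated 0.14.0-like behavior: a leading space is injected on every call
    return list(" " + text)

def detokenize(tokens):
    return "".join(tokens)

# Tokenizing the whole text once adds one spurious leading space...
whole = detokenize(tokenize("Who are you?"))
# ...but tokenizing in two calls injects a space *inside* the text as well.
chunked = detokenize(tokenize("Who are") + tokenize(" you?"))

print(repr(whole))    # ' Who are you?'
print(repr(chunked))  # ' Who are  you?'
```

With a real subword tokenizer the corruption shows up as different token ids rather than literal characters, but the mechanism is the same.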
This seems fine for most models (I noticed it when using trollek/NinjaMouse-2.4B-32L-danube), but when I use gemma-1.1-2b-it-Q6_K.gguf (from bartowski/gemma-1.1-2b-it-GGUF), it no longer works. The prompt was:
```
<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model
```
Validating with `tokenize` from llama.cpp b2985 (used in LLamaSharp version 0.13.0):
```
     2 -> '<bos>'
   106 -> '<start_of_turn>'
  2425 -> ' user'
235286 -> '\'
235254 -> 'n'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
235336 -> '?'
   107 -> '<end_of_turn>'
   730 -> ' \'
235254 -> 'n'
   106 -> '<start_of_turn>'
  2091 -> ' model'
235286 -> '\'
235254 -> 'n'
```
Interestingly, the token at position 2 with id 2425, ' user', has a leading space added to 'user' (which alone is id 1645).
But even the latest llama.cpp b3412 does not work correctly; look at the token at position 2 with id 968, ' <':
```
     2 -> '<bos>'
   968 -> ' <'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  1645 -> 'user'
   108 -> '
'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
181537 -> '?<'
   615 -> 'end'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
   108 -> '
'
235322 -> '<'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  2516 -> 'model'
   108 -> '
'
```
Is there a way to completely prevent tokenization from adding extra spaces anywhere? I will tokenize by hand if necessary. 😉
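One generic trick used with SentencePiece-style tokenizers (hedged: this is an illustrative sketch, not a LLamaSharp API, and `tokenize` is the same toy char-level stand-in as above) is to tokenize a known sentinel prefix together with the text and then drop the sentinel's tokens, so the injected leading space attaches to the sentinel instead of the payload:

```python
def tokenize(text):
    # Toy stand-in for the real tokenizer: injects a leading space per call
    return list(" " + text)

def tokenize_raw(text, sentinel="\n"):
    """Tokenize `sentinel + text`, then drop the sentinel's tokens so the
    injected dummy space lands on the sentinel rather than on `text`."""
    tokens = tokenize(sentinel + text)
    return tokens[len(tokenize(sentinel)):]

print("".join(tokenize_raw("user")))  # user
```

With a real subword tokenizer the sentinel must be a piece that never merges with the first piece of the text (a newline is the usual choice), otherwise slicing off a fixed number of tokens is unsafe.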
Reproduction Steps
Write the prompt (see above) to `prompt.txt` and run,
for llama.cpp b2985:
```
tokenize.exe "gemma-1.1-2b-it.Q6_K.gguf" "<start_of_turn>user\nWho are you?<end_of_turn>\n<start_of_turn>model\n"
```
or for llama.cpp b3412:
```
llama-tokenize.exe -m "gemma-1.1-2b-it.Q6_K.gguf" -f "prompt.txt"
```
Environment & Configuration
- Operating system: Windows 10
- .NET runtime version: 8.0
- LLamaSharp version: 0.14.0
- CPU device: Intel Core i7
Known Workarounds
I would love to know!