
[BUG]: Tokenization in 0.14.0 adds spaces #856

@newsletternewsletter

Description


When tokenizing text and then decoding the resulting tokens, one can see that tokenization (as of version 0.14.0) now adds one extra leading space to the text for every call of Context.Tokenize(text, addBos, special). This is especially bad if a text is tokenized with more than one call.
Version 0.13.0 did not exhibit this behavior. Or at least, it did not add spaces at the start of words, changing their token ids.
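If the cause is a SentencePiece-style dummy-space prefix (the add_dummy_prefix option), the effect can be illustrated with a toy tokenizer. This is plain Python invented for illustration, not LLamaSharp or llama.cpp code; the split rule is a stand-in for a real subword vocabulary:

```python
# Toy illustration only -- NOT LLamaSharp or llama.cpp code. It mimics a
# SentencePiece-style tokenizer whose add_dummy_prefix option prepends one
# space to every raw-text input before segmenting it.
import re

def toy_tokenize(text: str) -> list[str]:
    # Prepend the dummy space, then split into space-prefixed chunks
    # (an invented stand-in for a real subword vocabulary).
    return re.findall(r" ?[^ ]+", " " + text)

def toy_decode(tokens: list[str]) -> str:
    return "".join(tokens)

one_call = toy_tokenize("<start_of_turn>user")
two_calls = toy_tokenize("<start_of_turn>") + toy_tokenize("user")

print(toy_decode(one_call))   # ' <start_of_turn>user'
print(toy_decode(two_calls))  # ' <start_of_turn> user'  <- spurious space before 'user'
```

Each call gets its own dummy prefix, which would explain the ' user' (id 2425) vs. 'user' (id 1645) difference in the dump below.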

This seems fine for most models (I noticed it when using trollek/NinjaMouse-2.4B-32L-danube), but with gemma-1.1-2b-it-Q6_K.gguf (from bartowski/gemma-1.1-2b-it-GGUF) it no longer works. The prompt was:

<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model

Validating with tokenize from llama.cpp b2985 (used in LLamaSharp version 0.13.0):

     2 -> '<bos>'
   106 -> '<start_of_turn>'
  2425 -> ' user'
235286 -> '\'
235254 -> 'n'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
235336 -> '?'
   107 -> '<end_of_turn>'
   730 -> ' \'
235254 -> 'n'
   106 -> '<start_of_turn>'
  2091 -> ' model'
235286 -> '\'
235254 -> 'n'

Interestingly, the token at position 2 (id 2425, ' user') has a leading space added to 'user' (id 1645).

But even the latest llama.cpp b3412 does not tokenize this correctly; look at the token with id 968 (' <') right after '<bos>':

     2 -> '<bos>'
   968 -> ' <'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  1645 -> 'user'
   108 -> '
'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
181537 -> '?<'
   615 -> 'end'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
   108 -> '
'
235322 -> '<'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  2516 -> 'model'
   108 -> '
'

Is there a way to completely prevent extra spaces from being added by tokenization anywhere? I will tokenize them by hand if necessary. 😉
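To illustrate why a single tokenization call would avoid the problem (again a toy sketch with a dummy-prefix tokenizer invented for illustration, not real llama.cpp behavior):

```python
# Toy sketch only -- NOT real llama.cpp behavior. An invented dummy-prefix
# tokenizer shows that one tokenize call per prompt adds the prefix once,
# while per-segment calls add it once per segment.
import re

def toy_tokenize(text: str) -> list[str]:
    # add_dummy_prefix stand-in: prepend a space, then chunk the text.
    return re.findall(r" ?[^ ]+", " " + text)

def toy_decode(tokens: list[str]) -> str:
    return "".join(tokens)

segments = ["<start_of_turn>user\n", "Who are you?<end_of_turn>\n"]

# Per-segment tokenization: every call re-adds the dummy prefix.
per_segment = [tok for seg in segments for tok in toy_tokenize(seg)]
# Single-call tokenization: the prefix is added exactly once, at the front.
single_call = toy_tokenize("".join(segments))

print(toy_decode(per_segment))  # contains '\n Who' -- a space leaked into the text
print(toy_decode(single_call))  # only one leading space, at the very front
```

In LLamaSharp terms, that would mean building the full prompt string first and calling Context.Tokenize once (with special-token parsing enabled) instead of once per chat fragment — assuming the dummy-prefix explanation is correct.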

Reproduction Steps

Write the prompt (see above) to prompt.txt and run:

for llama.cpp b2985:

tokenize.exe "gemma-1.1-2b-it.Q6_K.gguf" "<start_of_turn>user\nWho are you?<end_of_turn>\n<start_of_turn>model\n"

or for llama.cpp b3412:

llama-tokenize.exe -m "gemma-1.1-2b-it.Q6_K.gguf" -f "prompt.txt"

Environment & Configuration

  • Operating system: Windows 10
  • .NET runtime version: 8.0
  • LLamaSharp version: 0.14.0
  • CPU device: Intel Core i7

Known Workarounds

I would love to know!

Metadata


Labels

Upstream (tracking an issue in llama.cpp), bug (something isn't working), stale (stale issue will be autoclosed soon)
