
[BUG]: Tokenization in 0.14.0 adds spaces #856

@newsletternewsletter

Description


When tokenizing text and then decoding the resulting tokens, one can see that tokenization (as of version 0.14.0) now adds one extra leading space to the text for every call of Context.Tokenize(text, addBos, special). This is especially bad if a text is tokenized with more than one call.
Version 0.13.0 did not exhibit this behavior. Or at least, it did not add spaces at the start of words, changing their token ids.
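If the cause is a SentencePiece-style dummy-space prefix (the add_dummy_prefix option), the effect can be illustrated with a toy tokenizer. This is plain Python invented for illustration, not LLamaSharp or llama.cpp code; the split rule is a stand-in for a real subword vocabulary:

```python
# Toy illustration only -- NOT LLamaSharp or llama.cpp code. It mimics a
# SentencePiece-style tokenizer whose add_dummy_prefix option prepends one
# space to every raw-text input before segmenting it.
import re

def toy_tokenize(text: str) -> list[str]:
    # Prepend the dummy space, then split into space-prefixed chunks
    # (an invented stand-in for a real subword vocabulary).
    return re.findall(r" ?[^ ]+", " " + text)

def toy_decode(tokens: list[str]) -> str:
    return "".join(tokens)

one_call = toy_tokenize("<start_of_turn>user")
two_calls = toy_tokenize("<start_of_turn>") + toy_tokenize("user")

print(toy_decode(one_call))   # ' <start_of_turn>user'
print(toy_decode(two_calls))  # ' <start_of_turn> user'  <- spurious space before 'user'
```

Each call gets its own dummy prefix, which would explain the ' user' (id 2425) vs. 'user' (id 1645) difference in the dump below.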

This seems fine for most models (I noticed it when using trollek/NinjaMouse-2.4B-32L-danube), but with gemma-1.1-2b-it-Q6_K.gguf (from bartowski/gemma-1.1-2b-it-GGUF) it no longer works. The prompt was:

<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model

Validating with tokenize from llama.cpp b2985 (used in LLamaSharp version 0.13.0):

     2 -> '<bos>'
   106 -> '<start_of_turn>'
  2425 -> ' user'
235286 -> '\'
235254 -> 'n'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
235336 -> '?'
   107 -> '<end_of_turn>'
   730 -> ' \'
235254 -> 'n'
   106 -> '<start_of_turn>'
  2091 -> ' model'
235286 -> '\'
235254 -> 'n'

Interestingly, the token at position 2 (id 2425, ' user') has a leading space added to 'user' (id 1645).

But even the latest llama.cpp b3412 does not tokenize this correctly; look at the token with id 968 (' <') right after '<bos>':

     2 -> '<bos>'
   968 -> ' <'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  1645 -> 'user'
   108 -> '
'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
181537 -> '?<'
   615 -> 'end'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
   108 -> '
'
235322 -> '<'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  2516 -> 'model'
   108 -> '
'

Is there a way to completely prevent extra spaces from being added by tokenization anywhere? I will tokenize them by hand if necessary. 😉
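To illustrate why a single tokenization call would avoid the problem (again a toy sketch with a dummy-prefix tokenizer invented for illustration, not real llama.cpp behavior):

```python
# Toy sketch only -- NOT real llama.cpp behavior. An invented dummy-prefix
# tokenizer shows that one tokenize call per prompt adds the prefix once,
# while per-segment calls add it once per segment.
import re

def toy_tokenize(text: str) -> list[str]:
    # add_dummy_prefix stand-in: prepend a space, then chunk the text.
    return re.findall(r" ?[^ ]+", " " + text)

def toy_decode(tokens: list[str]) -> str:
    return "".join(tokens)

segments = ["<start_of_turn>user\n", "Who are you?<end_of_turn>\n"]

# Per-segment tokenization: every call re-adds the dummy prefix.
per_segment = [tok for seg in segments for tok in toy_tokenize(seg)]
# Single-call tokenization: the prefix is added exactly once, at the front.
single_call = toy_tokenize("".join(segments))

print(toy_decode(per_segment))  # contains '\n Who' -- a space leaked into the text
print(toy_decode(single_call))  # only one leading space, at the very front
```

In LLamaSharp terms, that would mean building the full prompt string first and calling Context.Tokenize once (with special-token parsing enabled) instead of once per chat fragment — assuming the dummy-prefix explanation is correct.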

Reproduction Steps

Write the prompt (see above) to prompt.txt and run:

for llama.cpp b2985:

tokenize.exe "gemma-1.1-2b-it.Q6_K.gguf" "<start_of_turn>user\nWho are you?<end_of_turn>\n<start_of_turn>model\n"

or for llama.cpp b3412:

llama-tokenize.exe -m "gemma-1.1-2b-it.Q6_K.gguf" -f "prompt.txt"

Environment & Configuration

  • Operating system: Windows 10
  • .NET runtime version: 8.0
  • LLamaSharp version: 0.14.0
  • CPU device: Intel Core i7

Known Workarounds

I would love to know!

Metadata


Labels

Upstream (tracking an issue in llama.cpp), bug (something isn't working), stale (stale issue will be autoclosed soon)
