
Conversation


@martindevans commented Oct 22, 2023

See #203 for original bug report.

After some investigation I discovered that the problem is in how tokens are converted into text. LLamaSharp has been built with the assumption that one token maps to one or more characters (i.e. one-to-many). However, this is not true! Some characters require multiple tokens to encode:

  • The 철 character maps to multiple tokens: [29871, 239, 181, 163]
  • Test it by pasting that token into this tool or this other tool
  • Both of those external tools show a broken decoding of that character; they have the same bug!
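A minimal Python sketch (not LLamaSharp code) of why decoding one token at a time fails for such characters. The byte-fallback mapping `byte = token_id - 3` is an assumption about the LLaMA SentencePiece vocabulary (byte tokens `<0x00>`..`<0xFF>` at ids 3..258), used here for illustration only:

```python
# Byte-fallback tokens for 철 (assumed mapping: byte = token_id - 3).
tokens = [239, 181, 163]
byte_vals = bytes(t - 3 for t in tokens)  # b'\xec\xb2\xa0'

# Decoding each token's byte individually fails: no single byte of a
# 3-byte UTF-8 sequence is a valid string on its own.
for b in byte_vals:
    try:
        print(bytes([b]).decode("utf-8"))
    except UnicodeDecodeError:
        print(f"byte {b} alone is not valid UTF-8")

# Decoding the accumulated bytes in one go succeeds:
print(byte_vals.decode("utf-8"))  # -> 철
```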

Fixing this required quite an extensive redesign of how tokens are converted into text!

Additions:

  • Introduced a new StreamingTokenDecoder which accepts tokens and accumulates a buffer of characters. If a single token does not decode into a valid character, that's fine; the decoder remembers the partial state internally.
  • Also introduced a new AntipromptProcessor which keeps a buffer of previously decoded text. This is safer than using the TokensEndsWithAnyString methods.
  • Modified StatelessExecutor to use these; it is now fully fixed.
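The real StreamingTokenDecoder is C#; the core idea can be sketched in a few lines of Python using an incremental UTF-8 decoder, which keeps incomplete multi-byte sequences as internal state. `token_to_bytes` is a hypothetical callback standing in for the llama.cpp detokenizer:

```python
import codecs

class StreamingTokenDecoderSketch:
    """Python sketch of the StreamingTokenDecoder idea (the real class is C#).
    token_to_bytes maps a token id to its raw bytes."""

    def __init__(self, token_to_bytes):
        self._token_to_bytes = token_to_bytes
        # The incremental decoder buffers a trailing partial UTF-8 sequence
        # internally instead of raising on it.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._chars = []

    def add(self, token):
        # Feed this token's bytes; only completed characters are emitted.
        self._chars.append(self._decoder.decode(self._token_to_bytes(token)))

    def read(self):
        # Return and clear the accumulated text (drain, don't replay).
        text = "".join(self._chars)
        self._chars.clear()
        return text
```

For example, feeding tokens [239, 181, 163] with the assumed byte-fallback mapping `lambda t: bytes([t - 3])` yields "철" from `read()`, with nothing left over afterwards.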

Removals:

  • The various DeTokenize and TokenToString methods are all conceptually broken and should not be used. They have been marked as Obsolete or removed.
  • The TokensEndsWithAnyString methods are likewise incorrect and have been marked as Obsolete.

Work remaining for future PRs:

  • Remove all the obsolete methods.
  • Fix the other executors in the same way.

@martindevans

Investigation Notes:

  • Using this test string: 철수라는
  • In UTF8 bytes that is: [ 236, 178, 160, 236, 136, 152, 235, 157, 188, 235, 138, 148 ], (each triplet of bytes is one character)
  • Using this tool, that should tokenize to these tokens: [1, 29871, 239, 181, 163, 30970, 31197, 31081].
    • LLamaSharp does generate the correct tokens.
    • Note that the tool linked above also generates broken characters!
    • So does this other tool!
  • The 철 character maps to multiple tokens: [29871, 239, 181, 163]
    • The LLamaSharp tokenizer assumes a one-to-many relationship between tokens and characters, i.e. one token can become multiple characters. However, in this case it is many-to-one: multiple tokens become one character!
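The byte breakdown above can be verified in a couple of lines of Python:

```python
# Verify the UTF-8 breakdown of the test string: each Hangul syllable
# encodes to a triplet of three bytes.
s = "철수라는"
raw = list(s.encode("utf-8"))
print(raw)  # [236, 178, 160, 236, 136, 152, 235, 157, 188, 235, 138, 148]
assert [len(c.encode("utf-8")) for c in s] == [3, 3, 3, 3]
```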

@martindevans commented Oct 22, 2023

Fix Notes:

  • We have TokenToString methods - those seem to be broken conceptually since we now know a single token doesn't always map to a valid string.
  • Convert all tokens to bytes, then convert bytes to string in one go?
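The "bytes first, string second" idea in the second bullet, as a rough Python sketch. `token_to_bytes` is a hypothetical token-to-raw-bytes lookup (not a real LLamaSharp API), and the `id - 3` byte-fallback mapping in the example is an assumption about the LLaMA vocabulary:

```python
def detokenize_all(tokens, token_to_bytes):
    # Accumulate every token's raw bytes first...
    buf = bytearray()
    for t in tokens:
        buf += token_to_bytes(t)
    # ...then decode the whole buffer in one go, so characters that
    # span multiple tokens decode correctly.
    return buf.decode("utf-8")

# The three byte-fallback tokens decode to a single character.
print(detokenize_all([239, 181, 163], lambda t: bytes([t - 3])))  # 철
```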

@martindevans

I've made a start on fixing this, but it's going to require a bit of a rework of how we handle some detokenization stuff.

…, because sometimes one single character may be represented by multiple tokens).

 - Built a new (hacky) `Detokenize` method which handles this
@martindevans commented Oct 22, 2023

Latest commit removes all TokenToString methods. This is a breaking change, but I think it's needed here. Since we now know that a single character may require several tokens, it would never be correct to use the TokenToString methods.

 - `AntipromptProcessor` accepts chunks of text and returns a value indicating if any antiprompt has been detected.
 - `StreamingTokenDecoder` decodes tokens into text, maintaining some internal state to handle single characters which are encoded as multiple tokens.

Added tests for these classes and updated StatelessExecutor to use them.

Removed most DeTokenize methods and marked the rest as obsolete (a `StreamingTokenDecoder` should always be used instead).
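The AntipromptProcessor described above is C#; as a hedged Python sketch of the idea, a rolling tail of recent text is kept so that an antiprompt split across decoded chunks is still detected:

```python
class AntipromptProcessorSketch:
    """Python sketch of the AntipromptProcessor idea (the real class is C#).
    add() returns True if any antiprompt has been seen, even one that
    straddles a chunk boundary."""

    def __init__(self, antiprompts):
        self._antiprompts = [a for a in antiprompts if a]
        self._max_len = max((len(a) for a in self._antiprompts), default=0)
        self._buffer = ""

    def add(self, chunk):
        self._buffer += chunk
        found = any(a in self._buffer for a in self._antiprompts)
        # Keep only a tail that could still be the start of an antiprompt
        # continuing into the next chunk; discard the rest.
        keep = max(self._max_len - 1, 0)
        self._buffer = self._buffer[-keep:] if keep else ""
        return found
```

For example, with antiprompt "User:" the chunks "Hello Us" and "er: hi" trigger detection on the second call even though neither chunk contains the whole antiprompt.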
@martindevans

@sinusinu this should fix the issue you reported in #203, as long as you use the StatelessExecutor. There's still work needed to fix the other executors.

Would you mind testing it to confirm if it works for you?

@sinusinu

Well, except for the output being a little wacky, it looks like the StatelessModeExecute example is working fine with Korean output!
[screenshot: 20231023-100701-WindowsTerminal]

@martindevans commented Oct 23, 2023

> except for the output being a little wacky

Is this a significantly different response than it generates without this PR? This should only affect character encoding and antiprompt detection; other than that it shouldn't change the output!

@sinusinu

Output text is being... accumulated? The full text is printed again on each step instead of just the new tokens. The same thing happens with English.

Without PR:
[screenshot: 20231023-233413-WindowsTerminal]

With PR:
[screenshot: 20231023-233206-WindowsTerminal]

@martindevans

Aha ok, I'll have a look into that later. I'm surprised my unit tests didn't pick that up!

@martindevans

@sinusinu It should be fixed now.

@sinusinu

Yes, now it works as expected. Very nice!
[screenshot: 20231024-010137-WindowsTerminal]

@martindevans marked this pull request as ready for review October 23, 2023 19:12
@martindevans merged commit 5b6408b into SciSharp:master Oct 24, 2023
@martindevans deleted the roundtrip_tokenization_investigation branch October 24, 2023 19:46