RoundTrip Tokenization Errors #205
…eveals some flaws with certain characters
Investigation Notes:
Fix Notes:
I've made a start on fixing this, but it's going to require a bit of a rework of how we handle some detokenization stuff.
…, because sometimes one single character may be represented by multiple tokens).
- Built a new (hacky) `Detokenize` method which handles this
Latest commit removes all
- `AntipromptProcessor` accepts chunks of text and returns a value indicating if any antiprompt has been detected.
- `StreamingTokenDecoder` decodes tokens into text, maintaining some internal state to handle single characters which are encoded as multiple tokens.

Added tests for these classes and updated `StatelessExecutor` to use them. Removed most `DeTokenize` methods, marked the rest as obsolete (should always use a `StreamingTokenDecoder`).
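The antiprompt idea described in that commit can be sketched in a few lines. This is a rough Python illustration, not the LLamaSharp implementation: the class name `AntipromptBufferSketch`, its `add` method, and the choice to keep only as many trailing characters as the longest antiprompt are all assumptions made for the example.

```python
class AntipromptBufferSketch:
    """Hypothetical illustration: detect antiprompts in already-decoded text."""

    def __init__(self, antiprompts):
        # Ignore empty strings, which would otherwise match everything.
        self._antiprompts = [a for a in antiprompts if a]
        # Only the last `keep` characters can ever matter for an ends-with check.
        self._keep = max((len(a) for a in self._antiprompts), default=0)
        self._buffer = ""

    def add(self, text):
        """Append a chunk of decoded text; return True if any antiprompt now ends the buffer."""
        if self._keep == 0:
            return False
        self._buffer = (self._buffer + text)[-self._keep:]
        return any(self._buffer.endswith(a) for a in self._antiprompts)


# The antiprompt arrives split across three chunks of decoded text.
processor = AntipromptBufferSketch(["\nUser:"])
results = [processor.add(chunk) for chunk in ["Hello", "\nUs", "er:"]]
print(results)  # [False, False, True]
```

Because the check runs on decoded text rather than on token ids, it does not matter how the tokenizer happened to split the antiprompt.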
Is this a significantly different response than it generates without this PR? This should only affect character encoding and antiprompt detection. Other than that, it shouldn't change the output!
Aha ok, I'll have a look into that later. I'm surprised my unit tests didn't pick that up!
@sinusinu It should be fixed now.
See #203 for original bug report.
After some investigation I discovered that the problem is how tokens are converted into text. LLamaSharp has been built with the assumption that one token -> one or more characters (i.e. one to many). However, this is not true! Some characters require multiple tokens to encode:
`철` maps to multiple tokens: `[29871, 239, 181, 163]`
Fixing this required quite an extensive redesign of how tokens are converted into text!
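To make the failure mode concrete, here is a minimal Python sketch (not LLamaSharp code) of decoding one token at a time versus buffering partial characters. It rests on assumptions: the last three ids in the example are treated as byte-fallback tokens whose raw byte is `id - 3`, which is consistent with the UTF-8 encoding of `철` (`0xEC 0xB2 0xA0`); the first id, 29871, is left out because the report does not say what it encodes; and `StreamingDecoderSketch` is a hypothetical name.

```python
import codecs

# Assumption: byte-fallback token ids map to raw bytes as (id - 3).
BYTE_TOKEN_OFFSET = 3


class StreamingDecoderSketch:
    """Hypothetical illustration of a decoder that remembers incomplete characters."""

    def __init__(self):
        # The incremental decoder holds back bytes until a full character is available.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def add_byte_token(self, token_id):
        """Feed one byte-fallback token; return any text completed by it (possibly empty)."""
        raw = bytes([token_id - BYTE_TOKEN_OFFSET])
        return self._decoder.decode(raw)


byte_tokens = [239, 181, 163]

# Naive approach: decode each token on its own. Every piece is an invalid
# fragment, so each one becomes the U+FFFD replacement character.
naive = [
    bytes([t - BYTE_TOKEN_OFFSET]).decode("utf-8", errors="replace")
    for t in byte_tokens
]

# Streaming approach: nothing is emitted until the final byte arrives.
decoder = StreamingDecoderSketch()
pieces = [decoder.add_byte_token(t) for t in byte_tokens]
print(pieces)  # ['', '', '철']
```

The naive list contains three replacement characters, while the streaming version yields the single intended character once its last byte has been seen.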
Additions:
- `StreamingTokenDecoder`, which accepts tokens and accumulates a buffer of characters. If a single token does not decode into a valid character that's fine; it remembers the state internally.
- `AntipromptProcessor`, which keeps a buffer of previously decoded text. This is safer than using the `TokensEndsWithAnyString` methods.
- Updated `StatelessExecutor` to use these; it is totally fixed.

Removals:
- The `DeTokenize` and `TokenToString` methods are all incorrect to use. They have been marked as Obsolete or removed.
- The `TokensEndsWithAnyString` methods are also incorrect to use and have been marked as Obsolete.

Work remaining for future PRs: