
Conversation


@martindevans commented Oct 22, 2023

See #203 for original bug report.

After some investigation I discovered that the problem is in how tokens are converted into text. LLamaSharp has been built with the assumption that one token maps to one or more characters (i.e. one-to-many). However, this is not true! Some characters require multiple tokens to encode:

  • The 철 character maps to multiple tokens: [29871, 239, 181, 163]
  • Test it by pasting that token into this tool or this other tool
  • Both of those external tools show a broken decoding of that character; they have the same bug!
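A minimal Python sketch (not LLamaSharp code) of why decoding one token at a time fails for such characters. The byte-fallback mapping `byte = token_id - 3` is an assumption about the LLaMA SentencePiece vocabulary (byte tokens `<0x00>`..`<0xFF>` at ids 3..258), used here for illustration only:

```python
# Byte-fallback tokens for 철 (assumed mapping: byte = token_id - 3).
tokens = [239, 181, 163]
byte_vals = bytes(t - 3 for t in tokens)  # b'\xec\xb2\xa0'

# Decoding each token's byte individually fails: no single byte of a
# 3-byte UTF-8 sequence is a valid string on its own.
for b in byte_vals:
    try:
        print(bytes([b]).decode("utf-8"))
    except UnicodeDecodeError:
        print(f"byte {b} alone is not valid UTF-8")

# Decoding the accumulated bytes in one go succeeds:
print(byte_vals.decode("utf-8"))  # -> 철
```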

Fixing this required quite an extensive redesign of how tokens are converted into text!

Additions:

  • Introduced a new StreamingTokenDecoder which accepts tokens and accumulates a buffer of characters. If a single token does not decode into a valid character, that's fine; the decoder remembers the partial state internally.
  • Also introduced a new AntipromptProcessor which keeps a buffer of previously decoded text. This is safer than using the TokensEndsWithAnyString methods.
  • Modified StatelessExecutor to use these; it is now fully fixed.
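The real StreamingTokenDecoder is C#; the core idea can be sketched in a few lines of Python using an incremental UTF-8 decoder, which keeps incomplete multi-byte sequences as internal state. `token_to_bytes` is a hypothetical callback standing in for the llama.cpp detokenizer:

```python
import codecs

class StreamingTokenDecoderSketch:
    """Python sketch of the StreamingTokenDecoder idea (the real class is C#).
    token_to_bytes maps a token id to its raw bytes."""

    def __init__(self, token_to_bytes):
        self._token_to_bytes = token_to_bytes
        # The incremental decoder buffers a trailing partial UTF-8 sequence
        # internally instead of raising on it.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._chars = []

    def add(self, token):
        # Feed this token's bytes; only completed characters are emitted.
        self._chars.append(self._decoder.decode(self._token_to_bytes(token)))

    def read(self):
        # Return and clear the accumulated text (drain, don't replay).
        text = "".join(self._chars)
        self._chars.clear()
        return text
```

For example, feeding tokens [239, 181, 163] with the assumed byte-fallback mapping `lambda t: bytes([t - 3])` yields "철" from `read()`, with nothing left over afterwards.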

Removals:

  • The various DeTokenize and TokenToString methods are all conceptually broken and should not be used. They have been marked as Obsolete or removed.
  • The TokensEndsWithAnyString methods are likewise incorrect and have been marked as Obsolete.

Work remaining for future PRs:

  • Remove all the obsolete methods.
  • Fix the other executors in the same way.

@martindevans

Investigation Notes:

  • Using this test string: 철수라는
  • In UTF8 bytes that is: [ 236, 178, 160, 236, 136, 152, 235, 157, 188, 235, 138, 148 ], (each triplet of bytes is one character)
  • Using this tool, that should tokenize to these tokens: [1, 29871, 239, 181, 163, 30970, 31197, 31081].
    • LLamaSharp does generate the correct tokens.
    • Note that the tool linked above also generates broken characters!
    • So does this other tool!
  • The 철 character maps to multiple tokens: [29871, 239, 181, 163]
    • The LLamaSharp tokenizer assumes a one-to-many relationship between tokens and characters, i.e. one token can become multiple characters. However, in this case it is many-to-one: multiple tokens become one character!
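The byte breakdown above can be verified in a couple of lines of Python:

```python
# Verify the UTF-8 breakdown of the test string: each Hangul syllable
# encodes to a triplet of three bytes.
s = "철수라는"
raw = list(s.encode("utf-8"))
print(raw)  # [236, 178, 160, 236, 136, 152, 235, 157, 188, 235, 138, 148]
assert [len(c.encode("utf-8")) for c in s] == [3, 3, 3, 3]
```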

@martindevans commented Oct 22, 2023

Fix Notes:

  • We have TokenToString methods - those seem to be broken conceptually since we now know a single token doesn't always map to a valid string.
  • Convert all tokens to bytes, then convert bytes to string in one go?
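The "bytes first, string second" idea in the second bullet, as a rough Python sketch. `token_to_bytes` is a hypothetical token-to-raw-bytes lookup (not a real LLamaSharp API), and the `id - 3` byte-fallback mapping in the example is an assumption about the LLaMA vocabulary:

```python
def detokenize_all(tokens, token_to_bytes):
    # Accumulate every token's raw bytes first...
    buf = bytearray()
    for t in tokens:
        buf += token_to_bytes(t)
    # ...then decode the whole buffer in one go, so characters that
    # span multiple tokens decode correctly.
    return buf.decode("utf-8")

# The three byte-fallback tokens decode to a single character.
print(detokenize_all([239, 181, 163], lambda t: bytes([t - 3])))  # 철
```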

@martindevans

I've made a start on fixing this, but it's going to require a bit of a rework of how we handle some detokenization stuff.

…, because sometimes one single character may be represented by multiple tokens).

 - Built a new (hacky) `Detokenize` method which handles this
@martindevans commented Oct 22, 2023

Latest commit removes all TokenToString methods. This is a breaking change, but I think it's needed here. Since we now know that a single character may require several tokens, it would never be correct to use the TokenToString methods.

 - `AntipromptProcessor` accepts chunks of text and returns a value indicating if any antiprompt has been detected.
 - `StreamingTokenDecoder` decodes tokens into text, maintaining some internal state to handle single characters which are encoded as multiple tokens.

Added tests for these classes and updated StatelessExecutor to use them.

Removed most DeTokenize methods and marked the rest as obsolete (a `StreamingTokenDecoder` should always be used instead).
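The AntipromptProcessor described above is C#; as a hedged Python sketch of the idea, a rolling tail of recent text is kept so that an antiprompt split across decoded chunks is still detected:

```python
class AntipromptProcessorSketch:
    """Python sketch of the AntipromptProcessor idea (the real class is C#).
    add() returns True if any antiprompt has been seen, even one that
    straddles a chunk boundary."""

    def __init__(self, antiprompts):
        self._antiprompts = [a for a in antiprompts if a]
        self._max_len = max((len(a) for a in self._antiprompts), default=0)
        self._buffer = ""

    def add(self, chunk):
        self._buffer += chunk
        found = any(a in self._buffer for a in self._antiprompts)
        # Keep only a tail that could still be the start of an antiprompt
        # continuing into the next chunk; discard the rest.
        keep = max(self._max_len - 1, 0)
        self._buffer = self._buffer[-keep:] if keep else ""
        return found
```

For example, with antiprompt "User:" the chunks "Hello Us" and "er: hi" trigger detection on the second call even though neither chunk contains the whole antiprompt.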
@martindevans

@sinusinu this should fix the issue you reported in #203, as long as you use the StatelessExecutor. There's still work needed to fix the other executors.

Would you mind testing it to confirm if it works for you?

@sinusinu

Well, except for the output being a little wacky, it looks like the StatelessModeExecute example is working fine with Korean output!
[screenshot: 20231023-100701-WindowsTerminal]

@martindevans commented Oct 23, 2023

> except for the output being a little wacky

Is this a significantly different response than it generates without this PR? This should only affect character encoding and antiprompt detection; other than that it shouldn't change the output!

@sinusinu

Output text is being... accumulated? The full text is printed again on each step instead of just the new tokens. The same thing happens with English.

Without PR:
[screenshot: 20231023-233413-WindowsTerminal]

With PR:
[screenshot: 20231023-233206-WindowsTerminal]

@martindevans

Aha ok, I'll have a look into that later. I'm surprised my unit tests didn't pick that up!

@martindevans

@sinusinu It should be fixed now.

@sinusinu

Yes, now it works as expected. Very nice!
[screenshot: 20231024-010137-WindowsTerminal]

@martindevans marked this pull request as ready for review October 23, 2023 19:12
@martindevans merged commit 5b6408b into SciSharp:master Oct 24, 2023
@martindevans deleted the roundtrip_tokenization_investigation branch October 24, 2023 19:46