StreamableParser: Use replacement character for invalid utf-8 sequences #83
Conversation
Converting this to draft, as I realize I may be making too many assumptions here about how tokens map across byte boundaries for utf-8 characters. I'm going to add a few more tests for edge cases and likely update this to keep track of undecoded bytes instead of undecoded tokens.
Before, when parsing message content, if we came across any invalid utf-8 sequences, we'd forever accumulate them in `undecoded_tokens` and any subsequent content would get dropped when we eventually found our next stop token.

Now, we detect invalid utf-8 sequences, replace them with the utf-8 replacement character '\uFFFD', and continue parsing further content. In real-world scenarios, gpt-oss models sometimes generate invalid utf-8 sequences. This can be caused by temperature settings that are too high, prompts that use utf-8 characters in unexpected ways outside the training data, or some combination of both.

The net effect is that parsing keeps making forward progress after we hit an invalid utf-8 sequence. That matters when inference servers are streaming long message content and users expect those tokens to be streamed back as they're generated instead of buffered for long periods of time in our `StreamableParser`. See vllm-project/vllm#26480 for one such real-world scenario encountered in vLLM.
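For readers following along, here's a minimal sketch of the general technique (not the crate's actual internals): buffer undecoded bytes across tokens, emit U+FFFD only for sequences that are definitively invalid, and hold back a trailing incomplete sequence until more bytes arrive. `LossyDecoder` and `push_bytes` are hypothetical names used only for illustration.

```rust
const REPLACEMENT: char = '\u{FFFD}';

/// Hypothetical helper (not the actual StreamableParser internals):
/// buffers undecoded bytes between tokens and decodes them lossily.
struct LossyDecoder {
    pending: Vec<u8>,
}

impl LossyDecoder {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Feed the raw bytes for one decoded token; return whatever text can be
    /// emitted so far. Definitively invalid sequences become U+FFFD, while a
    /// trailing incomplete sequence stays buffered until more bytes arrive.
    fn push_bytes(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let mut out = String::new();
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    out.push_str(s);
                    self.pending.clear();
                    return out;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    out.push_str(std::str::from_utf8(&self.pending[..valid]).unwrap());
                    match e.error_len() {
                        // Invalid sequence of `len` bytes: replace it and keep going.
                        Some(len) => {
                            out.push(REPLACEMENT);
                            self.pending.drain(..valid + len);
                        }
                        // Incomplete sequence at the end: keep it for the next token.
                        None => {
                            self.pending.drain(..valid);
                            return out;
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let mut dec = LossyDecoder::new();
    // "Hi" plus the first two bytes of a four-byte emoji split across tokens.
    assert_eq!(dec.push_bytes(&[b'H', b'i', 0xF0, 0x9F]), "Hi");
    // The remaining two bytes complete the character (🎉 = F0 9F 8E 89).
    assert_eq!(dec.push_bytes(&[0x8E, 0x89]), "🎉");
    // A lone continuation byte can never become valid, so it is replaced.
    assert_eq!(dec.push_bytes(&[0x80, b'!']), "\u{FFFD}!");
}
```

The key distinction is `Utf8Error::error_len()`: `Some(_)` means the bytes can never become valid utf-8 (so they get replaced), while `None` means the sequence is merely incomplete and may be completed by bytes from the next token.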
Force-pushed from ff86d68 to 40be9ec
Ok, pulling this out of draft. My initial approach was conflating tokens and bytes, and to properly handle invalid utf-8 sequences in streamed content I've updated this to keep track of undecoded bytes instead of undecoded tokens.
Behavior Change

The main behavior change here is that instead of invalid utf-8 sequences breaking the rest of our stream processing (causing us to forever buffer, but never process, any subsequent tokens generated), we now replace those invalid sequences with the utf-8 replacement character and continue to process subsequent generated tokens. This essentially gives message content a lossy utf-8 decode rather than truncating it.

The previous behavior did not raise an error and did not use replacement characters; it just silently failed to parse any more generated text from that point onwards in the message content until we hit a stop token. Any content after the invalid utf-8 sequence was lost.

Tests

New tests were added before changing the implementation to exercise these invalid utf-8 cases.
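For a concrete sense of the difference: the substitution is the same kind the standard library's lossy decode performs. The snippet below is purely illustrative and doesn't use the parser itself; the "old behavior" comment reflects the dropped-content behavior described above.

```rust
fn main() {
    // "Hi", a stray continuation byte (0x80), then "!done".
    let bytes = [b'H', b'i', 0x80, b'!', b'd', b'o', b'n', b'e'];

    // New behavior: the invalid byte becomes U+FFFD and the rest still comes through.
    assert_eq!(String::from_utf8_lossy(&bytes), "Hi\u{FFFD}!done");

    // Old behavior (conceptually): everything after 0x80 was silently dropped,
    // so only "Hi" would ever have been surfaced for this message.
}
```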
scott-oai left a comment:
left some nits, biggest issue here is that i am fairly certain that invalid bytes will leak between messages. if you could add a test case and fix that would be great, otherwise i can take a stab at it later today
I had codex take a look to confirm that bytes were being leaked, and it came up with this test case, which does fail:

```rust
#[test]
fn test_streamable_parser_does_not_leak_bytes_between_messages() {
    // This test ensures that any partially decoded bytes from the first message
    // do not leak into the content of the next message.
    let encoding = load_harmony_encoding(HarmonyEncodingName::HarmonyGptOss).unwrap();
    // 9552 is known (in this tokenizer) to expand to bytes that form an incomplete
    // UTF-8 sequence when used alone, which exercises the streaming invalid-UTF-8 path.
    // Construct two assistant messages back-to-back where the first includes invalid
    // UTF-8 bytes and the second is simple ASCII text.
    let first_prefix = "<|start|>assistant<|message|>";
    let first_suffix = "<|end|>";
    let second_prefix = "<|start|>assistant<|message|>";
    let second_suffix = "<|end|>";
    let mut tokens = Vec::new();
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(first_prefix));
    // Two invalid tokens to ensure we end the first message with incomplete UTF-8 bytes.
    tokens.push(9552);
    tokens.push(9552);
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(first_suffix));
    // Second message should be clean and unaffected.
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(second_prefix));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens("Hi"));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(second_suffix));
    let mut parser = StreamableParser::new(encoding, None).unwrap();
    for t in tokens {
        parser.process(t).unwrap();
    }
    let messages = parser.messages();
    assert_eq!(messages.len(), 2, "expected two parsed messages");
    // Verify the second message content is exactly "Hi" (no leaked replacement chars or bytes).
    let second = &messages[1];
    let expected_second = Message::from_role_and_content(Role::Assistant, "Hi");
    assert_eq!(second, &expected_second, "second message must be clean and isolated");
}
```
Extract replacement character to constant, fix byte leakage between messages by clearing undecoded_bytes buffer at EOS, improve error handling for remaining undecoded tokens, and add comprehensive test coverage for edge cases.

Signed-off-by: Ben Browning <[email protected]>
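A minimal sketch of the buffer-clearing idea in that commit, assuming leftover bytes at a stop token are surfaced as a replacement character before the buffer is emptied. The `flush_pending_at_stop` helper is hypothetical, not the actual implementation; the point is that nothing buffered for one message can carry over into the next.

```rust
// Hypothetical helper: handle whatever is still buffered when a stop token
// (e.g. <|end|>) arrives, then clear the buffer for the next message.
fn flush_pending_at_stop(pending: &mut Vec<u8>, content: &mut String) {
    if !pending.is_empty() {
        content.push('\u{FFFD}'); // assumption: leftovers surface as U+FFFD
        pending.clear();
    }
}

fn main() {
    // An incomplete 4-byte sequence left buffered when the stop token arrives.
    let mut pending = vec![0xF0, 0x9F];
    let mut content = String::from("Hi");

    flush_pending_at_stop(&mut pending, &mut content);

    assert_eq!(content, "Hi\u{FFFD}");
    assert!(pending.is_empty()); // the next message starts from a clean buffer
}
```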
Thank you for the test and review. I'm glad you caught the bytes leaking between messages. I also merged in the latest changes from main and resolved those conflicts. I'm happy to make any other tweaks as requested, rebase and squash, etc., or let you take things from here if you're comfortable.
It looks like the macos-13 check got caught by an unrelated deprecation of macos-13 runners, per the job logs.
scott-oai left a comment:
lgtm thank you!