
Conversation

@bbrowning
Contributor

Before, when parsing message content, if we came across any invalid utf-8 sequences, we'd forever accumulate them in undecoded_tokens and any subsequent content would get dropped when we eventually found our next stop token.

Now, we detect invalid utf-8 sequences, replace them with the utf-8 replacement character '\uFFFD', and continue parsing further content. In real-world scenarios, gpt-oss models sometimes generate invalid utf-8 sequences. This can be caused by temperature settings that are too high, prompts that use utf-8 characters in unexpected ways outside the training data, or some combination of both.

The net effect is that parsing keeps making forward progress after we hit an invalid utf-8 sequence. This matters when inference servers are streaming long message contents and users expect those tokens to be streamed back as they're generated, rather than buffered for long periods of time in our StreamableParser.

See vllm-project/vllm#26480 for one such real-world scenario encountered in vLLM.
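
For readers unfamiliar with replacement-character decoding, here is a minimal standalone Rust sketch (illustrative only, not the parser code itself) of the behavior this change adopts: a stray invalid byte becomes U+FFFD and the surrounding text is preserved.

fn main() {
    // "Hi" followed by a stray continuation byte (0x80), then more valid text.
    let bytes: &[u8] = &[b'H', b'i', 0x80, b'!', b'!'];
    // from_utf8_lossy substitutes U+FFFD for invalid sequences instead of
    // dropping everything that follows them.
    let decoded = String::from_utf8_lossy(bytes);
    assert_eq!(decoded, "Hi\u{FFFD}!!");
}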

@bbrowning bbrowning marked this pull request as draft October 14, 2025 12:23
@bbrowning
Contributor Author

Converting this to draft, as I realize I may be making too many assumptions here about how tokens map across byte boundaries for utf-8 characters. I'm going to add a few more tests for edge cases and likely update this to keep track of undecoded bytes instead of undecoded tokens.

@bbrowning bbrowning force-pushed the utf8-replacement-char branch from ff86d68 to 40be9ec Compare October 14, 2025 20:08
@bbrowning bbrowning marked this pull request as ready for review October 14, 2025 20:08
@bbrowning
Contributor Author

bbrowning commented Oct 14, 2025

Ok, pulling this out of draft. My initial approach was conflating tokens and bytes. To properly handle invalid utf-8 sequences in StreamableParser without losing any surrounding text, we have to keep track of the undecoded_bytes, so that as we get error_len and valid_len back from the Utf8Error we can use those to isolate the failing byte(s) and replace just those with the replacement character in our output stream of text.
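
A rough, self-contained sketch of that bookkeeping (a hypothetical helper, not the actual StreamableParser code) might look like the following, using valid_up_to() and error_len() from std::str::Utf8Error:

use std::str;

// Illustrative sketch: decode as much of the buffered bytes as possible,
// emitting U+FFFD for definitively invalid bytes and keeping a possibly
// incomplete trailing sequence buffered for later tokens.
fn drain_utf8(undecoded_bytes: &mut Vec<u8>, out: &mut String) {
    loop {
        // Inspect the buffer without holding a borrowed &str across mutation.
        let (valid_len, error_len) = match str::from_utf8(undecoded_bytes.as_slice()) {
            Ok(_) => (undecoded_bytes.len(), None),
            Err(err) => (err.valid_up_to(), err.error_len()),
        };
        // The prefix up to valid_len is guaranteed to be valid UTF-8.
        out.push_str(str::from_utf8(&undecoded_bytes[..valid_len]).unwrap());
        match error_len {
            // Definitively invalid bytes: emit U+FFFD and keep decoding.
            Some(len) => {
                out.push('\u{FFFD}');
                undecoded_bytes.drain(..valid_len + len);
            }
            // Either everything decoded, or only a possibly-incomplete
            // trailing sequence remains; leave it buffered for now.
            None => {
                undecoded_bytes.drain(..valid_len);
                return;
            }
        }
    }
}

The key detail is that error_len() returning None means the trailing bytes are merely incomplete rather than invalid, so they stay buffered until more tokens arrive.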

@bbrowning bbrowning changed the title Replace invalid utf-8 sequences with the replacement character StreamableParser: Use replacement character for invalid utf-8 sequences Oct 14, 2025
@bbrowning
Contributor Author

Behavior Change

The main behavior change: instead of an invalid utf-8 sequence breaking the rest of our stream processing and causing us to buffer, but never process, any subsequently generated tokens, we now replace the invalid sequence with the utf-8 replacement character and continue processing subsequent generated tokens.

This essentially makes StreamableParser mimic the behavior of the Python HarmonyEncoding.decode method, but applied to streaming decoding.

The previous behavior did not raise an error, did not use replacement characters, and just silently failed to parse any more generated text from that point onwards in the message content until we hit a stop token. Any content after the invalid utf-8 sequence was lost.
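
As a hypothetical usage snippet (modeled on the API calls in the test case further down this thread, not one of this PR's actual tests), the new behavior means text after an invalid token survives parsing; exactly how many replacement characters token 9552 produces depends on the bytes it decodes to in this tokenizer.

#[test]
fn sketch_content_after_invalid_utf8_survives() {
    let encoding = load_harmony_encoding(HarmonyEncodingName::HarmonyGptOss).unwrap();

    let mut tokens = Vec::new();
    tokens.extend(encoding.tokenizer().encode_with_special_tokens("<|start|>assistant<|message|>"));
    // Stands in for a token whose bytes are not valid UTF-8 on their own.
    tokens.push(9552);
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(" and the rest"));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens("<|end|>"));

    let mut parser = StreamableParser::new(encoding, None).unwrap();
    for t in tokens {
        parser.process(t).unwrap();
    }

    // Previously everything after the invalid token was dropped; now the
    // parsed message should still contain " and the rest", with the invalid
    // bytes rendered as '\u{FFFD}'.
    println!("{:?}", parser.messages());
}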

Tests

New tests before changing StreamableParser

Looking at just the newly added tests, before my changes to StreamableParser, four of them fail and one passes:

....
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding FAILED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_split_across_tokens FAILED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token FAILED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token_no_eos_marker FAILED
tests/test_harmony.py::test_streamable_parser_tricky_utf8_decoding PASSED

New tests after changes to StreamableParser

After the changes, all of the new tests pass, as do all of the existing tests.

...
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding PASSED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_split_across_tokens PASSED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token PASSED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token_no_eos_marker PASSED
tests/test_harmony.py::test_streamable_parser_tricky_utf8_decoding PASSED

The test_streamable_parser_tricky_utf8_decoding test was added to ensure I didn't mess up utf-8 handling for 1-byte, 2-byte, 3-byte, and 4-byte sequences, right-to-left text, and similar cases that would expose improper handling of multi-byte sequences.
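
For illustration only (these are not the actual strings used in that test), the byte-width classes such a test needs to exercise look like this:

#[test]
fn utf8_width_classes_round_trip() {
    // Illustrative samples: one string per UTF-8 byte-width class, plus a
    // right-to-left script.
    let samples = ["a", "é", "€", "😀", "שלום"];
    for s in samples {
        // Each sample should survive a bytes -> String round trip unchanged.
        assert_eq!(String::from_utf8(s.as_bytes().to_vec()).unwrap(), s);
    }
}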

Collaborator

@scott-oai scott-oai left a comment


left some nits, biggest issue here is that i am fairly certain that invalid bytes will leak between messages. if you could add a test case and fix that would be great, otherwise i can take a stab at it later today

@scott-oai
Collaborator

I had codex take a look to confirm that bytes were being leaked, and it came up with this test case, which does fail:

#[test]
fn test_streamable_parser_does_not_leak_bytes_between_messages() {
    // This test ensures that any partially decoded bytes from the first message
    // do not leak into the content of the next message.
    let encoding = load_harmony_encoding(HarmonyEncodingName::HarmonyGptOss).unwrap();

    // 9552 is known (in this tokenizer) to expand to bytes that form an incomplete
    // UTF-8 sequence when used alone, which exercises the streaming invalid-UTF-8 path.
    // Construct two assistant messages back-to-back where the first includes invalid
    // UTF-8 bytes and the second is simple ASCII text.
    let first_prefix = "<|start|>assistant<|message|>";
    let first_suffix = "<|end|>";
    let second_prefix = "<|start|>assistant<|message|>";
    let second_suffix = "<|end|>";

    let mut tokens = Vec::new();
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(first_prefix));
    // Two invalid tokens to ensure we end the first message with incomplete UTF-8 bytes.
    tokens.push(9552);
    tokens.push(9552);
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(first_suffix));

    // Second message should be clean and unaffected.
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(second_prefix));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens("Hi"));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(second_suffix));

    let mut parser = StreamableParser::new(encoding, None).unwrap();
    for t in tokens {
        parser.process(t).unwrap();
    }

    let messages = parser.messages();
    assert_eq!(messages.len(), 2, "expected two parsed messages");

    // Verify the second message content is exactly "Hi" (no leaked replacement chars or bytes).
    let second = &messages[1];
    let expected_second = Message::from_role_and_content(Role::Assistant, "Hi");
    assert_eq!(second, &expected_second, "second message must be clean and isolated");
}

Extract replacement character to constant, fix byte leakage between messages
by clearing undecoded_bytes buffer at EOS, improve error handling for
remaining undecoded tokens, and add comprehensive test coverage for edge cases.

Signed-off-by: Ben Browning <[email protected]>
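
For context, a minimal sketch of what "clearing the undecoded_bytes buffer at EOS" could look like; the function signature and the single-replacement-character choice are assumptions for illustration, and the real fix in this PR may differ.

// Hypothetical EOS handling: flush anything still buffered so it cannot leak
// into the next message's content.
fn on_end_of_message(undecoded_bytes: &mut Vec<u8>, content: &mut String) {
    if !undecoded_bytes.is_empty() {
        // Bytes still buffered at a stop token can never become valid UTF-8,
        // so surface them as a replacement character instead of carrying them
        // over into the next message.
        content.push('\u{FFFD}');
        undecoded_bytes.clear();
    }
}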
@bbrowning
Contributor Author

Thank you for the test and review. I'm glad you caught the undecoded_bytes leakage across parser usage, and I incorporated your test, as well as a few others (which I had my local Cursor with gpt-5-codex help write), into the Rust test suite. This includes an additional test for the edge case you pointed out around handling any remaining undecoded tokens and utf-8 parsing once we're processing the eos token.

I also merged in the latest changes from main and resolved the conflicts. I'm happy to make any other tweaks as requested, rebase and squash, etc., or let you take things from here if you're comfortable.

@bbrowning
Contributor Author

It looks like the macos-13 check got caught by an unrelated deprecation of macos-13 runners - from the job logs:

This is a scheduled macos-13 brownout. The macOS-13 based runner images are being deprecated. For more details, see https://github.com/actions/runner-images/issues/13046.

Collaborator

@scott-oai scott-oai left a comment


lgtm thank you!

@scott-oai scott-oai merged commit d59b0ad into openai:main Nov 5, 2025
17 of 18 checks passed
@bbrowning bbrowning deleted the utf8-replacement-char branch November 5, 2025 19:54