
Conversation

@bbrowning
Contributor

Before, when parsing message content, if we came across any invalid utf-8 sequences, we'd forever accumulate them in undecoded_tokens and any subsequent content would get dropped when we eventually found our next stop token.

Now, we detect invalid utf-8 sequences, replace them with the utf-8 replacement character '\uFFFD', and continue parsing further content. In real-world scenarios, gpt-oss models sometimes generate invalid utf-8 sequences. This can be caused by temperature settings that are too high, prompts that use utf-8 characters in unexpected ways outside the training data, or some combination of both.

The net effect is that parsing keeps making forward progress after we hit an invalid utf-8 sequence. This matters when inference servers are streaming long message contents and users expect those tokens to be streamed back as they're generated, rather than buffered for long periods of time in our StreamableParser.

See vllm-project/vllm#26480 for one such real-world scenario encountered in vLLM.
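
For readers unfamiliar with replacement-character decoding, here is a minimal standalone Rust sketch (illustrative only, not the parser code itself) of the behavior this change adopts: a stray invalid byte becomes U+FFFD and the surrounding text is preserved.

fn main() {
    // "Hi" followed by a stray continuation byte (0x80), then more valid text.
    let bytes: &[u8] = &[b'H', b'i', 0x80, b'!', b'!'];
    // from_utf8_lossy substitutes U+FFFD for invalid sequences instead of
    // dropping everything that follows them.
    let decoded = String::from_utf8_lossy(bytes);
    assert_eq!(decoded, "Hi\u{FFFD}!!");
}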

@bbrowning bbrowning marked this pull request as draft October 14, 2025 12:23
@bbrowning
Contributor Author

Converting this to draft, as I realize I may be making too many assumptions here about how tokens map across byte boundaries for utf-8 characters. I'm going to add a few more tests for edge cases and likely update this to keep track of undecoded bytes instead of undecoded tokens.

@bbrowning bbrowning force-pushed the utf8-replacement-char branch from ff86d68 to 40be9ec Compare October 14, 2025 20:08
@bbrowning bbrowning marked this pull request as ready for review October 14, 2025 20:08
@bbrowning
Contributor Author

bbrowning commented Oct 14, 2025

Ok, pulling this out of draft. My initial approach was conflating tokens and bytes. To properly handle invalid utf-8 sequences in StreamableParser without losing any surrounding text, we have to keep track of the undecoded_bytes, so that as we get error_len and valid_len back from the Utf8Error we can use those to isolate the failing byte(s) and replace just those with the replacement character in our output stream of text.
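
A rough, self-contained sketch of that bookkeeping (a hypothetical helper, not the actual StreamableParser code) might look like the following, using valid_up_to() and error_len() from std::str::Utf8Error:

use std::str;

// Illustrative sketch: decode as much of the buffered bytes as possible,
// emitting U+FFFD for definitively invalid bytes and keeping a possibly
// incomplete trailing sequence buffered for later tokens.
fn drain_utf8(undecoded_bytes: &mut Vec<u8>, out: &mut String) {
    loop {
        // Inspect the buffer without holding a borrowed &str across mutation.
        let (valid_len, error_len) = match str::from_utf8(undecoded_bytes.as_slice()) {
            Ok(_) => (undecoded_bytes.len(), None),
            Err(err) => (err.valid_up_to(), err.error_len()),
        };
        // The prefix up to valid_len is guaranteed to be valid UTF-8.
        out.push_str(str::from_utf8(&undecoded_bytes[..valid_len]).unwrap());
        match error_len {
            // Definitively invalid bytes: emit U+FFFD and keep decoding.
            Some(len) => {
                out.push('\u{FFFD}');
                undecoded_bytes.drain(..valid_len + len);
            }
            // Either everything decoded, or only a possibly-incomplete
            // trailing sequence remains; leave it buffered for now.
            None => {
                undecoded_bytes.drain(..valid_len);
                return;
            }
        }
    }
}

The key detail is that error_len() returning None means the trailing bytes are merely incomplete rather than invalid, so they stay buffered until more tokens arrive.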

@bbrowning bbrowning changed the title Replace invalid utf-8 sequences with the replacement character StreamableParser: Use replacement character for invalid utf-8 sequences Oct 14, 2025
@bbrowning
Contributor Author

Behavior Change

The main behavior change: instead of an invalid utf-8 sequence breaking the rest of our stream processing and causing us to buffer, but never process, any subsequently generated tokens, we now replace the invalid sequence with the utf-8 replacement character and continue processing subsequent generated tokens.

This essentially makes StreamableParser mimic the behavior of the Python HarmonyEncoding.decode method, but applied to streaming decoding.

The previous behavior did not raise an error, did not use replacement characters, and just silently failed to parse any more generated text from that point onwards in the message content until we hit a stop token. Any content after the invalid utf-8 sequence was lost.
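
As a hypothetical usage snippet (modeled on the API calls in the test case further down this thread, not one of this PR's actual tests), the new behavior means text after an invalid token survives parsing; exactly how many replacement characters token 9552 produces depends on the bytes it decodes to in this tokenizer.

#[test]
fn sketch_content_after_invalid_utf8_survives() {
    let encoding = load_harmony_encoding(HarmonyEncodingName::HarmonyGptOss).unwrap();

    let mut tokens = Vec::new();
    tokens.extend(encoding.tokenizer().encode_with_special_tokens("<|start|>assistant<|message|>"));
    // Stands in for a token whose bytes are not valid UTF-8 on their own.
    tokens.push(9552);
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(" and the rest"));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens("<|end|>"));

    let mut parser = StreamableParser::new(encoding, None).unwrap();
    for t in tokens {
        parser.process(t).unwrap();
    }

    // Previously everything after the invalid token was dropped; now the
    // parsed message should still contain " and the rest", with the invalid
    // bytes rendered as '\u{FFFD}'.
    println!("{:?}", parser.messages());
}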

Tests

New tests before changing StreamableParser

Looking at just the newly added tests, before my changes to StreamableParser, four of them fail and one passes:

....
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding FAILED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_split_across_tokens FAILED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token FAILED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token_no_eos_marker FAILED
tests/test_harmony.py::test_streamable_parser_tricky_utf8_decoding PASSED

New tests after changes to StreamableParser

After the changes, all of the new tests pass, as do all of the existing tests.

...
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding PASSED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_split_across_tokens PASSED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token PASSED
tests/test_harmony.py::test_streamable_parser_invalid_utf8_decoding_multi_byte_token_no_eos_marker PASSED
tests/test_harmony.py::test_streamable_parser_tricky_utf8_decoding PASSED

The test_streamable_parser_tricky_utf8_decoding test was added to ensure I didn't mess up utf-8 handling for 1-byte, 2-byte, 3-byte, and 4-byte sequences, right-to-left text, and similar cases that would expose improper handling of multi-byte sequences.
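
For illustration only (these are not the actual strings used in that test), the byte-width classes such a test needs to exercise look like this:

#[test]
fn utf8_width_classes_round_trip() {
    // Illustrative samples: one string per UTF-8 byte-width class, plus a
    // right-to-left script.
    let samples = ["a", "é", "€", "😀", "שלום"];
    for s in samples {
        // Each sample should survive a bytes -> String round trip unchanged.
        assert_eq!(String::from_utf8(s.as_bytes().to_vec()).unwrap(), s);
    }
}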

Collaborator

@scott-oai scott-oai left a comment


left some nits, biggest issue here is that i am fairly certain that invalid bytes will leak between messages. if you could add a test case and fix that would be great, otherwise i can take a stab at it later today

@scott-oai
Collaborator

I had codex take a look to confirm that bytes were being leaked, and it came up with this test case, which does fail:

#[test]
fn test_streamable_parser_does_not_leak_bytes_between_messages() {
    // This test ensures that any partially decoded bytes from the first message
    // do not leak into the content of the next message.
    let encoding = load_harmony_encoding(HarmonyEncodingName::HarmonyGptOss).unwrap();

    // 9552 is known (in this tokenizer) to expand to bytes that form an incomplete
    // UTF-8 sequence when used alone, which exercises the streaming invalid-UTF-8 path.
    // Construct two assistant messages back-to-back where the first includes invalid
    // UTF-8 bytes and the second is simple ASCII text.
    let first_prefix = "<|start|>assistant<|message|>";
    let first_suffix = "<|end|>";
    let second_prefix = "<|start|>assistant<|message|>";
    let second_suffix = "<|end|>";

    let mut tokens = Vec::new();
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(first_prefix));
    // Two invalid tokens to ensure we end the first message with incomplete UTF-8 bytes.
    tokens.push(9552);
    tokens.push(9552);
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(first_suffix));

    // Second message should be clean and unaffected.
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(second_prefix));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens("Hi"));
    tokens.extend(encoding.tokenizer().encode_with_special_tokens(second_suffix));

    let mut parser = StreamableParser::new(encoding, None).unwrap();
    for t in tokens {
        parser.process(t).unwrap();
    }

    let messages = parser.messages();
    assert_eq!(messages.len(), 2, "expected two parsed messages");

    // Verify the second message content is exactly "Hi" (no leaked replacement chars or bytes).
    let second = &messages[1];
    let expected_second = Message::from_role_and_content(Role::Assistant, "Hi");
    assert_eq!(second, &expected_second, "second message must be clean and isolated");
}

Extract replacement character to constant, fix byte leakage between messages
by clearing undecoded_bytes buffer at EOS, improve error handling for
remaining undecoded tokens, and add comprehensive test coverage for edge cases.

Signed-off-by: Ben Browning <[email protected]>
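
For context, a minimal sketch of what "clearing the undecoded_bytes buffer at EOS" could look like; the function signature and the single-replacement-character choice are assumptions for illustration, and the real fix in this PR may differ.

// Hypothetical EOS handling: flush anything still buffered so it cannot leak
// into the next message's content.
fn on_end_of_message(undecoded_bytes: &mut Vec<u8>, content: &mut String) {
    if !undecoded_bytes.is_empty() {
        // Bytes still buffered at a stop token can never become valid UTF-8,
        // so surface them as a replacement character instead of carrying them
        // over into the next message.
        content.push('\u{FFFD}');
        undecoded_bytes.clear();
    }
}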
@bbrowning
Contributor Author

Thank you for the test and review. I'm glad you caught the undecoded_bytes leakage across parser usage, and I incorporated your test, as well as a few others (which I had my local Cursor with gpt-5-codex help write), into the Rust test suite. This includes an additional test for the edge case you pointed out around handling any remaining undecoded tokens and utf-8 parsing once we're processing the eos token.

I also merged in the latest changes from main and resolved the conflicts. I'm happy to make any other tweaks as requested, rebase and squash, etc., or let you take things from here if you're comfortable.

@bbrowning
Contributor Author

It looks like the macos-13 check got caught by an unrelated deprecation of macos-13 runners - from the job logs:

This is a scheduled macos-13 brownout. The macOS-13 based runner images are being deprecated. For more details, see https://github.com/actions/runner-images/issues/13046.

Collaborator

@scott-oai scott-oai left a comment


lgtm thank you!

@scott-oai scott-oai merged commit d59b0ad into openai:main Nov 5, 2025
17 of 18 checks passed
@bbrowning bbrowning deleted the utf8-replacement-char branch November 5, 2025 19:54