fix: handle UTF-8 multibyte sequences truncated at buffer boundary #90
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix: UTF-8 Multibyte Sequences Truncated at Buffer Boundary
Problem
Valid UTF-8 text files were incorrectly identified as binary when multibyte UTF-8 sequences were truncated at the 512-byte (
MAX_BYTES) buffer boundary.Root Cause
The implementation as of commit
272db64(the latest version before this PR) performs boundary checks likei + N < totalBytesbefore validating UTF-8 sequences. When a multibyte sequence starts near the buffer boundary (positions 509-511), this check fails because the continuation bytes at positions 512+ are beyondtotalBytes, preventing validation. As a result, valid UTF-8 sequences are flagged as suspicious bytes, causing text files to be misidentified as binary.Example:
Impact
This bug affects any text file containing:
The 7 test files added in this PR had 4 incorrectly identified as binary in the current version (as of commit
272db64) due to this truncation issue.Solution
Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the
MAX_BYTESboundary.Changes
READ_BUFFER_SIZE = MAX_BYTES + MAX_UTF8_BYTE_LENGTH - 1 = 515) to validate sequences that start nearMAX_BYTESboundaryValidUTF8Length = 1|2|3|4) to enforce valid UTF-8 sequence lengths at compile timeTechnical Details
bytesReadfor validation range while maintainingtotalBytesfor scan range (scan 0-511, but can validate up to byte 514)MAX_BYTESorMAX_UTF8_BYTE_LENGTHautomatically adjusts all logic without code changesHow It Works
Tests
Added 7 Test Cases (All Text Files)
These files demonstrate various UTF-8 multibyte sequences at critical boundary positions:
Verification
All 7 files are valid text files containing UTF-8 sequences at the buffer boundary.
Current version (commit
272db64, before this fix):❌ 4 files incorrectly identified as BINARY:
✅ 3 files correctly identified as text
Reason: UTF-8 sequences at boundary cannot be validated, flagged as suspicious bytes. When suspicious bytes > 1, the
src/index.tsline 277~279marks them as binary
After this PR:
Test Coverage
Related
d39d2c0)