fix: handle UTF-8 multibyte sequences truncated at buffer boundary #91
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix: UTF-8 Multibyte Sequences Truncated at Buffer Boundary 2
Problem
Valid UTF-8 text files were incorrectly identified as binary when multibyte UTF-8 sequences were truncated at the 512-byte (
MAX_BYTES) buffer boundary.Root Cause
The latest commit version (
272db64) before this PR performs boundary checks likei + N < totalBytes(lines 239, 244, 250) before validating UTF-8 sequences. When a multibyte sequence starts near the buffer boundary (positions 509-511), this check fails because the continuation bytes at positions 512+ are beyondtotalBytes, preventing validation. As a result, valid UTF-8 sequences are flagged as suspicious bytes, causing text files to be misidentified as binary.Example:
Impact
This bug affects any text file containing:
The 7 test files added in this PR had 4 incorrectly identified as binary in the current version (as of commit
272db64) due to this truncation issue.Solution
Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the
MAX_BYTESboundary.Changes
scanBytesvariable to separate scan range from validation rangeMAX_BYTES + UTF8_BOUNDARY_RESERVE(515 bytes) to capture complete sequencesscanBytes(512 bytes)totalBytes(515 bytes)Implementation Details
The solution elegantly separates two concerns:
Scanning range (
scanBytes): Maintains the original 512-byte limit for:Validation range (
totalBytes): Extends to 515 bytes for:How It Works
This minimal change preserves all existing binary detection behavior while fixing the boundary truncation issue.
Tests
Added 7 Test Cases (All Text Files)
These files demonstrate various UTF-8 multibyte sequences at critical boundary positions:
Verification
All 7 files are valid text files containing UTF-8 sequences at the buffer boundary.
Current version (commit
272db64, before this fix):❌ 4 files incorrectly identified as BINARY:
✅ 3 files correctly identified as text
Reason: UTF-8 sequences at boundary cannot be validated, flagged as suspicious bytes. When suspicious bytes > 1, the logic at
272db64:src/index.ts#L277-L279:marks them as binary
After this PR:
Test Coverage
Related
d39d2c0)