fix: handle UTF-8 multibyte sequences truncated at buffer boundary #90

f14XuanLv · 2025-11-08T12:12:23Z

Fix: UTF-8 Multibyte Sequences Truncated at Buffer Boundary

Problem

Valid UTF-8 text files were incorrectly identified as binary when multibyte UTF-8 sequences were truncated at the 512-byte (MAX_BYTES) buffer boundary.

Root Cause

The implementation as of commit 272db64 (the latest version before this PR) performs boundary checks like i + N < totalBytes before validating UTF-8 sequences. When a multibyte sequence starts near the buffer boundary (positions 509-511), this check fails because the continuation bytes at positions 512+ are beyond totalBytes, preventing validation. As a result, valid UTF-8 sequences are flagged as suspicious bytes, causing text files to be misidentified as binary.

Example:

File: [509 bytes] + [0xF0 0x9F 0x98 0x80]  (4-byte emoji at positions 509-512)
                     └─ Position 509 ─┘    └─ Position 512 (unreachable)

Original code check:
  if (i + 3 < totalBytes)  // 509 + 3 < 512? → 512 < 512? → FALSE ❌
  // Validation never happens, 0xF0 marked as suspicious

Impact

This bug affects any text file containing:

Multibyte UTF-8 sequences (2-4 bytes) that cross the 512-byte boundary
Common in files with: Chinese, Japanese, Korean, emoji, or other non-ASCII Unicode characters

The 7 test files added in this PR had 4 incorrectly identified as binary in the current version (as of commit 272db64) due to this truncation issue.

Solution

Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the MAX_BYTES boundary.

Changes

Read extra bytes (READ_BUFFER_SIZE = MAX_BYTES + MAX_UTF8_BYTE_LENGTH - 1 = 515) to validate sequences that start near MAX_BYTES boundary
Add type-safe configuration (ValidUTF8Length = 1|2|3|4) to enforce valid UTF-8 sequence lengths at compile time
Refactor UTF-8 detection to generic boundary-aware validation that separates detection (sequence length) from validation (continuation bytes)
Fix failed sequence handling to match original behavior (skip first continuation byte to avoid double-counting suspicious bytes)

Technical Details

Separates detection from validation to eliminate circular dependency (need to know sequence length before setting boundary parameters)
Uses bytesRead for validation range while maintaining totalBytes for scan range (scan 0-511, but can validate up to byte 514)
Maintains translation invariance: changing MAX_BYTES or MAX_UTF8_BYTE_LENGTH automatically adjusts all logic without code changes

How It Works

// Original: Hard-coded boundary check
if (i + 3 < totalBytes) { /* validate 4-byte sequence */ }
// ❌ Fails at boundary: 509 + 3 < 512 → FALSE

// Fixed: Dynamic boundary-aware validation
expectedLength = 4;
maxCheckLimit = Math.min(i + expectedLength, bytesRead);
if (maxCheckLimit >= i + expectedLength) { /* validate */ }
// ✅ Works at boundary: min(509+4, 515) >= 513 → TRUE

Tests

Added 7 Test Cases (All Text Files)

These files demonstrate various UTF-8 multibyte sequences at critical boundary positions:

508A-4byte.txt - 4-byte UTF-8 sequence starting at position 508
509A-3byte.txt - 3-byte UTF-8 sequence starting at position 509
509A-4byte.txt - 4-byte UTF-8 sequence starting at position 509
510A-2byte.txt - 2-byte UTF-8 sequence starting at position 510
510A-3byte.txt - 3-byte UTF-8 sequence starting at position 510
510A-4byte.txt - 4-byte UTF-8 sequence starting at position 510
utf8-boundary-truncation_case.py - Real-world Python file with Chinese characters (3-byte UTF-8 at positions 510-512)

Verification

All 7 files are valid text files containing UTF-8 sequences at the buffer boundary.

Current version (commit 272db64, before this fix):

❌ 4 files incorrectly identified as BINARY:
- 509A-4byte.txt
- 510A-3byte.txt
- 510A-4byte.txt
- utf8-boundary-truncation_case.py
✅ 3 files correctly identified as text
Reason: UTF-8 sequences at boundary cannot be validated, flagged as suspicious bytes. When suspicious bytes > 1, the src/index.ts line 277~279
```
  if (suspiciousBytes > 1 && isBinaryProto(fileBuffer, scanBytes)) {
    return true;
  }
```
marks them as binary

After this PR:

✅ All 7 files correctly identified as TEXT
✅ Total: 40/40 tests pass

Test Coverage

npm test
# Test Suites: 1 passed, 1 total
# Tests:       40 passed, 40 total

Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the MAX_BYTES boundary. Changes: - Read extra bytes (READ_BUFFER_SIZE = MAX_BYTES + MAX_UTF8_BYTE_LENGTH - 1) to validate sequences that start near MAX_BYTES boundary - Add type-safe configuration (ValidUTF8Length = 1|2|3|4) - Refactor UTF-8 detection to generic boundary-aware validation - Fix failed sequence handling to match original behavior (skip first continuation byte to avoid double-counting) Tests: - Add 6 boundary test cases (508A-4byte, 509A-3byte, 509A-4byte, 510A-2byte, 510A-3byte, 510A-4byte) - All 39 tests pass Technical details: - Separates detection (sequence length) from validation (continuation bytes) to eliminate circular dependency - Uses bytesRead for validation range while maintaining totalBytes for scan range - Maintains translation invariance: changing MAX_BYTES or MAX_UTF8_BYTE_LENGTH automatically adjusts all logic

Add utf8-boundary-truncation_case.py as a regression test for the boundary truncation bug. This Python file contains Chinese characters with a 3-byte UTF-8 sequence (0xE5 0xAF 0xB8) at positions 510-512, which triggers the original bug where sequences truncated at the 512-byte boundary were incorrectly flagged as binary. The file demonstrates a real-world scenario where valid text files were misidentified due to UTF-8 multibyte sequences crossing the buffer boundary. Test: Expect false (text file, not binary)

Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the MAX_BYTES boundary. Changes: - Introduce scanBytes variable to separate scan range from validation range - Read extra bytes (MAX_BYTES + UTF8_BOUNDARY_RESERVE) to capture complete sequences at boundary - Maintain MAX_BYTES scan limit for binary detection logic - Enable UTF-8 validation to access up to MAX_BYTES + UTF8_BOUNDARY_RESERVE Tests: - Add 7 boundary test cases including real-world Python file with Chinese characters (utf8-boundary-truncation_case.py) - Covers 2/3/4-byte UTF-8 sequences at positions near MAX_BYTES boundary - All 40 tests pass Technical details: - Minimal change preserving all existing UTF-8 detection logic - scanBytes controls loop boundary and percentage calculations - totalBytes allows validation of sequences crossing MAX_BYTES boundary - Maintains backward compatibility and binary detection thresholds This addresses the same issue as PR gjtorikian#90 but with a simpler, more maintainable approach. If accepted, PR gjtorikian#90 will be closed.

f14XuanLv · 2025-11-08T16:54:10Z

Update: Superseded by #91

After further refinement, I've created PR #91 which solves the same UTF-8 boundary truncation issue with a simpler, more maintainable approach.

Key Improvements in #91:

✅ Minimal code changes (+91/-9 vs +131/-28 in this PR)
✅ Cleaner solution using scanBytes variable to separate scan range from validation range

If #91 gets merged, I will close this PR. Both PRs fix the same bug with identical test coverage, but #91 achieves it with less complexity.

f14XuanLv added 2 commits November 8, 2025 20:10

f14XuanLv marked this pull request as draft November 8, 2025 15:31

f14XuanLv mentioned this pull request Nov 8, 2025

fix: handle UTF-8 multibyte sequences truncated at buffer boundary #91

Merged

gjtorikian closed this in #91 Nov 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix: handle UTF-8 multibyte sequences truncated at buffer boundary #90

fix: handle UTF-8 multibyte sequences truncated at buffer boundary #90

Uh oh!

f14XuanLv commented Nov 8, 2025 •

edited

Loading

Uh oh!

f14XuanLv commented Nov 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

fix: handle UTF-8 multibyte sequences truncated at buffer boundary #90

fix: handle UTF-8 multibyte sequences truncated at buffer boundary #90

Uh oh!

Conversation

f14XuanLv commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix: UTF-8 Multibyte Sequences Truncated at Buffer Boundary

Problem

Root Cause

Impact

Solution

Changes

Technical Details

How It Works

Tests

Added 7 Test Cases (All Text Files)

Verification

Test Coverage

Related

Uh oh!

f14XuanLv commented Nov 8, 2025

Update: Superseded by #91

Key Improvements in #91:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

f14XuanLv commented Nov 8, 2025 •

edited

Loading