@f14XuanLv
Contributor

@f14XuanLv f14XuanLv commented Nov 8, 2025

Fix: UTF-8 Multibyte Sequences Truncated at Buffer Boundary 2

Problem

Valid UTF-8 text files were incorrectly identified as binary when multibyte UTF-8 sequences were truncated at the 512-byte (MAX_BYTES) buffer boundary.

Root Cause

At the latest commit before this PR (272db64), the code performs boundary checks of the form i + N < totalBytes (lines 239, 244, 250) before validating a UTF-8 sequence. When a multibyte sequence starts near the buffer boundary (positions 509-511) and extends past it, the check fails because the continuation bytes at positions 512+ lie beyond totalBytes, so the sequence is never validated. Its bytes are then counted as suspicious, and the text file is misidentified as binary.

Example:

File: [509 bytes] + [0xF0  0x9F  0x98  0x80]  (4-byte emoji at positions 509-512)
                    └Position509-511┘  └Position512 (unreachable)

Original code check:
  if (i + 3 < totalBytes)  // 509 + 3 < 512? → 512 < 512? → FALSE ❌
  // Validation never happens, 0xF0 marked as suspicious
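
The arithmetic in this example can be reproduced in isolation. The following is a minimal sketch, not the library's code: MAX_BYTES mirrors the 512-byte limit named above, and canValidateFourByte is a hypothetical helper that applies the same i + 3 < totalBytes guard.

```typescript
// Sketch only: reproduces the pre-fix guard on a buffer truncated at 512 bytes.
const MAX_BYTES = 512; // limit named in the PR description

// Hypothetical helper: the same guard the original check applies to a 4-byte lead.
function canValidateFourByte(i: number, totalBytes: number): boolean {
  return i + 3 < totalBytes;
}

// 509 ASCII bytes followed by U+1F600 (F0 9F 98 80): 513 bytes in total.
const file = new Uint8Array(513).fill(0x41);
file.set([0xf0, 0x9f, 0x98, 0x80], 509);

// Only the first 512 bytes are read, so the last continuation byte is cut off.
const truncated = file.subarray(0, MAX_BYTES);

const leadIndex = 509;
const reachable = canValidateFourByte(leadIndex, truncated.length);
console.log(reachable); // false: 509 + 3 = 512 is not < 512
```

With a 515-byte read the same call returns true, which is the case the fix targets.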

Impact

This bug affects any text file in which a multibyte UTF-8 sequence (2-4 bytes) crosses the 512-byte boundary. Such files are common when the content includes:

  • Chinese, Japanese, or Korean text
  • Emoji or other non-ASCII Unicode characters

Of the 7 test files added in this PR, 4 are incorrectly identified as binary in the current version (as of commit 272db64) because of this truncation issue.

Solution

Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the MAX_BYTES boundary.

Changes

  • Introduce scanBytes variable to separate scan range from validation range
  • Increase read buffer to MAX_BYTES + UTF8_BOUNDARY_RESERVE (515 bytes) to capture complete sequences
  • Maintain original scan logic by limiting the main loop to scanBytes (512 bytes)
  • Enable cross-boundary validation by allowing UTF-8 checks to access up to totalBytes (515 bytes)
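
The 515-byte figure follows from UTF-8 sequence lengths: a lead byte at the last scanned position (511) can need up to three continuation bytes. The sketch below works through that arithmetic. The reserve value of 3 is inferred from 515 - 512, utf8SequenceLength and bytesPastBoundary are hypothetical helpers, and the lead-byte ranges follow RFC 3629, which is narrower than the 0xf7 upper bound used in the PR's own check.

```typescript
// Sketch only: derives the 3-byte reserve from UTF-8 sequence lengths.
const MAX_BYTES = 512;

// Hypothetical helper: expected sequence length for a UTF-8 lead byte,
// or 0 if the byte is not a multibyte lead (ranges per RFC 3629).
function utf8SequenceLength(lead: number): number {
  if (lead >= 0xc2 && lead <= 0xdf) return 2;
  if (lead >= 0xe0 && lead <= 0xef) return 3;
  if (lead >= 0xf0 && lead <= 0xf4) return 4;
  return 0;
}

const MAX_SEQUENCE_LENGTH = 4;
// A lead byte at index MAX_BYTES - 1 needs this many bytes past the boundary.
const UTF8_BOUNDARY_RESERVE = MAX_SEQUENCE_LENGTH - 1; // 3
const readSize = MAX_BYTES + UTF8_BOUNDARY_RESERVE; // 515

// Hypothetical helper: how many bytes of a sequence starting at `start`
// fall at or beyond index MAX_BYTES.
function bytesPastBoundary(start: number, lead: number): number {
  const end = start + utf8SequenceLength(lead); // exclusive end index
  return Math.max(0, end - MAX_BYTES);
}

console.log(readSize); // 515
console.log(bytesPastBoundary(511, 0xf0)); // 3, the worst case
```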

Implementation Details

The solution separates two concerns:

  1. Scanning range (scanBytes): maintains the original 512-byte limit for:

    • Main loop iteration boundary
    • Suspicious byte percentage calculations
    • Binary detection thresholds

  2. Validation range (totalBytes): extends to 515 bytes for:

    • UTF-8 continuation byte validation
    • Complete sequence verification

How It Works

// New approach: Separate scan from validation
const totalBytes = Math.min(bytesRead, MAX_BYTES + UTF8_BOUNDARY_RESERVE); // Validate up to 515
const scanBytes = Math.min(totalBytes, MAX_BYTES);           // Loop up to 512

// Main loop uses scanBytes
for (let i = 0; i < scanBytes; i++) {
  // UTF-8 validation can access totalBytes
  if (fileBuffer[i] >= 0xf0 && fileBuffer[i] <= 0xf7 && i + 3 < totalBytes) {
    // ✅ At position 509: 509 + 3 < 515 → TRUE (can validate!)
  }
}

This minimal change preserves all existing binary detection behavior while fixing the boundary truncation issue.
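
To see the two ranges interact end to end, here is a simplified, self-contained sketch. It is not the library's implementation: countUnvalidatedLeads and isContinuation are hypothetical names, only 4-byte sequences are handled, and the reserve is passed as a parameter so the pre-fix behavior (reserve of 0) can be compared with the post-fix behavior (reserve of 3).

```typescript
// Sketch only: a simplified scan that separates the loop bound from the
// validation bound. It handles 4-byte sequences only, for brevity.
const MAX_BYTES = 512;
const UTF8_BOUNDARY_RESERVE = 3; // assumed from 515 - 512

// Hypothetical helper: true for a UTF-8 continuation byte (10xxxxxx).
function isContinuation(byte: number): boolean {
  return (byte & 0xc0) === 0x80;
}

// Hypothetical function: counts 4-byte lead bytes inside the scan range
// that cannot be validated as complete sequences.
function countUnvalidatedLeads(buffer: Uint8Array, reserve: number): number {
  const totalBytes = Math.min(buffer.length, MAX_BYTES + reserve); // validation range
  const scanBytes = Math.min(totalBytes, MAX_BYTES); // loop range
  let unvalidated = 0;

  for (let i = 0; i < scanBytes; i++) {
    const byte = buffer[i];
    if (byte < 0xf0 || byte > 0xf7) continue; // not a 4-byte lead
    const complete =
      i + 3 < totalBytes &&
      isContinuation(buffer[i + 1]) &&
      isContinuation(buffer[i + 2]) &&
      isContinuation(buffer[i + 3]);
    if (complete) {
      i += 3; // skip the validated continuation bytes
    } else {
      unvalidated++;
    }
  }
  return unvalidated;
}

// 509 ASCII bytes, then U+1F600 (F0 9F 98 80), padded with ASCII to 515 bytes.
const sample = new Uint8Array(515).fill(0x41);
sample.set([0xf0, 0x9f, 0x98, 0x80], 509);

const withoutReserve = countUnvalidatedLeads(sample, 0); // pre-fix bounds
const withReserve = countUnvalidatedLeads(sample, UTF8_BOUNDARY_RESERVE); // post-fix bounds
console.log(withoutReserve, withReserve); // 1 0
```

The loop still stops at index 511 in both calls; only the bound used by the sequence check changes.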

Tests

Added 7 Test Cases (All Text Files)

These files demonstrate various UTF-8 multibyte sequences at critical boundary positions:

  1. 508A-4byte.txt - 4-byte UTF-8 sequence starting at position 508
  2. 509A-3byte.txt - 3-byte UTF-8 sequence starting at position 509
  3. 509A-4byte.txt - 4-byte UTF-8 sequence starting at position 509
  4. 510A-2byte.txt - 2-byte UTF-8 sequence starting at position 510
  5. 510A-3byte.txt - 3-byte UTF-8 sequence starting at position 510
  6. 510A-4byte.txt - 4-byte UTF-8 sequence starting at position 510
  7. utf8-boundary-truncation_case.py - Real-world Python file with Chinese characters (3-byte UTF-8 at positions 510-512)
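
Which of the .txt fixtures cross the boundary can be worked out from their names. The sketch below assumes each file is N ASCII "A" bytes followed by a single character of the stated length; that layout and the specific characters chosen are assumptions based on the file names, and the actual fixtures may differ. makeFixture and crossesBoundary are hypothetical helpers.

```typescript
// Sketch only: builds boundary fixtures as N ASCII "A" bytes plus one character.
// The layout and the characters are assumptions inferred from the file names.
const MAX_BYTES = 512;

// Hypothetical helper: `prefixLength` bytes of "A" followed by `char` in UTF-8.
function makeFixture(prefixLength: number, char: string): Uint8Array {
  const encoded = new TextEncoder().encode(char);
  const bytes = new Uint8Array(prefixLength + encoded.length).fill(0x41);
  bytes.set(encoded, prefixLength);
  return bytes;
}

// True when the final character's bytes extend past the MAX_BYTES read limit.
function crossesBoundary(prefixLength: number, char: string): boolean {
  return makeFixture(prefixLength, char).length > MAX_BYTES;
}

const cases: Array<[string, number, string]> = [
  ["508A-4byte.txt", 508, "\u{1F600}"], // 4 bytes: F0 9F 98 80
  ["509A-3byte.txt", 509, "\u4E2D"], // 3 bytes: E4 B8 AD
  ["509A-4byte.txt", 509, "\u{1F600}"],
  ["510A-2byte.txt", 510, "\u00E9"], // 2 bytes: C3 A9
  ["510A-3byte.txt", 510, "\u4E2D"],
  ["510A-4byte.txt", 510, "\u{1F600}"],
];

// Names of the fixtures whose last character is split by a 512-byte read.
const crossing = cases
  .filter(([, prefix, char]) => crossesBoundary(prefix, char))
  .map(([name]) => name);

console.log(crossing);
```

Under these assumptions, the crossing fixtures are 509A-4byte.txt, 510A-3byte.txt, and 510A-4byte.txt; the other three end exactly at byte 512.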

Verification

All 7 files are valid text files containing UTF-8 sequences at the buffer boundary.

Current version (commit 272db64, before this fix):

  • ❌ 4 files incorrectly identified as BINARY:

    • 509A-4byte.txt
    • 510A-3byte.txt
    • 510A-4byte.txt
    • utf8-boundary-truncation_case.py
  • ✅ 3 files correctly identified as text

  • Reason: UTF-8 sequences at the boundary cannot be validated, so their bytes are flagged as suspicious. When suspiciousBytes > 1, the logic at 272db64:src/index.ts#L277-L279:

      if (suspiciousBytes > 1 && isBinaryProto(fileBuffer, scanBytes)) {
        return true;
      }

    marks the files as binary.

After this PR:

  • ✅ All 7 files correctly identified as TEXT
  • ✅ Total: 40/40 tests pass

Test Coverage

npm test
# Test Suites: 1 passed, 1 total
# Tests:       40 passed, 40 total

Related

Fix false positives where valid UTF-8 sequences were incorrectly
flagged as binary when truncated at the MAX_BYTES boundary.

Changes:
- Introduce scanBytes variable to separate scan range from validation range
- Read extra bytes (MAX_BYTES + UTF8_BOUNDARY_RESERVE) to capture
  complete sequences at boundary
- Maintain MAX_BYTES scan limit for binary detection logic
- Enable UTF-8 validation to access up to MAX_BYTES + UTF8_BOUNDARY_RESERVE

Tests:
- Add 7 boundary test cases including real-world Python file with
  Chinese characters (utf8-boundary-truncation_case.py)
- Covers 2/3/4-byte UTF-8 sequences at positions near MAX_BYTES boundary
- All 40 tests pass

Technical details:
- Minimal change preserving all existing UTF-8 detection logic
- scanBytes controls loop boundary and percentage calculations
- totalBytes allows validation of sequences crossing MAX_BYTES boundary
- Maintains backward compatibility and binary detection thresholds

This addresses the same issue as PR gjtorikian#90 but with a simpler, more
maintainable approach. If accepted, PR gjtorikian#90 will be closed.
@f14XuanLv
Contributor Author

Hi @gjtorikian 👋

This library is excellent! I've been using isBinaryFile in my projects and it works really well.

I encountered an issue with UTF-8 multibyte sequences being truncated at buffer boundaries, which could cause incorrect binary file detection. I've submitted this PR to fix it.

Would appreciate if you could review when you have time, so that I can officially update it into the projects I maintain.

Thanks!

@gjtorikian
Owner

looks great to me, thanks! I'll release it as a patch bump.

@gjtorikian gjtorikian merged commit a9d483b into gjtorikian:main Nov 11, 2025
3 checks passed
f14XuanLv added a commit to f14XuanLv/Roo-Code that referenced this pull request Nov 12, 2025
- Updates constraint from ^5.0.2 to ^5.0.7
- Fixes 'invalid array length' crash with UTF-8 files that should be recognized as text

Upstream library fix: gjtorikian/isBinaryFile#91