fix: handle UTF-8 multibyte sequences truncated at buffer boundary #91

f14XuanLv · 2025-11-08T16:34:38Z

Fix: UTF-8 Multibyte Sequences Truncated at Buffer Boundary 2

Problem

Valid UTF-8 text files were incorrectly identified as binary when multibyte UTF-8 sequences were truncated at the 512-byte (MAX_BYTES) buffer boundary.

Root Cause

The latest commit before this PR (272db64) guards each UTF-8 validation with a bounds check of the form i + N < totalBytes (lines 239, 244, 250). When a multibyte sequence starts near the buffer boundary (positions 509-511) and its continuation bytes fall at position 512 or later, the check fails because those bytes lie beyond totalBytes, so validation is skipped. The bytes of the valid sequence are then counted as suspicious, and the text file is misidentified as binary.

Example:

File: [509 bytes] + [0xF0  0x9F  0x98  0x80]  (4-byte emoji at positions 509-512)
                    └Position509-511┘  └Position512 (unreachable)

Original code check:
  if (i + 3 < totalBytes)  // 509 + 3 < 512? → 512 < 512? → FALSE ❌
  // Validation never happens, 0xF0 marked as suspicious
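The same arithmetic can be checked for every start position near the boundary. The sketch below is illustrative only: passesGuard is a hypothetical helper, not a function in isBinaryFile, and it models nothing but the `i + N < limit` guard, with the post-fix limit of 515 taken from this PR's description.

```typescript
const MAX_BYTES = 512;

// Hypothetical helper (not part of isBinaryFile): mirrors the `i + N < limit`
// guard, where N = seqLen - 1 is the number of continuation bytes.
function passesGuard(start: number, seqLen: number, limit: number): boolean {
  return start + (seqLen - 1) < limit;
}

// [start position, sequence length] for the six synthetic fixtures in this PR.
const cases: Array<[number, number]> = [
  [508, 4], [509, 3], [509, 4], [510, 2], [510, 3], [510, 4],
];

for (const [start, seqLen] of cases) {
  const before = passesGuard(start, seqLen, MAX_BYTES);    // guard limit 512
  const after = passesGuard(start, seqLen, MAX_BYTES + 3); // guard limit 515
  console.log(`start=${start} len=${seqLen} before=${before} after=${after}`);
}
```

Under these assumptions the guard fails before the fix for (509, 4), (510, 3) and (510, 4), which are the three .txt fixtures reported as binary at 272db64, and passes for all six once the limit is 515.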

Impact

This bug affects any text file in which a multibyte UTF-8 sequence (2-4 bytes) crosses the 512-byte boundary. Such files are common when the content includes Chinese, Japanese, Korean, emoji, or other non-ASCII Unicode characters.

Of the 7 test files added in this PR, 4 are incorrectly identified as binary by the current version (commit 272db64) because of this truncation issue.

Solution

Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the MAX_BYTES boundary.

Changes

- Introduce a scanBytes variable to separate the scan range from the validation range
- Increase the read buffer to MAX_BYTES + UTF8_BOUNDARY_RESERVE (515 bytes) to capture complete sequences
- Maintain the original scan logic by limiting the main loop to scanBytes (512 bytes)
- Enable cross-boundary validation by allowing UTF-8 checks to access up to totalBytes (515 bytes)
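A short sketch of why 3 extra bytes are enough. The value 3 for UTF8_BOUNDARY_RESERVE is inferred from the 512 and 515 figures above; MAX_UTF8_SEQUENCE_LENGTH is a name introduced here for illustration and is not taken from the library.

```typescript
const MAX_BYTES = 512;

// Assumption for illustration: the scanner treats lead bytes 0xF0-0xF7 as the
// longest case, i.e. one lead byte plus three continuation bytes.
const MAX_UTF8_SEQUENCE_LENGTH = 4;

// A lead byte may sit at the last scanned index, so only the continuation
// bytes need to fit in the reserve.
const UTF8_BOUNDARY_RESERVE = MAX_UTF8_SEQUENCE_LENGTH - 1;
const readSize = MAX_BYTES + UTF8_BOUNDARY_RESERVE;

// Worst case: a 4-byte sequence whose lead byte is at index 511.
const lastScannedIndex = MAX_BYTES - 1;
const lastContinuationIndex = lastScannedIndex + UTF8_BOUNDARY_RESERVE;

console.log(readSize);                         // 515
console.log(lastContinuationIndex < readSize); // true
```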

Implementation Details

The solution elegantly separates two concerns:

Scanning range (scanBytes): Maintains the original 512-byte limit for:
- Main loop iteration boundary
- Suspicious byte percentage calculations
- Binary detection thresholds
Validation range (totalBytes): Extends to 515 bytes for:
- UTF-8 continuation byte validation
- Complete sequence verification

How It Works

// New approach: Separate scan from validation
const totalBytes = Math.min(bytesRead, MAX_BYTES + UTF8_BOUNDARY_RESERVE); // Validate up to 515
const scanBytes = Math.min(totalBytes, MAX_BYTES);           // Loop up to 512

// Main loop uses scanBytes
for (let i = 0; i < scanBytes; i++) {
  // UTF-8 validation can access totalBytes
  if (fileBuffer[i] >= 0xf0 && fileBuffer[i] <= 0xf7 && i + 3 < totalBytes) {
    // ✅ At position 509: 509 + 3 < 515 → TRUE (can validate!)
  }
}

This minimal change preserves all existing binary detection behavior while fixing the boundary truncation issue.
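To see the effect end to end, here is a deliberately simplified, self-contained sketch of the scan/validate split. countSuspicious is a hypothetical stand-in for the library's loop, not its real code: it treats every ASCII byte as text and omits the percentage and protobuf checks, so it only demonstrates how the suspicious-byte count changes when the validation limit grows from 512 to 515.

```typescript
// Continuation bytes have the bit pattern 10xxxxxx.
function isContinuation(b: number): boolean {
  return b >= 0x80 && b <= 0xbf;
}

// Hypothetical, simplified scan loop: iterates up to scanLimit but lets
// UTF-8 validation look ahead up to validateLimit.
function countSuspicious(buf: Uint8Array, scanLimit: number, validateLimit: number): number {
  const totalBytes = Math.min(buf.length, validateLimit);
  const scanBytes = Math.min(totalBytes, scanLimit);
  let suspicious = 0;
  for (let i = 0; i < scanBytes; i++) {
    const b = buf[i];
    if (b < 0x80) continue; // simplification: all ASCII counts as text
    let extra = 0; // number of continuation bytes the lead byte announces
    if (b >= 0xc0 && b <= 0xdf) extra = 1;
    else if (b >= 0xe0 && b <= 0xef) extra = 2;
    else if (b >= 0xf0 && b <= 0xf7) extra = 3;
    if (extra > 0 && i + extra < totalBytes) {
      let valid = true;
      for (let k = 1; k <= extra; k++) {
        if (!isContinuation(buf[i + k])) valid = false;
      }
      if (valid) {
        i += extra; // skip the validated continuation bytes
        continue;
      }
    }
    suspicious++;
  }
  return suspicious;
}

// 509 ASCII bytes followed by the 4-byte emoji from the example above.
const sample = new Uint8Array(513).fill(0x41);
sample.set([0xf0, 0x9f, 0x98, 0x80], 509);

console.log(countSuspicious(sample, 512, 512)); // 3 (old limits)
console.log(countSuspicious(sample, 512, 515)); // 0 (scan 512, validate 515)
```

With the old limits, the lead byte at 509 and the two continuation bytes at 510 and 511 are each counted, which is enough to exceed the suspiciousBytes > 1 condition quoted in the Verification section.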

Tests

Added 7 Test Cases (All Text Files)

These files demonstrate various UTF-8 multibyte sequences at critical boundary positions:

508A-4byte.txt - 4-byte UTF-8 sequence starting at position 508
509A-3byte.txt - 3-byte UTF-8 sequence starting at position 509
509A-4byte.txt - 4-byte UTF-8 sequence starting at position 509
510A-2byte.txt - 2-byte UTF-8 sequence starting at position 510
510A-3byte.txt - 3-byte UTF-8 sequence starting at position 510
510A-4byte.txt - 4-byte UTF-8 sequence starting at position 510
utf8-boundary-truncation_case.py - Real-world Python file with Chinese characters (3-byte UTF-8 at positions 510-512)
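For readers who want to reproduce similar inputs, the sketch below builds buffers shaped like the synthetic fixtures. It assumes, from the file names alone, that a fixture such as 509A-4byte.txt is 509 ASCII "A" bytes followed by one multibyte sequence; the actual fixture contents are not shown in this PR, and makeBoundaryFixture and the chosen sample characters are illustrative.

```typescript
// Well-known UTF-8 encodings used as sample sequences.
const TWO_BYTE = [0xc3, 0xa9];               // U+00E9 (é)
const THREE_BYTE = [0xe4, 0xb8, 0xad];       // U+4E2D (中)
const FOUR_BYTE = [0xf0, 0x9f, 0x98, 0x80];  // U+1F600 (😀)

// Hypothetical helper: `prefixLength` ASCII 'A' bytes, then one sequence.
function makeBoundaryFixture(prefixLength: number, sequence: number[]): Uint8Array {
  const out = new Uint8Array(prefixLength + sequence.length).fill(0x41);
  out.set(sequence, prefixLength);
  return out;
}

const fixtures: Record<string, Uint8Array> = {
  "508A-4byte.txt": makeBoundaryFixture(508, FOUR_BYTE),
  "509A-3byte.txt": makeBoundaryFixture(509, THREE_BYTE),
  "509A-4byte.txt": makeBoundaryFixture(509, FOUR_BYTE),
  "510A-2byte.txt": makeBoundaryFixture(510, TWO_BYTE),
  "510A-3byte.txt": makeBoundaryFixture(510, THREE_BYTE),
  "510A-4byte.txt": makeBoundaryFixture(510, FOUR_BYTE),
};

for (const [name, bytes] of Object.entries(fixtures)) {
  console.log(`${name}: ${bytes.length} bytes`);
}
```

In Node, each buffer could then be written to disk with fs.writeFileSync(name, bytes) and passed to the library under test.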

Verification

All 7 files are valid text files containing UTF-8 sequences at the buffer boundary.

Current version (commit 272db64, before this fix):

❌ 4 files incorrectly identified as BINARY:
- 509A-4byte.txt
- 510A-3byte.txt
- 510A-4byte.txt
- utf8-boundary-truncation_case.py
✅ 3 files correctly identified as text
Reason: UTF-8 sequences at the boundary cannot be validated, so their bytes are counted as suspicious. Once suspiciousBytes > 1, the logic at 272db64:src/index.ts#L277-L279:
```
  if (suspiciousBytes > 1 && isBinaryProto(fileBuffer, scanBytes)) {
    return true;
  }
```
marks the files as binary when isBinaryProto also returns true.

After this PR:

✅ All 7 files correctly identified as TEXT
✅ Total: 40/40 tests pass

Test Coverage

npm test
# Test Suites: 1 passed, 1 total
# Tests:       40 passed, 40 total

Related

- Fixes the boundary truncation issue for UTF-8 multibyte sequences
- Related to similar boundary handling in protobuf detection (commit d39d2c0)
- Addresses the same issue as PR #90 (fix: handle UTF-8 multibyte sequences truncated at buffer boundary), but with a simpler, more maintainable approach
- PR #90 is currently set to draft; if this PR is accepted, I will close #90

Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the MAX_BYTES boundary.

Changes:
- Introduce scanBytes variable to separate scan range from validation range
- Read extra bytes (MAX_BYTES + UTF8_BOUNDARY_RESERVE) to capture complete sequences at boundary
- Maintain MAX_BYTES scan limit for binary detection logic
- Enable UTF-8 validation to access up to MAX_BYTES + UTF8_BOUNDARY_RESERVE

Tests:
- Add 7 boundary test cases including real-world Python file with Chinese characters (utf8-boundary-truncation_case.py)
- Covers 2/3/4-byte UTF-8 sequences at positions near MAX_BYTES boundary
- All 40 tests pass

Technical details:
- Minimal change preserving all existing UTF-8 detection logic
- scanBytes controls loop boundary and percentage calculations
- totalBytes allows validation of sequences crossing MAX_BYTES boundary
- Maintains backward compatibility and binary detection thresholds

This addresses the same issue as PR gjtorikian#90 but with a simpler, more maintainable approach. If accepted, PR gjtorikian#90 will be closed.

f14XuanLv · 2025-11-10T15:53:46Z

Hi @gjtorikian 👋

This library is excellent! I've been using isBinaryFile in my projects and it works really well.

I encountered an issue with UTF-8 multibyte sequences being truncated at buffer boundaries, which could cause incorrect binary file detection. I've submitted this PR to fix it.

I'd appreciate a review when you have time, so that I can officially roll the fix out to the projects I maintain.

Thanks!

gjtorikian · 2025-11-11T20:58:22Z

looks great to me, thanks! I'll release it as a patch bump.

- Updates constraint from ^5.0.2 to ^5.0.7
- Fixes 'invalid array length' crash with UTF-8 files that should be recognized as text

Upstream library fix: gjtorikian/isBinaryFile#91

f14XuanLv mentioned this pull request Nov 8, 2025

fix: handle UTF-8 multibyte sequences truncated at buffer boundary #90

Closed

gjtorikian merged commit a9d483b into gjtorikian:main Nov 11, 2025
3 checks passed

f14XuanLv mentioned this pull request Nov 12, 2025

fix: upgrade isbinaryfile from 5.0.4 to 5.0.7 RooCodeInc/Roo-Code#9192

Open

7 tasks
