Skip to content

Conversation

@f14XuanLv
Copy link
Contributor

@f14XuanLv f14XuanLv commented Nov 8, 2025

Fix: UTF-8 Multibyte Sequences Truncated at Buffer Boundary

Problem

Valid UTF-8 text files were incorrectly identified as binary when multibyte UTF-8 sequences were truncated at the 512-byte (MAX_BYTES) buffer boundary.

Root Cause

The implementation as of commit 272db64 (the latest version before this PR) performs boundary checks like i + N < totalBytes before validating UTF-8 sequences. When a multibyte sequence starts near the buffer boundary (positions 509-511), this check fails because the continuation bytes at positions 512+ are beyond totalBytes, preventing validation. As a result, valid UTF-8 sequences are flagged as suspicious bytes, causing text files to be misidentified as binary.

Example:

File: [509 bytes] + [0xF0 0x9F 0x98 0x80]  (4-byte emoji at positions 509-512)
                     └─ Position 509 ─┘    └─ Position 512 (unreachable)

Original code check:
  if (i + 3 < totalBytes)  // 509 + 3 < 512? → 512 < 512? → FALSE ❌
  // Validation never happens, 0xF0 marked as suspicious

Impact

This bug affects any text file containing:

  • Multibyte UTF-8 sequences (2-4 bytes) that cross the 512-byte boundary
  • Common in files with: Chinese, Japanese, Korean, emoji, or other non-ASCII Unicode characters

The 7 test files added in this PR had 4 incorrectly identified as binary in the current version (as of commit 272db64) due to this truncation issue.

Solution

Fix false positives where valid UTF-8 sequences were incorrectly flagged as binary when truncated at the MAX_BYTES boundary.

Changes

  • Read extra bytes (READ_BUFFER_SIZE = MAX_BYTES + MAX_UTF8_BYTE_LENGTH - 1 = 515) to validate sequences that start near MAX_BYTES boundary
  • Add type-safe configuration (ValidUTF8Length = 1|2|3|4) to enforce valid UTF-8 sequence lengths at compile time
  • Refactor UTF-8 detection to generic boundary-aware validation that separates detection (sequence length) from validation (continuation bytes)
  • Fix failed sequence handling to match original behavior (skip first continuation byte to avoid double-counting suspicious bytes)

Technical Details

  • Separates detection from validation to eliminate circular dependency (need to know sequence length before setting boundary parameters)
  • Uses bytesRead for validation range while maintaining totalBytes for scan range (scan 0-511, but can validate up to byte 514)
  • Maintains translation invariance: changing MAX_BYTES or MAX_UTF8_BYTE_LENGTH automatically adjusts all logic without code changes

How It Works

// Original: Hard-coded boundary check
if (i + 3 < totalBytes) { /* validate 4-byte sequence */ }
// ❌ Fails at boundary: 509 + 3 < 512 → FALSE

// Fixed: Dynamic boundary-aware validation
expectedLength = 4;
maxCheckLimit = Math.min(i + expectedLength, bytesRead);
if (maxCheckLimit >= i + expectedLength) { /* validate */ }
// ✅ Works at boundary: min(509+4, 515) >= 513 → TRUE

Tests

Added 7 Test Cases (All Text Files)

These files demonstrate various UTF-8 multibyte sequences at critical boundary positions:

  1. 508A-4byte.txt - 4-byte UTF-8 sequence starting at position 508
  2. 509A-3byte.txt - 3-byte UTF-8 sequence starting at position 509
  3. 509A-4byte.txt - 4-byte UTF-8 sequence starting at position 509
  4. 510A-2byte.txt - 2-byte UTF-8 sequence starting at position 510
  5. 510A-3byte.txt - 3-byte UTF-8 sequence starting at position 510
  6. 510A-4byte.txt - 4-byte UTF-8 sequence starting at position 510
  7. utf8-boundary-truncation_case.py - Real-world Python file with Chinese characters (3-byte UTF-8 at positions 510-512)

Verification

All 7 files are valid text files containing UTF-8 sequences at the buffer boundary.

Current version (commit 272db64, before this fix):

  • ❌ 4 files incorrectly identified as BINARY:

    • 509A-4byte.txt
    • 510A-3byte.txt
    • 510A-4byte.txt
    • utf8-boundary-truncation_case.py
  • ✅ 3 files correctly identified as text

  • Reason: UTF-8 sequences at boundary cannot be validated, flagged as suspicious bytes. When suspicious bytes > 1, the src/index.ts line 277~279

      if (suspiciousBytes > 1 && isBinaryProto(fileBuffer, scanBytes)) {
        return true;
      }
    

    marks them as binary

After this PR:

  • ✅ All 7 files correctly identified as TEXT
  • ✅ Total: 40/40 tests pass

Test Coverage

npm test
# Test Suites: 1 passed, 1 total
# Tests:       40 passed, 40 total

Related

  • Fixes boundary truncation issue for UTF-8 multibyte sequences
  • Related to similar boundary handling in protobuf detection (commit d39d2c0)

Fix false positives where valid UTF-8 sequences were incorrectly
flagged as binary when truncated at the MAX_BYTES boundary.

Changes:
- Read extra bytes (READ_BUFFER_SIZE = MAX_BYTES + MAX_UTF8_BYTE_LENGTH - 1)
  to validate sequences that start near MAX_BYTES boundary
- Add type-safe configuration (ValidUTF8Length = 1|2|3|4)
- Refactor UTF-8 detection to generic boundary-aware validation
- Fix failed sequence handling to match original behavior (skip
  first continuation byte to avoid double-counting)

Tests:
- Add 6 boundary test cases (508A-4byte, 509A-3byte, 509A-4byte,
  510A-2byte, 510A-3byte, 510A-4byte)
- All 39 tests pass

Technical details:
- Separates detection (sequence length) from validation (continuation
  bytes) to eliminate circular dependency
- Uses bytesRead for validation range while maintaining totalBytes
  for scan range
- Maintains translation invariance: changing MAX_BYTES or
  MAX_UTF8_BYTE_LENGTH automatically adjusts all logic
Add utf8-boundary-truncation_case.py as a regression test for the
boundary truncation bug. This Python file contains Chinese characters
with a 3-byte UTF-8 sequence (0xE5 0xAF 0xB8) at positions 510-512,
which triggers the original bug where sequences truncated at the
512-byte boundary were incorrectly flagged as binary.

The file demonstrates a real-world scenario where valid text files
were misidentified due to UTF-8 multibyte sequences crossing the
buffer boundary.

Test: Expect false (text file, not binary)
@f14XuanLv f14XuanLv marked this pull request as draft November 8, 2025 15:31
f14XuanLv added a commit to f14XuanLv/isBinaryFile that referenced this pull request Nov 8, 2025
Fix false positives where valid UTF-8 sequences were incorrectly
flagged as binary when truncated at the MAX_BYTES boundary.

Changes:
- Introduce scanBytes variable to separate scan range from validation range
- Read extra bytes (MAX_BYTES + UTF8_BOUNDARY_RESERVE) to capture
  complete sequences at boundary
- Maintain MAX_BYTES scan limit for binary detection logic
- Enable UTF-8 validation to access up to MAX_BYTES + UTF8_BOUNDARY_RESERVE

Tests:
- Add 7 boundary test cases including real-world Python file with
  Chinese characters (utf8-boundary-truncation_case.py)
- Covers 2/3/4-byte UTF-8 sequences at positions near MAX_BYTES boundary
- All 40 tests pass

Technical details:
- Minimal change preserving all existing UTF-8 detection logic
- scanBytes controls loop boundary and percentage calculations
- totalBytes allows validation of sequences crossing MAX_BYTES boundary
- Maintains backward compatibility and binary detection thresholds

This addresses the same issue as PR gjtorikian#90 but with a simpler, more
maintainable approach. If accepted, PR gjtorikian#90 will be closed.
@f14XuanLv
Copy link
Contributor Author

Update: Superseded by #91

After further refinement, I've created PR #91 which solves the same UTF-8 boundary truncation issue with a simpler, more maintainable approach.

Key Improvements in #91:

  • Minimal code changes (+91/-9 vs +131/-28 in this PR)
  • Cleaner solution using scanBytes variable to separate scan range from validation range

If #91 gets merged, I will close this PR. Both PRs fix the same bug with identical test coverage, but #91 achieves it with less complexity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant