@peteretelej peteretelej commented Jul 19, 2025

We currently try to decode diff files using a fixed list of popular encodings ("utf-8", "utf-8-sig", "cp1252", "latin-1"). This is very limiting, since diffchunk is designed to be a foundational tool: the current design restricts us to a small set of file encodings, which leads to bugs such as #1.
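
To illustrate why a fixed list is fragile, here is a minimal sketch of that kind of fallback loop. The helper name `decode_with_fixed_list` is hypothetical and this is not the project's actual code. The UTF-16 input is an assumption about one plausible failure mode, not the confirmed cause of #1: "cp1252" accepts the bytes without error, so the loop stops early and returns text in which no diff header can be found.

```python
import codecs

# The encodings named in this PR description, tried in order.
FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")


def decode_with_fixed_list(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes without error.

    Hypothetical helper for illustration only.
    """
    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Unreachable in practice: latin-1 maps every byte to a character.
    raise ValueError("could not decode input")


# Assumed input: a diff saved as UTF-16 LE with a byte order mark.
utf16_diff = codecs.BOM_UTF16_LE + "diff --git a/x b/x\n".encode("utf-16-le")
text, used = decode_with_fixed_list(utf16_diff)

# A single-byte codec "succeeds", but the text has NUL bytes between letters,
# so a search for the diff header fails.
print(used, "diff --git" in text)
```

A decode that succeeds is not necessarily a correct decode, which is the gap that encoding detection is meant to close.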

This change uses https://github.com/chardet/chardet to detect the encoding automatically, which makes the parser far more versatile. We only care about the text anyway, so we should offload that work to a tool better suited to it.
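
A rough sketch of detection-based decoding, under stated assumptions: `decode_diff_bytes` and its `default` parameter are hypothetical names, and the real `src/parser.py` may be structured differently. The only chardet call used is `chardet.detect`, which returns a dict whose `"encoding"` entry can be `None`. The `ImportError` guard exists only so the sketch runs where chardet is not installed; the PR itself adds chardet as a regular dependency.

```python
try:
    import chardet  # third-party dependency added by this PR
except ImportError:  # guard only so this sketch runs without chardet
    chardet = None


def decode_diff_bytes(raw: bytes, default: str = "utf-8") -> str:
    """Decode raw diff bytes, preferring chardet's guess (hypothetical helper)."""
    detected = None
    if chardet is not None and raw:
        # chardet.detect returns a dict with "encoding" and "confidence" keys;
        # "encoding" may be None when no guess can be made.
        detected = chardet.detect(raw)["encoding"]
    if detected:
        try:
            return raw.decode(detected)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong guess or unknown codec name: fall through
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode(default, errors="replace")


print(decode_diff_bytes(b"diff --git a/x b/x\n"), end="")
```

Falling back to a lossy decode rather than raising is a design choice made for this sketch; a stricter parser might prefer to report the failure.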

This resolves #1.

@peteretelej peteretelej requested a review from Copilot July 19, 2025 12:44

@Copilot Copilot AI left a comment

Pull Request Overview

This PR replaces the hardcoded encoding fallback list with intelligent encoding detection using the chardet library to improve the robustness of diff file parsing across various encoding formats.

  • Integrates chardet library for automatic encoding detection instead of trying a fixed list of encodings
  • Updates error messages to be more descriptive about parsing failures
  • Refactors tests to focus on encoding support rather than Windows-specific issues

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

File                          Description
src/parser.py                 Replaces hardcoded encoding list with chardet-based detection and improved fallback handling
src/chunker.py                Updates error message to be more descriptive about empty diff files
tests/test_encodings.py       Adds comprehensive encoding tests including UTF-16 detection
tests/test_windows_repro.py   Removes Windows-specific test file
pyproject.toml                Adds chardet dependency

codecov bot commented Jul 19, 2025

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 94.29%. Comparing base (3ad3ee1) to head (65f9b18).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #4      +/-   ##
==========================================
+ Coverage   94.27%   94.29%   +0.02%     
==========================================
  Files           7        7              
  Lines         419      421       +2     
==========================================
+ Hits          395      397       +2     
  Misses         24       24              

@peteretelej peteretelej merged commit ee2bbd5 into main Jul 19, 2025
9 checks passed
@peteretelej peteretelej deleted the feat/encoding-support branch July 19, 2025 13:00
Development

Successfully merging this pull request may close these issues.

[Bug]: windows: some diffs fail with "No valid diff content found"