Intelligent encoding detection #4

peteretelej · 2025-07-19T12:44:36Z

We currently try to parse using a fixed list of popular encodings ("utf-8", "utf-8-sig", "cp1252", "latin-1"). This is limiting because diffchunk is designed to be a foundational tool: the current design restricts us to that small set of file encodings, which leads to bugs such as #1.
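To illustrate the limitation, here is a minimal sketch of a first-success fallback loop over that list. It assumes the old parser accepted the first encoding that decoded without raising (the actual code in `src/parser.py` may differ), and `legacy_decode` is a hypothetical name. A UTF-16 input does not fail cleanly: `cp1252` accepts the bytes and returns text interleaved with NUL characters, so no diff header is found in it.

```python
import codecs
from typing import Tuple

# The fixed list quoted in this PR description.
LEGACY_ENCODINGS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")


def legacy_decode(raw: bytes) -> Tuple[str, str]:
    """Hypothetical helper: return (encoding, text) for the first encoding
    in the list that decodes the bytes without raising."""
    for encoding in LEGACY_ENCODINGS:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no encoding in the list could decode the input")


# A one-line diff header stored as UTF-16 LE with a byte order mark.
utf16_diff = codecs.BOM_UTF16_LE + "diff --git a/x b/x\n".encode("utf-16-le")

encoding, text = legacy_decode(utf16_diff)
print(encoding)              # cp1252: utf-8 and utf-8-sig both reject the BOM bytes
print("diff --git" in text)  # False: every character is followed by a NUL
```

Because `cp1252` and `latin-1` accept almost any byte sequence, a loop like this tends to succeed with the wrong text rather than report a decoding failure.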

This change uses https://github.com/chardet/chardet to detect the encoding, which makes the parser more versatile. We only care about the text anyway, so we should offload encoding detection to a tool better suited to it.

This resolves #1.
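A minimal sketch of what detection with a fallback can look like, not the exact code in `src/parser.py`. `chardet.detect` is the library's real entry point and returns a dict with `encoding` and `confidence` keys; the function name `decode_diff_bytes`, the `min_confidence` threshold, and the optional import are assumptions made for illustration.

```python
try:
    import chardet  # third-party; this PR adds it to pyproject.toml
except ImportError:  # assumption: keep the sketch usable without chardet
    chardet = None

# Assumed fallback order, reusing the list the parser had before this PR.
FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")


def decode_diff_bytes(raw: bytes, min_confidence: float = 0.5) -> str:
    """Hypothetical helper: decode diff bytes using chardet's guess first,
    then fall back to a fixed list of encodings."""
    if chardet is not None:
        guess = chardet.detect(raw)
        encoding = guess.get("encoding")
        confidence = guess.get("confidence") or 0.0
        if encoding and confidence >= min_confidence:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                # Wrong guess, or a codec name Python does not know:
                # fall through to the fixed list.
                pass

    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue

    # latin-1 maps every byte, so this is only reached if the list changes.
    raise ValueError("could not decode diff content")


print(decode_diff_bytes(b"diff --git a/x b/x\n"), end="")
```

Detection is statistical, so a confident but wrong guess on a short input can still yield incorrect text. The confidence threshold is one way to limit that, and its value here is arbitrary.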

Copilot

Pull Request Overview

This PR replaces the hardcoded encoding fallback list with intelligent encoding detection using the chardet library to improve the robustness of diff file parsing across various encoding formats.

- Integrates the chardet library for automatic encoding detection instead of trying a fixed list of encodings
- Updates error messages to be more descriptive about parsing failures
- Refactors tests to focus on encoding support rather than Windows-specific issues

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/parser.py` | Replaces hardcoded encoding list with chardet-based detection and improved fallback handling |
| `src/chunker.py` | Updates error message to be more descriptive about empty diff files |
| `tests/test_encodings.py` | Adds comprehensive encoding tests including UTF-16 detection |
| `tests/test_windows_repro.py` | Removes Windows-specific test file |
| `pyproject.toml` | Adds chardet dependency |

codecov · 2025-07-19T12:46:19Z

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 94.29%. Comparing base (3ad3ee1) to head (65f9b18).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main       #4      +/-   ##
==========================================
+ Coverage   94.27%   94.29%   +0.02%     
==========================================
  Files           7        7              
  Lines         419      421       +2     
==========================================
+ Hits          395      397       +2     
  Misses         24       24

Intelligent encoding detection

ee6878a

peteretelej requested a review from Copilot July 19, 2025 12:44

Copilot AI reviewed Jul 19, 2025

peteretelej added 2 commits July 19, 2025 15:50

bump up version

ab5b6c0

Add codecov yaml config

65f9b18

peteretelej merged commit ee2bbd5 into main Jul 19, 2025
9 checks passed

peteretelej deleted the feat/encoding-support branch July 19, 2025 13:00

peteretelej mentioned this pull request Jul 20, 2025

[Bug]: windows: some diffs fail with "No valid diff content found" #1

Closed
