@imertz
Contributor

@imertz imertz commented Apr 29, 2025

Pull Request: Fix XML validation error in OutputGitRepoXML function

Description

This PR fixes the XML validation error that occurs when using the -x flag to export repositories as XML. The specific error message was:

Error: XML validation error on line 494: invalid character entity & (no semicolon)

Changes Made

  • Improved the OutputGitRepoXML function in prompt/prompt.go to properly handle XML special characters
  • Added proper escaping of XML special characters (&, <, >, ", ') in file paths
  • Implemented safe handling of CDATA sections with protection against premature termination
  • Fixed XML formatting with consistent indentation and structure
  • Simplified token placeholder replacement without breaking formatting
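For illustration, a minimal sketch of escaping a file path before writing it into XML. It uses the standard library's `xml.EscapeText`; the helper name `escapeXMLText` is hypothetical and is not taken from prompt/prompt.go, whose actual implementation may differ.

```go
package main

import (
	"bytes"
	"encoding/xml"
)

// escapeXMLText returns s with the XML special characters
// (&, <, >, ", ') replaced by entity or character references.
// Hypothetical helper, shown only to illustrate the idea.
func escapeXMLText(s string) string {
	var buf bytes.Buffer
	// EscapeText only returns an error if the writer fails,
	// and writes to a bytes.Buffer do not fail.
	_ = xml.EscapeText(&buf, []byte(s))
	return buf.String()
}
```

Calling `escapeXMLText("a&b<c>.go")` yields `a&amp;b&lt;c&gt;.go`. Go emits numeric references for quotes (`&#34;` and `&#39;`), which are equally valid XML.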

Testing Done

I tested this fix by:

  1. Running git2gpt -x -s -e -o output.xml . on repositories containing files with special XML characters
  2. Verifying the generated XML passes validation
  3. Testing with files containing potential CDATA terminators (]]>) to ensure they're properly escaped

Related Issue

Fixes #15

Before/After

Before: The tool fails with XML validation errors when files contain special characters

After: The tool successfully generates valid XML regardless of file content

imertz added 3 commits April 29, 2025 11:08
- Fixed XML generation to properly handle special characters and CDATA sections
- Added protection against premature CDATA termination by escaping "]]>" sequences
- Improved XML formatting with consistent indentation and structure
- Simplified token placeholder replacement without breaking formatting

This commit adds support for a .gptinclude file, which allows users to
explicitly specify which files should be included in the repository export.
The feature complements the existing .gptignore functionality:

- When both .gptinclude and .gptignore exist, files are first filtered
  by the include patterns, then any matching ignore patterns are excluded
- Added new command-line flag: -I/--include to specify a custom path
  to the .gptinclude file
- Default behavior looks for .gptinclude in repository root
- Added comprehensive tests for the new functionality
- Updated README.md with documentation and examples
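The include-then-ignore order described above can be sketched as follows. This is an assumption-laden illustration: it represents patterns as plain strings matched with the standard library's `path.Match` (where `*` does not cross `/`), and it treats an empty include list as "include everything". The project's real `shouldProcess` may use different types and matching rules; `matchesAny` is a hypothetical helper.

```go
package main

import "path"

// matchesAny reports whether relPath matches at least one pattern.
// Malformed patterns are treated as non-matching.
func matchesAny(relPath string, patterns []string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, relPath); err == nil && ok {
			return true
		}
	}
	return false
}

// shouldProcess applies the include patterns first, then the ignore patterns.
// Assumption: an empty include list means every file is a candidate.
func shouldProcess(relPath string, includeList, ignoreList []string) bool {
	if len(includeList) > 0 && !matchesAny(relPath, includeList) {
		return false
	}
	return !matchesAny(relPath, ignoreList)
}
```

With include pattern `cmd/*.go` and ignore pattern `cmd/*_test.go`, `cmd/root.go` is processed while `cmd/root_test.go` and `README.md` are skipped.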

With this change, users gain more fine-grained control over which parts
of their repositories are processed by git2gpt, making it easier to focus
on specific areas when working with AI language models.

This commit fixes an issue where the XML export would fail with
"unexpected EOF in CDATA section" errors when file content contained
the CDATA end marker sequence ']]>'.

The fix implements a proper CDATA handling strategy that:
- Detects all occurrences of ']]>' in file content
- Splits the content around these markers
- Emits multiple adjacent CDATA sections (CDATA sections cannot be nested) to preserve the original content
- Ensures all XML output is well-formed regardless of source content

This approach maintains the efficiency of CDATA for storing large code
blocks while ensuring compatibility with all possible file content.
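A minimal sketch of the split-and-rejoin strategy described above, assuming the content is available as a string. The function name `cdata` is hypothetical and the code in prompt/prompt.go may be structured differently. Each `]]>` is divided so that `]]` ends one CDATA section and `>` begins the next.

```go
package main

import "strings"

// cdata wraps content in one or more adjacent CDATA sections so that
// no section contains the terminator "]]>".
func cdata(content string) string {
	var sb strings.Builder
	sb.WriteString("<![CDATA[")
	for i, part := range strings.Split(content, "]]>") {
		if i > 0 {
			// Emit "]]", close the current section, reopen, then emit ">".
			sb.WriteString("]]]]><![CDATA[>")
		}
		sb.WriteString(part)
	}
	sb.WriteString("]]>")
	return sb.String()
}
```

For example, `cdata("a]]>b")` produces `<![CDATA[a]]]]><![CDATA[>b]]>`, which a parser reads back as `a]]>b`.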

Fixes the XML validation error that would occur when processing files
containing CDATA end marker sequences.
@chand1012
Owner

I'm not gonna be super picky, but your commits on this PR and #17 are the same, so this one will close #15 and add the new feature.

Usually what I do on forks is make a separate branch on my fork and open the PR from that new branch for each feature or fix.

Thanks for the contribution!

@chand1012 chand1012 requested a review from Copilot April 29, 2025 13:20

Copilot AI left a comment

Pull Request Overview

This PR fixes XML validation errors and improves file filtering by adding support for an include list alongside the existing ignore list. Key changes include:

  • Enhancing the OutputGitRepoXML function to safely handle CDATA sections.
  • Introducing functions to generate and process a .gptinclude file alongside .gptignore.
  • Updating command-line flags and function signatures in ProcessGitRepo and processRepository.

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Files reviewed (4)
  • prompt/prompt.go: Updated XML output handling, added include list functions, and adjusted repo processing logic.
  • prompt/gptinclude_test.go: Added test cases for the new include filtering functionality.
  • cmd/root.go: Added a new flag for the include file and updated ProcessGitRepo invocation.
  • README.md: Updated documentation to cover include file usage along with ignore file usage.
Files not reviewed (1)
  • .gptinclude: Language not supported
Comments suppressed due to low confidence (2)

prompt/prompt.go:324

  • [nitpick] Consider renaming variable 'process' to 'shouldProcess' for improved readability and clarity.
process := shouldProcess(relativeFilePath, includeList, ignoreList)

prompt/prompt.go:135

  • Consider handling the error returned from getIgnoreList instead of ignoring it, to ensure any issues with reading the ignore file are properly reported.
ignoreList, _ = getIgnoreList(ignoreFilePath)


// Split content around CDATA end marker (]]>) and create multiple CDATA sections
contents := file.Contents
result.WriteString(" <contents>")

Copilot AI Apr 29, 2025

Ensure that the algorithm splitting file.Contents into multiple CDATA sections properly handles consecutive ']]>' sequences to avoid malformed XML.
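One way to check this concern is a round-trip test: wrap the content, parse it back with `encoding/xml`, and compare. The sketch below is illustrative only; `wrapCDATA`, `roundTrips`, and `contentsElem` are hypothetical names, and the wrapping shown is an assumed stand-in for whatever OutputGitRepoXML actually does.

```go
package main

import (
	"encoding/xml"
	"strings"
)

// contentsElem collects all character data of the root element,
// including text from several adjacent CDATA sections.
type contentsElem struct {
	Text string `xml:",chardata"`
}

// wrapCDATA splits every "]]>" across two CDATA sections.
func wrapCDATA(s string) string {
	return "<![CDATA[" + strings.ReplaceAll(s, "]]>", "]]]]><![CDATA[>") + "]]>"
}

// roundTrips reports whether s survives wrapping and re-parsing unchanged.
func roundTrips(s string) bool {
	doc := "<contents>" + wrapCDATA(s) + "</contents>"
	var got contentsElem
	if err := xml.Unmarshal([]byte(doc), &got); err != nil {
		return false
	}
	return got.Text == s
}
```

Inputs worth exercising include consecutive terminators such as `a]]>]]>b`, a lone `]]>`, and content that ends in `]]`.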

@imertz
Contributor Author

imertz commented Apr 29, 2025

You are right. I messed up with the branches. Thank you again for this amazing project!

@chand1012 chand1012 merged commit 7555624 into chand1012:main Apr 29, 2025
2 checks passed

Development

Successfully merging this pull request may close these issues.

XML validation error when using the -x flag due to unescaped special characters

2 participants