@wardlican
Contributor

Why are the changes needed?

Amoro optimization can produce as many merged output files as there were input files. When that happens the merge makes no progress, so the merge task keeps being triggered again.

Close #3855

Brief change log

  1. Identify undersizedSegmentFiles that end up as a single file after bin-packing.
  2. If rewriting pos delete is not required for them, simply add them to rewriteDataFiles.
  3. During the second bin-packing pass, they can then be merged with fragment files or other files.

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@wardlican
Contributor Author

wardlican commented Oct 31, 2025

Please help review whether this repair plan is feasible.

Fix Results:

  • When the total size of the input files is less than targetSize, return Long.MAX_VALUE (existing logic).
  • When the total size of the input files is greater than or equal to targetSize but the average file size is less than targetSize, also return Long.MAX_VALUE (new logic).
  • Only return targetSize when the average file size is greater than or equal to targetSize.
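The three rules above can be sketched as a single decision function. This is a minimal illustration, not the code from the patch: the class `TargetSizeSketch`, the method `effectiveTargetSize`, and the parameter `inputFileCount` are hypothetical names; only `inputSize`, `targetSize`, and the `Long.MAX_VALUE` return value come from the discussion.

```java
public final class TargetSizeSketch {
    private TargetSizeSketch() {}

    /**
     * Hypothetical helper: chooses the size limit applied to the output
     * files of one merge task, following the three rules listed above.
     *
     * @param inputSize      total size of all input files, in bytes
     * @param inputFileCount number of input files (assumed parameter)
     * @param targetSize     configured target file size, in bytes
     */
    static long effectiveTargetSize(long inputSize, int inputFileCount, long targetSize) {
        if (inputFileCount <= 0) {
            throw new IllegalArgumentException("inputFileCount must be positive");
        }
        // Existing logic: everything fits in one target-sized file.
        if (inputSize < targetSize) {
            return Long.MAX_VALUE;
        }
        // New logic: total is large, but the files are small on average,
        // so no size limit is applied and they can still be merged.
        long averageFileSize = inputSize / inputFileCount;
        if (averageFileSize < targetSize) {
            return Long.MAX_VALUE;
        }
        // Files are already at least target-sized on average.
        return targetSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long target = 128 * mb;
        System.out.println(effectiveTargetSize(100 * mb, 4, target) == Long.MAX_VALUE);
        System.out.println(effectiveTargetSize(140 * mb, 2, target) == Long.MAX_VALUE);
        System.out.println(effectiveTargetSize(300 * mb, 2, target) == target);
    }
}
```

Note that with a positive file count the first check is implied by the second; it is kept here only to mirror the existing and new branches described in the comment.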

This ensures that:

  • Multiple small files can be merged even when their total size is greater than or equal to 128MB, as long as the average file size is less than 128MB.
  • A merge no longer produces the same number of output files as input files.
  • The file merging problem is solved at the task execution level.
  • The fix is complete; small files should now be merged correctly.
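A small worked example may help show why the file count can stay unchanged. The sketch below assumes a simplified writer model in which a new output file is started whenever the current one reaches the size limit; that model, the class `OutputCountSketch`, and the method `estimatedOutputFiles` are assumptions made for illustration and are not taken from the patch.

```java
public final class OutputCountSketch {
    private OutputCountSketch() {}

    /**
     * Hypothetical estimate of how many output files a merge produces when
     * the writer rolls to a new file at sizeLimit bytes (assumed model).
     * Computes ceil(inputSize / sizeLimit) without overflowing.
     */
    static long estimatedOutputFiles(long inputSize, long sizeLimit) {
        if (inputSize <= 0) {
            return 0;
        }
        if (sizeLimit <= 0) {
            throw new IllegalArgumentException("sizeLimit must be positive");
        }
        return inputSize / sizeLimit + (inputSize % sizeLimit == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 2 * 70 * mb; // two 70MB input files, 140MB in total

        // Size limit of 128MB: one full file plus a 12MB remainder.
        System.out.println(estimatedOutputFiles(total, 128 * mb));
        // No size limit (Long.MAX_VALUE): everything lands in one file.
        System.out.println(estimatedOutputFiles(total, Long.MAX_VALUE));
    }
}
```

Under this model, two 70MB inputs with a 128MB limit give two outputs again, which is the no-progress case. With `Long.MAX_VALUE` they give one output of 140MB, which is larger than the 128MB target.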

@xxubai xxubai requested a review from zhongqishang November 6, 2025 02:35
wardli added 2 commits November 10, 2025 19:13
…d output files having the same number of files, and this can cause the merge to fail and keep triggering the merge task. apache#3855
if (inputSize < targetSize) {
  return Long.MAX_VALUE;
}
// Even if total size >= targetSize, if average file size is small (less than targetSize),
Member

Will this output a single file that is too large?
