-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Refactor: add FileGroup structure for Vec<PartitionedFile>
#15379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /// The files in this group | ||
| pub files: Vec<PartitionedFile>, | ||
| /// Optional statistics for all files in the group | ||
| pub statistics: Option<Statistics>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can compute the statistics in the method of ListingTable
datafusion/datafusion/core/src/datasource/listing/table.rs
Lines 1086 to 1091 in 5210a2b
| async fn list_files_for_scan<'a>( | |
| &'a self, | |
| ctx: &'a dyn Session, | |
| filters: &'a [Expr], | |
| limit: Option<usize>, | |
| ) -> Result<(Vec<Vec<PartitionedFile>>, Statistics)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend:
- Remove the
pubfrom the fields (so we can potentially change the representation later) - Add a
into_inner()method that returns the inner Vec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done: 7415f7c
cda1c47 to
1dc5153
Compare
1dc5153 to
7df36ed
Compare
| ) -> Result<(Vec<PartitionedFile>, Statistics)> { | ||
| let mut result_files = vec![]; | ||
| ) -> Result<(FileGroup, Statistics)> { | ||
| let mut result_files = FileGroup::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the difference between the statistics in FileGroup and the returned value here? Do we need statistics in FileGroup, if so, could we add the returned Statistics inside FileGroup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return result_files.with_statistics(statistics)? I don't see any other place where with_statistics is called. The inner optional statistics in FileGroup is a bit confusing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 , yeah, it's a todo task, I'll compute the FileGroup's statistic in the method and use the with_statistics method to set statistics for FileGroup later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fyi: #15432
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963
I think encapsulating Vec<PartitionedFile> into FileGroup is a very nice idea and long in coming. However I do think it is an API change so I have maked this PR with that tag.
I also added this to the list of API changes that should have entries in the upgrade guide on
| const CONCURRENCY_LIMIT: usize = 100; | ||
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an API change -- can you perhaps leave the function here and mark it #deprecated per https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 7415f7c
| /// The files in this group | ||
| pub files: Vec<PartitionedFile>, | ||
| /// Optional statistics for all files in the group | ||
| pub statistics: Option<Statistics>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend:
- Remove the
pubfrom the fields (so we can potentially change the representation later) - Add a
into_inner()method that returns the inner Vec
| } | ||
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than taking &mut self here, I think this would be easier to use if it took mut self
As written it will leave self empty which is not super obvious. If it took mut self then it would be clear that the FileGroup is consumed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, makes a lot sense
| let files = iter.into_iter().collect(); | ||
| FileGroup::new(files) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend a From impl for the vec
| } | |
| } | |
| impl From<Vec<PartitionedFile>> for FileGroup { | |
| ... | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, added in 7415f7c
Now I have three ways to create FileGroup by Vec<PartitionedFile>.
let files = vec![...];
let file_group = FileGroup::new(files);
let file_group = files.into();
let file_group = FileGroup::from(files);I don't strongly prefer which method to use, but the second is simpler!
| } | ||
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than taking &mut self here, I think this would be easier to use if it took mut self
As written it will leave self empty which is not super obvious. If it took mut self then it would be clear that the FileGroup is consumed
FileGroup structureFileGroup structure for Vec<PartitionedFile>
Yes, I agree. Thank you for adding it. |
|
Thanks for your review. I'll merge it after @alamb has another look! |
|
|
||
| /// Partition the list of files into `n` groups | ||
| pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { | ||
| pub fn split_files(mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
|
Thanks @xudong963 @alamb |
…e#15379) * Refactor: add FileGroup structure * update * address comments * address comments from alamb * fix clippy
…e#15379) * Refactor: add FileGroup structure * update * address comments * address comments from alamb * fix clippy
Which issue does this PR close?
Rationale for this change
FileGroupencapsulates both files and their statistics, enabling more effective query planning and optimization based on file group characteristics. Moving from rawVec<PartitionedFile>to a dedicated structure creates a proper abstraction that bundles related functionality together.FileGroup's specific API in the future, currently, one that comes into my mind is part of thesplit_groups_by_statisticsmethod, it make any two files in a file group are ordered.file_groupsclearer, frompub file_groups: Vec<Vec<PartitionedFile>>topub file_groups: Vec<FileGroup>What changes are included in this PR?
Introduce
FileGroupand implement the related APIs, replace allVec<PartitionedFile>withFileGroupAre these changes tested?
Yes
Are there any user-facing changes?