-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Use any instead of for_each
#15289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use any instead of for_each
#15289
Conversation
berkaysynnada
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of either "try for all" or "skip at all", isn't better to only go over the columns which has statistics.is_some() ?
Do you mean the following? // Instead of a single boolean flag
let columns_with_statistics: Vec<_> = row_group_meta.columns().iter()
.enumerate()
.filter_map(|(idx, column)| {
if column.statistics().is_some() {
Some(idx)
} else {
None
}
})
.collect();
// Then later, when processing column statistics
for (col_idx, field) in table_schema.fields().iter().enumerate() {
if columns_with_statistics.contains(&col_idx) {
// Process only columns that have statistics
// Use the statistics converter and update min/max accumulators
}
} |
yep, even withing a single iteration after zipping |
any instead of for_eachstatistics_from_parquet_meta_calc
|
I have some new thoughts for the PR, still not ready to review. |
In fact, even if a column in |
statistics_from_parquet_meta_calcany instead of for_each
| total_byte_size += row_group_meta.total_byte_size() as usize; | ||
|
|
||
| if !has_statistics { | ||
| row_group_meta.columns().iter().for_each(|column| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key issue is how has_statistics is assigned within the for_each loop. During iteration, the value of has_statistics is overwritten by each column's statistics status.
Example Scenario
Assume there are 2 columns with the following statistics status:
- Column 1: Has statistics (statistics is Some)
- Column 2: No statistics (statistics is None)
Execution process:
- Processing Column 1: has_statistics becomes true
- Processing Column 2: has_statistics becomes false
if the last column processed has no statistics, the final value of has_statistics would be false.
if the goal is to check "whether any column has statistics", a more suitable approach would be to use any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I wasn't oppose to your change. I was just thinking a further refactor of unifying these two iterations:
row_group_meta.columns().iter().for_each(|column| {
has_statistics = column.statistics().is_some();
});and
table_schema
.fields()
.iter()
.enumerate()
.for_each(|(idx, field)| {
...but maybe not worth to try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because even if a column in table_schema doesn't have the corresponding statistics in row_group_meta, it also may need to go summarize_min_max_null_counts to set statistics with null., I think it isn't worth trying.
In fact, I have tried it, and it doesn't make much sense, they (row_group_meta.columns() and table_schema.fields())are not always aligned.
|
Thanks for your review @berkaysynnada @alamb |
Which issue does this PR close?
Rationale for this change
The statistics are calculated only for columns in the table schema that actually have statistics available in the parquet metadata, avoiding unnecessary work.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?