Use `any` instead of `for_each` #15289

xudong963 · 2025-03-18T06:30:21Z

Which issue does this PR close?

Closes #.

Rationale for this change

The statistics are calculated only for columns in the table schema that actually have statistics available in the parquet metadata, avoiding unnecessary work.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

berkaysynnada

Instead of either "try for all" or "skip at all", isn't better to only go over the columns which has statistics.is_some() ?

xudong963 · 2025-03-18T08:10:16Z

Instead of either "try for all" or "skip at all", isn't better to only go over the columns which has statistics.is_some() ?

Do you mean the following?

// Instead of a single boolean flag
let columns_with_statistics: Vec<_> = row_group_meta.columns().iter()
    .enumerate()
    .filter_map(|(idx, column)| {
        if column.statistics().is_some() {
            Some(idx)
        } else {
            None
        }
    })
    .collect();

// Then later, when processing column statistics
for (col_idx, field) in table_schema.fields().iter().enumerate() {
    if columns_with_statistics.contains(&col_idx) {
        // Process only columns that have statistics
        // Use the statistics converter and update min/max accumulators
    } 
}

berkaysynnada · 2025-03-18T08:22:10Z

Instead of either "try for all" or "skip at all", isn't better to only go over the columns which has statistics.is_some() ?

Do you mean the following?

yep, even withing a single iteration after zipping

datafusion/datasource-parquet/src/file_format.rs

xudong963 · 2025-03-19T16:56:23Z

I have some new thoughts for the PR, still not ready to review.

xudong963 · 2025-03-21T09:31:52Z

Instead of either "try for all" or "skip at all", isn't better to only go over the columns which has statistics.is_some() ?

In fact, even if a column in table_schema doesn't have the corresponding statistics in row_group_meta, it also may need to go summarize_min_max_null_counts to set statistics with null.

xudong963 · 2025-03-21T09:38:19Z

datafusion/datasource-parquet/src/file_format.rs

        total_byte_size += row_group_meta.total_byte_size() as usize;

        if !has_statistics {
-            row_group_meta.columns().iter().for_each(|column| {


The key issue is how has_statistics is assigned within the for_each loop. During iteration, the value of has_statistics is overwritten by each column's statistics status.

Example Scenario

Assume there are 2 columns with the following statistics status:

Column 1: Has statistics (statistics is Some)

Column 2: No statistics (statistics is None)

Execution process:

Processing Column 1: has_statistics becomes true

Processing Column 2: has_statistics becomes false

if the last column processed has no statistics, the final value of has_statistics would be false.

if the goal is to check "whether any column has statistics", a more suitable approach would be to use any

You are right, I wasn't oppose to your change. I was just thinking a further refactor of unifying these two iterations:

row_group_meta.columns().iter().for_each(|column| { has_statistics = column.statistics().is_some(); });

and

table_schema .fields() .iter() .enumerate() .for_each(|(idx, field)| { ...

but maybe not worth to try

Yes, because even if a column in table_schema doesn't have the corresponding statistics in row_group_meta, it also may need to go summarize_min_max_null_counts to set statistics with null., I think it isn't worth trying.

In fact, I have tried it, and it doesn't make much sense, they (row_group_meta.columns() and table_schema.fields())are not always aligned.

xudong963 · 2025-03-21T17:02:35Z

Thanks for your review @berkaysynnada @alamb

berkaysynnada reviewed Mar 18, 2025

View reviewed changes

xudong963 changed the title ~~Use any instead of for_each~~ Improve the calculation of statistics in statistics_from_parquet_meta_calc Mar 18, 2025

xudong963 marked this pull request as draft March 18, 2025 09:56

xudong963 force-pushed the refactor_2 branch from b5269f5 to d626c85 Compare March 19, 2025 09:26

github-actions bot added the functions Changes to functions implementation label Mar 19, 2025

xudong963 commented Mar 19, 2025

View reviewed changes

datafusion/datasource-parquet/src/file_format.rs Outdated Show resolved Hide resolved

xudong963 marked this pull request as ready for review March 19, 2025 09:32

xudong963 marked this pull request as draft March 19, 2025 16:55

xudong963 mentioned this pull request Mar 20, 2025

Add doc for the statistics_from_parquet_meta_calc method #15330

Merged

xudong963 force-pushed the refactor_2 branch from d626c85 to 447b478 Compare March 21, 2025 09:25

github-actions bot removed the functions Changes to functions implementation label Mar 21, 2025

Use any instead of for_each

e1b5032

xudong963 force-pushed the refactor_2 branch from 447b478 to e1b5032 Compare March 21, 2025 09:26

xudong963 changed the title ~~Improve the calculation of statistics in statistics_from_parquet_meta_calc~~ Use any instead of for_each Mar 21, 2025

xudong963 commented Mar 21, 2025

View reviewed changes

xudong963 marked this pull request as ready for review March 21, 2025 09:38

xudong963 requested a review from alamb March 21, 2025 09:39

berkaysynnada approved these changes Mar 21, 2025

View reviewed changes

alamb approved these changes Mar 21, 2025

View reviewed changes

xudong963 merged commit 7c902de into apache:main Mar 21, 2025
28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use `any` instead of `for_each` #15289

Use `any` instead of `for_each` #15289

Uh oh!

xudong963 commented Mar 18, 2025 •

edited

Loading

Uh oh!

berkaysynnada left a comment

Uh oh!

xudong963 commented Mar 18, 2025

Uh oh!

berkaysynnada commented Mar 18, 2025

Uh oh!

Uh oh!

xudong963 commented Mar 19, 2025

Uh oh!

xudong963 commented Mar 21, 2025

Uh oh!

xudong963 Mar 21, 2025

Uh oh!

berkaysynnada Mar 21, 2025

Uh oh!

xudong963 Mar 21, 2025 •

edited

Loading

Uh oh!

xudong963 commented Mar 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Use any instead of for_each #15289

Use any instead of for_each #15289

Uh oh!

Conversation

xudong963 commented Mar 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

berkaysynnada left a comment

Choose a reason for hiding this comment

Uh oh!

xudong963 commented Mar 18, 2025

Uh oh!

berkaysynnada commented Mar 18, 2025

Uh oh!

Uh oh!

xudong963 commented Mar 19, 2025

Uh oh!

xudong963 commented Mar 21, 2025

Uh oh!

xudong963 Mar 21, 2025

Choose a reason for hiding this comment

Example Scenario

Uh oh!

berkaysynnada Mar 21, 2025

Choose a reason for hiding this comment

Uh oh!

xudong963 Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xudong963 commented Mar 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Use `any` instead of `for_each` #15289

Use `any` instead of `for_each` #15289

xudong963 commented Mar 18, 2025 •

edited

Loading

xudong963 Mar 21, 2025 •

edited

Loading