Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions datafusion/datasource-parquet/src/file_format.rs
Original file line number Diff line number Diff line change
Expand Up @@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc(
total_byte_size += row_group_meta.total_byte_size() as usize;

if !has_statistics {
row_group_meta.columns().iter().for_each(|column| {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key issue is how has_statistics is assigned within the for_each loop. During iteration, the value of has_statistics is overwritten by each column's statistics status.

Example Scenario

Assume there are 2 columns with the following statistics status:

  • Column 1: Has statistics (statistics is Some)
  • Column 2: No statistics (statistics is None)

Execution process:

  • Processing Column 1: has_statistics becomes true
  • Processing Column 2: has_statistics becomes false

if the last column processed has no statistics, the final value of has_statistics would be false.

if the goal is to check "whether any column has statistics", a more suitable approach would be to use any

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I wasn't oppose to your change. I was just thinking a further refactor of unifying these two iterations:

  row_group_meta.columns().iter().for_each(|column| {
               has_statistics = column.statistics().is_some();
           });

and

      table_schema
          .fields()
          .iter()
          .enumerate()
          .for_each(|(idx, field)| {
          ...

but maybe not worth to try

Copy link
Member Author

@xudong963 xudong963 Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because even if a column in table_schema doesn't have the corresponding statistics in row_group_meta, it also may need to go summarize_min_max_null_counts to set statistics with null., I think it isn't worth trying.

In fact, I have tried it, and it doesn't make much sense, they (row_group_meta.columns() and table_schema.fields())are not always aligned.

has_statistics = column.statistics().is_some();
});
has_statistics = row_group_meta
.columns()
.iter()
.any(|column| column.statistics().is_some());
}
}
statistics.num_rows = Precision::Exact(num_rows);
Expand Down