Merged
26 changes: 25 additions & 1 deletion datafusion/datasource-parquet/src/file_format.rs
@@ -797,10 +797,34 @@ pub async fn fetch_statistics(
statistics_from_parquet_meta_calc(&metadata, table_schema)
}

/// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using ['StatisticsConverter`]
/// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using [`StatisticsConverter`]
///
/// The statistics are calculated for each column in the table schema
/// using the row group statistics in the parquet metadata.
///
/// # Key behaviors:
Contributor

👍

///
/// 1. Extracts row counts and byte sizes from all row groups
/// 2. Applies schema type coercions to align file schema with table schema
/// 3. Collects and aggregates statistics across row groups when available
///
/// # When there are no statistics:
///
/// If the Parquet file doesn't contain any statistics (`has_statistics` is false), the function returns a [`Statistics`] object with:
/// - Exact row count
/// - Exact byte size
/// - All column statistics marked as unknown via `Statistics::unknown_column(&table_schema)`
///
/// # When only some columns have statistics:
///
/// For columns with statistics:
/// - Min/max values are extracted and represented as `Precision::Exact`
/// - Null counts are calculated by summing across row groups
///
/// For columns without statistics:
/// - For min/max, there are two situations:
///   1. The column isn't in the arrow schema: min/max values are set to `Precision::Absent`
///   2. The column is in the arrow schema but not in the parquet schema due to schema evolution: min/max values are set to `Precision::Exact(null)`
Member Author

@xudong963 xudong963 Mar 20, 2025

In fact, I have questions about this behavior, shouldn't it be Precision::Absent?

Contributor

I think in this case, the default schema adapter will fill in the constant value null for all columns like this so Precision::Exact(null) is correct

However, as @adriangb found in #15263 and elsewhere, when users use custom schema adapters, a value other than NULL may be filled in

Maybe this is another place where the schema adapter could/should be used 🤔

Member Author

That makes sense, I'll try to make the potential bug surface, thanks @alamb

/// - Null counts are set to `Precision::Exact(num_rows)` (conservatively assuming all values could be null)
pub fn statistics_from_parquet_meta_calc(
metadata: &ParquetMetaData,
table_schema: SchemaRef,
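For readers following the doc comment, the per-column rules it describes can be sketched in isolation. This is a minimal illustration and not the DataFusion implementation: `Precision` here is a simplified local stand-in for DataFusion's enum of the same name, and `RowGroupStats`, `ColumnSource`, `ColumnSummary`, and `summarize_column` are hypothetical names introduced only for this example. The column is assumed to hold `i64` values, and `None` plays the role of the null scalar.

```rust
/// Simplified stand-in for DataFusion's `Precision`; the real enum also has
/// an `Inexact` variant, which this sketch does not need.
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Absent,
}

/// Hypothetical per-row-group statistics for one `i64` column.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
    pub null_count: usize,
}

/// Hypothetical classification of a table column relative to one file.
pub enum ColumnSource {
    /// The column has statistics in the row groups.
    WithStats(Vec<RowGroupStats>),
    /// The column isn't in the arrow schema.
    NotInArrowSchema,
    /// The column is in the arrow schema but not in the parquet schema
    /// (schema evolution).
    MissingFromParquet,
}

/// Hypothetical summary of one column; `None` stands in for a null scalar.
#[derive(Debug, PartialEq)]
pub struct ColumnSummary {
    pub min: Precision<Option<i64>>,
    pub max: Precision<Option<i64>>,
    pub null_count: Precision<usize>,
}

/// Applies the rules listed in the doc comment to a single column.
pub fn summarize_column(source: &ColumnSource, num_rows: usize) -> ColumnSummary {
    match source {
        // Min/max are aggregated across row groups; null counts are summed.
        ColumnSource::WithStats(groups) => ColumnSummary {
            min: Precision::Exact(groups.iter().map(|g| g.min).min()),
            max: Precision::Exact(groups.iter().map(|g| g.max).max()),
            null_count: Precision::Exact(groups.iter().map(|g| g.null_count).sum::<usize>()),
        },
        // Situation 1: min/max are absent; null count falls back to num_rows.
        ColumnSource::NotInArrowSchema => ColumnSummary {
            min: Precision::Absent,
            max: Precision::Absent,
            null_count: Precision::Exact(num_rows),
        },
        // Situation 2: min/max are an exact null; null count is num_rows.
        ColumnSource::MissingFromParquet => ColumnSummary {
            min: Precision::Exact(None),
            max: Precision::Exact(None),
            null_count: Precision::Exact(num_rows),
        },
    }
}
```

Calling `summarize_column(&ColumnSource::MissingFromParquet, 10)` in this sketch yields an exact null for min/max and an exact null count of 10, which matches the default-schema-adapter reasoning in the thread above. A custom schema adapter that fills in a non-null value would not be captured by this model.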