Add doc for the statistics_from_parquet_meta_calc method
#15330
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -797,10 +797,34 @@ pub async fn fetch_statistics( | |
| statistics_from_parquet_meta_calc(&metadata, table_schema) | ||
| } | ||
|
|
||
| /// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using ['StatisticsConverter`] | ||
| /// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using [`StatisticsConverter`] | ||
| /// | ||
| /// The statistics are calculated for each column in the table schema | ||
| /// using the row group statistics in the parquet metadata. | ||
| /// | ||
| /// # Key behaviors: | ||
| /// | ||
| /// 1. Extracts row counts and byte sizes from all row groups | ||
| /// 2. Applies schema type coercions to align file schema with table schema | ||
| /// 3. Collects and aggregates statistics across row groups when available | ||
| /// | ||
| /// # When there are no statistics: | ||
| /// | ||
| /// If the Parquet file doesn't contain any statistics (has_statistics is false), the function returns a Statistics object with: | ||
| /// - Exact row count | ||
| /// - Exact byte size | ||
| /// - All column statistics marked as unknown via Statistics::unknown_column(&table_schema) | ||
| /// # When only some columns have statistics: | ||
| /// | ||
| /// For columns with statistics: | ||
| /// - Min/max values are properly extracted and represented as Precision::Exact | ||
| /// - Null counts are calculated by summing across row groups | ||
| /// | ||
| /// For columns without statistics, | ||
| /// - For min/max, there are two situations: | ||
| /// 1. The column isn't in arrow schema, then min/max values are set to Precision::Absent | ||
| /// 2. The column is in arrow schema, but not in parquet schema due to schema evolution, min/max values are set to Precision::Exact(null) | |
|
Member
Author
In fact, I have questions about this behavior, shouldn't it be
Contributor
I think in this case the default schema adapter will fill in the constant value null for all columns like this, so Precision::Exact(null) is correct. However, as @adriangb found in #15263 and elsewhere, when users use custom schema adapters a value other than NULL is filled in. Maybe this is another place where the schema adapter could/should be used 🤔
Member
Author
That makes sense, I'll try to make the potential bug surface, thanks @alamb |
||
| /// - Null counts are set to Precision::Exact(num_rows) (conservatively assuming all values could be null) | ||
| pub fn statistics_from_parquet_meta_calc( | ||
| metadata: &ParquetMetaData, | ||
| table_schema: SchemaRef, | ||
|
|
||
👍
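For readers following the thread, the per-column rules in the new doc comment can be sketched as a small standalone model. This is only an illustration under stated assumptions: `Precision` here is a simplified stand-in for DataFusion's type, and `ChunkStats`, `ColumnSource`, `ColumnSummary`, and `summarize` are hypothetical names that do not exist in DataFusion. It also assumes a single Int64 column with statistics in every row group, and it uses `Exact(None)` to represent `Precision::Exact(null)`.

```rust
/// Simplified stand-in for DataFusion's `Precision`; not the real type.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Absent,
}

/// Hypothetical statistics of one Int64 column chunk in one row group.
#[derive(Debug, Clone, Copy)]
struct ChunkStats {
    min: i64,
    max: i64,
    null_count: usize,
}

/// Hypothetical description of where a table column comes from.
enum ColumnSource<'a> {
    /// Present in the file, with statistics in every row group (assumption).
    WithStats(&'a [ChunkStats]),
    /// The column isn't in the arrow schema.
    NotInArrowSchema,
    /// In the arrow schema, but not in the parquet schema (schema evolution).
    MissingFromParquet,
}

#[derive(Debug, PartialEq)]
struct ColumnSummary {
    /// `Exact(None)` models an exact NULL min/max value.
    min: Precision<Option<i64>>,
    max: Precision<Option<i64>>,
    null_count: Precision<usize>,
}

/// Apply the documented rules to one column of a file with `num_rows` rows.
fn summarize(source: &ColumnSource, num_rows: usize) -> ColumnSummary {
    match source {
        // Min/max are exact; null counts are summed across row groups.
        ColumnSource::WithStats(chunks) => ColumnSummary {
            min: Precision::Exact(chunks.iter().map(|c| c.min).min()),
            max: Precision::Exact(chunks.iter().map(|c| c.max).max()),
            null_count: Precision::Exact(chunks.iter().map(|c| c.null_count).sum::<usize>()),
        },
        // No statistics, column not in arrow schema: min/max are absent.
        ColumnSource::NotInArrowSchema => ColumnSummary {
            min: Precision::Absent,
            max: Precision::Absent,
            null_count: Precision::Exact(num_rows),
        },
        // No statistics, column missing from the file: min/max are exact NULL.
        ColumnSource::MissingFromParquet => ColumnSummary {
            min: Precision::Exact(None),
            max: Precision::Exact(None),
            null_count: Precision::Exact(num_rows),
        },
    }
}

fn main() {
    let chunks = [
        ChunkStats { min: 3, max: 9, null_count: 1 },
        ChunkStats { min: -2, max: 5, null_count: 0 },
    ];
    println!("{:?}", summarize(&ColumnSource::WithStats(&chunks), 20));
    println!("{:?}", summarize(&ColumnSource::NotInArrowSchema, 20));
    println!("{:?}", summarize(&ColumnSource::MissingFromParquet, 20));
}
```

The `MissingFromParquet` arm is the case questioned in the review: it hard-codes a NULL fill, which matches the default schema adapter as described above but would not reflect a custom adapter that fills in some other value.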