Closed
Labels
enhancement, good first issue
Description
Is your feature request related to a problem or challenge?
Part of #10922
We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.
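To illustrate why per-page min/max arrays help with pruning, here is a minimal sketch using only the standard library. It is not DataFusion's pruning code: `pages_may_contain` is a hypothetical helper, and plain slices of `Option<i64>` stand in for the `ArrayRef`s the real API returns.

```rust
/// Hypothetical helper (not part of DataFusion): for each page, report whether
/// it may contain a row equal to `value`, given per-page min/max statistics.
/// A `None` entry means the statistic is unknown for that page.
fn pages_may_contain(
    mins: &[Option<i64>],
    maxes: &[Option<i64>],
    value: i64,
) -> Vec<bool> {
    mins.iter()
        .zip(maxes.iter())
        .map(|(min, max)| match (min, max) {
            // Both bounds known: keep the page only if `value` is in range.
            (Some(lo), Some(hi)) => *lo <= value && value <= *hi,
            // A bound is missing: the page cannot be ruled out.
            _ => true,
        })
        .collect()
}
```

With mins `[-5, -4, 0, 5]` and maxes `[-1, 0, 4, 9]`, a lookup for the value `0` only needs to read the second and third pages.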
Describe the solution you'd like
Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the types above
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 637 to 656 in a923c65
```rust
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn min_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    get_data_page_statistics!(Min, data_type, iterator)
}

/// Extracts the max statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn max_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
```
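The shape of this conversion can be sketched with simplified stand-in types. Everything below is hypothetical: `PageStats` and `SimpleIndex` are not the parquet crate's types, a `Vec<Option<i64>>` stands in for the `ArrayRef`, and the sketch assumes the `usize` in each iterator item is the number of data pages, so that a missing index can be padded with nulls.

```rust
/// Hypothetical stand-in for one page's statistics.
#[derive(Debug, Clone, PartialEq)]
struct PageStats<T> {
    min: Option<T>,
    max: Option<T>,
}

/// Hypothetical, heavily simplified stand-in for a parquet page index.
#[derive(Debug, Clone, PartialEq)]
enum SimpleIndex {
    /// No page index was written for this column chunk.
    Missing,
    /// One `PageStats` entry per data page of an Int64 column.
    Int64(Vec<PageStats<i64>>),
}

/// Collects one value per page by applying `pick` to each page's statistics.
/// Assumption: the `usize` is the page count, used to emit nulls when the
/// index is missing.
fn page_statistics_i64<'a, I>(
    iterator: I,
    pick: fn(&PageStats<i64>) -> Option<i64>,
) -> Vec<Option<i64>>
where
    I: Iterator<Item = (usize, &'a SimpleIndex)>,
{
    let mut out = Vec::new();
    for (page_count, index) in iterator {
        match index {
            SimpleIndex::Int64(pages) => out.extend(pages.iter().map(pick)),
            SimpleIndex::Missing => {
                out.extend(std::iter::repeat(None).take(page_count))
            }
        }
    }
    out
}

/// Per-page minimums, in page order.
fn min_page_statistics_i64<'a, I>(iterator: I) -> Vec<Option<i64>>
where
    I: Iterator<Item = (usize, &'a SimpleIndex)>,
{
    page_statistics_i64(iterator, |p| p.min)
}

/// Per-page maximums, in page order.
fn max_page_statistics_i64<'a, I>(iterator: I) -> Vec<Option<i64>>
where
    I: Iterator<Item = (usize, &'a SimpleIndex)>,
{
    page_statistics_i64(iterator, |p| p.max)
}
```

Supporting a new data type in this sketch would mean adding an enum variant and a matching arm, which loosely mirrors extending the macro for additional types.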
Describe alternatives you've considered
You can follow the model from @Weijun-H in #10931
- Update the test for the listed data types to be `Check::Both`, following the model of `test_int_64`:

  datafusion/datafusion/core/tests/parquet/arrow_statistics.rs
  Lines 506 to 529 in a923c65

  ```rust
  async fn test_int_64() {
      // This creates a parquet files of 4 columns named "i8", "i16", "i32", "i64"
      let reader = TestReader {
          scenario: Scenario::Int,
          row_per_group: 5,
      }
      .build()
      .await;

      // since each row has only one data page, the statistics are the same
      Test {
          reader: &reader,
          // mins are [-5, -4, 0, 5]
          expected_min: Arc::new(Int64Array::from(vec![-5, -4, 0, 5])),
          // maxes are [-1, 0, 4, 9]
          expected_max: Arc::new(Int64Array::from(vec![-1, 0, 4, 9])),
          // nulls are [0, 0, 0, 0]
          expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
          // row counts are [5, 5, 5, 5]
          expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
          column_name: "i64",
          check: Check::Both,
      }
      .run();
  ```
- Add any required implementation in `get_data_page_statistics` (`macro_rules! get_data_page_statistics {`), following the model of the row counts (`macro_rules! make_stats_iterator {`).
Typically the change to the test looks like
```diff
- check: Check::RowGroup,
+ check: Check::Both,
```
Additional context
No response