-
Couldn't load subscription status.
- Fork 1.7k
Description
Is your feature request related to a problem or challenge?
While working on #10926 @marvinlanhenke has an excellent question:
Why do we return row counts for a non existing column in row_group_row_counts?
(this was because there was some inconsistency between data pages and row counts)
The reason StatisticsConverter::row_group_counts returns row counts even for a non existent column is because it is the API needed for PruningStatistics here:
datafusion/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
Lines 386 to 390 in a923c65
| fn row_counts(&self, _column: &Column) -> Option<ArrayRef> { | |
| // row counts are the same for all columns in a row group | |
| StatisticsConverter::row_group_row_counts(self.metadata_iter()) | |
| .ok() | |
| .map(|counts| Arc::new(counts) as ArrayRef) |
It is possible to do because the ParquetMetadata knows how many row groups there are even when there are no row group statistics, but it doesn't make logical sense. Furthermore, for data pages, it is different if the "page index" is not present as then it doesn't even make sense to ask how many rows are in each data page as we don't have any data pages
Thus I think row_group_row_counts should also default to returning None if the column is not present, as @marvinlanhenke has done for null_counts_page_statistics in
So I guess my new proposal would be to return Option like:
impl<'a> StatisticsConverter<'a> {
...
/// return OK(None) if the column does not exist
pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> {
...
}
/// return OK(None) if the column does not exist
pub fn data_page_row_counts<I>(
&self,
column_offset_index: &ParquetOffsetIndex,
row_group_metadatas: &[RowGroupMetaData],
row_group_indices: I,
) -> Result<Option<UInt64Array>>
where
I: IntoIterator<Item = &'a usize>,
{
...
}
}The rationale to return an Option rather than an error is that creating and ignoring DataFusionError via ok() still requires a string allocation, which is wasteful
I realize this is done many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilent to incorrectly encoded parquet files rather than something that is "expected" to happen
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response