Skip to content

Change StatisticsConverter::row_group_counts to return None for non existent columns in parquet files #10965

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

While working on #10926 @marvinlanhenke has an excellent question:

Why do we return row counts for a non existing column in row_group_row_counts?

(this was because there was some inconsistency between data pages and row counts)

The reason StatisticsConverter::row_group_counts returns row counts even for a non existent column is because it is the API needed for PruningStatistics here:

fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
// row counts are the same for all columns in a row group
StatisticsConverter::row_group_row_counts(self.metadata_iter())
.ok()
.map(|counts| Arc::new(counts) as ArrayRef)

It is possible to do because the ParquetMetadata knows how many row groups there are even when there are no row group statistics, but it doesn't make logical sense. Furthermore, for data pages, it is different if the "page index" is not present as then it doesn't even make sense to ask how many rows are in each data page as we don't have any data pages

Thus I think row_group_row_counts should also default to returning None if the column is not present, as @marvinlanhenke has done for null_counts_page_statistics in

So I guess my new proposal would be to return Option like:

impl<'a> StatisticsConverter<'a> {
...
  /// return OK(None) if the column does not exist
  pub(crate) fn null_counts_page_statistics<'a, I>(iterator: I) -> Result<Option<UInt64Array>> {
  ...
  }

  /// return OK(None) if the column does not exist
    pub fn data_page_row_counts<I>(
        &self,
        column_offset_index: &ParquetOffsetIndex,
        row_group_metadatas: &[RowGroupMetaData],
        row_group_indices: I,
    ) -> Result<Option<UInt64Array>>
    where
        I: IntoIterator<Item = &'a usize>,
    {
    ...
    }

}

The rationale to return an Option rather than an error is that creating and ignoring DataFusionError via ok() still requires a string allocation, which is wasteful

I realize this is done many places already in the statistics extraction code, but I think for those cases it is meant to make the code resilent to incorrectly encoded parquet files rather than something that is "expected" to happen

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions