
DataFusion ignores "column order" parquet statistics specification #10586

@alamb

Describe the bug

As @tustvold points out, parquet defines a column_order API that DataFusion currently ignores entirely.

It is not entirely clear to me what the implications of ignoring this field are, or what other parquet writers populate it with, but we should probably not ignore it.

To Reproduce

No response

Expected behavior

No response

Additional context

To emphasise the point I made when this API was originally proposed, you need more than just the ParquetStatistics in order to correctly interpret the data. You need at least the FileMetadata to get the column_order (https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order) in order to be able to even interpret what the statistics mean for a given column.
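
To make the concern concrete, here is a rough sketch of one way ignoring the column order could matter. It is only an illustration, not DataFusion or parquet code: the `ColumnOrder` and `SortOrder` enums are local stand-ins loosely modeled on the parquet crate's types of the same names, and `compare_stat` and `can_prune_eq` are hypothetical helpers. The assumption is a 32-bit column whose min/max statistics are stored as raw `i32` values but are meant to be ordered as unsigned integers.

```rust
use std::cmp::Ordering;

/// Local stand-in for a sort order (hypothetical, not the parquet crate's type).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum SortOrder {
    Signed,
    Unsigned,
    Undefined,
}

/// Local stand-in for a column order (hypothetical, not the parquet crate's type).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnOrder {
    TypeDefined(SortOrder),
    Undefined,
}

/// Compare two raw 32-bit statistics values under the given column order.
/// Returns `None` when the order is not known, meaning the statistics
/// cannot be interpreted safely.
fn compare_stat(order: ColumnOrder, a: i32, b: i32) -> Option<Ordering> {
    match order {
        ColumnOrder::TypeDefined(SortOrder::Signed) => Some(a.cmp(&b)),
        // Reinterpret the same bits as unsigned before comparing.
        ColumnOrder::TypeDefined(SortOrder::Unsigned) => Some((a as u32).cmp(&(b as u32))),
        _ => None,
    }
}

/// Decide whether a row group with statistics [min, max] can be skipped
/// for the predicate `col = value`. Never prunes when the order is unknown.
fn can_prune_eq(order: ColumnOrder, min: i32, max: i32, value: i32) -> bool {
    let vs_min = compare_stat(order, value, min);
    let vs_max = compare_stat(order, value, max);
    matches!(
        (vs_min, vs_max),
        (Some(Ordering::Less), _) | (_, Some(Ordering::Greater))
    )
}
```

With `min = 1` and `max = 3_000_000_000u32 as i32` (a negative number when read as `i32`), `can_prune_eq` under `SortOrder::Unsigned` keeps the row group for `value = 5`, while the same call under `SortOrder::Signed` would skip it even though 5 lies inside the real range. As I understand the format, statistics for a column whose order is undefined should not be used at all, which the `None` branch models.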
