Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As part of #5853 we are considering ways to improve the speed of decoding Parquet metadata for files with wide schemas (e.g. 10,000 columns).

One common observation is that many queries require only a small subset of the columns, but because of how standard Thrift decoders are implemented, they must decode the entire metadata even when only a subset of the columns is needed.

Due to Apache Thrift's variable-length encoding, the decoder likely still has to scan the entire metadata, but there is no need to create Rust structs for fields that will not be read.
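
To make the idea concrete, here is a minimal sketch of such field skipping, written against the upstream thrift crate's compact protocol rather than the parquet crate's internal decoder; read_num_rows is an invented helper, and the field id comes from FileMetaData in parquet.thrift:

```rust
use std::io::Cursor;

use thrift::protocol::{TCompactInputProtocol, TInputProtocol, TType};

/// Decode only `num_rows` (field id 3 in parquet.thrift's FileMetaData)
/// from a Thrift-encoded footer, skipping every other field.
///
/// The decoder still walks the variable-length encoding of the skipped
/// fields, but it never materializes schema elements or row groups as
/// Rust structs.
fn read_num_rows(thrift_bytes: &[u8]) -> thrift::Result<Option<i64>> {
    let mut prot = TCompactInputProtocol::new(Cursor::new(thrift_bytes));
    let mut num_rows = None;

    prot.read_struct_begin()?;
    loop {
        let field = prot.read_field_begin()?;
        if field.field_type == TType::Stop {
            break;
        }
        match field.id {
            // 3: required i64 num_rows
            Some(3) => num_rows = Some(prot.read_i64()?),
            // schema, row_groups, etc. are scanned past but never
            // decoded into Rust structs
            _ => prot.skip(field.field_type)?,
        }
        prot.read_field_end()?;
    }
    prot.read_struct_end()?;

    Ok(num_rows)
}
```

A projection-aware decoder would apply the same skip logic one level deeper, e.g. to individual ColumnChunk entries inside each RowGroup.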

I think simply skipping such fields would likely result in substantial savings. Some evidence for this: @jhorstmann's prototype, which avoids copying structs off the stack, yields a 2x performance improvement. See #5775 (comment)

Thus we could likely optimize metadata decoding for large schemas even further by selectively decoding only the fields that are needed. This idea is also described at a high level here: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4

Describe the solution you'd like
Implement some sort of projection pushdown when decoding metadata. Perhaps we could add a projection argument to this API: https://docs.rs/parquet/latest/parquet/file/footer/fn.decode_metadata.html
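
For example, the API could look something like the following. This is purely a hypothetical sketch: MetadataProjection and decode_metadata_with_projection are invented names, not existing parquet APIs.

```rust
// Hypothetical API sketch only; these names do not exist in parquet-rs.
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Which columns' metadata should be fully materialized.
pub enum MetadataProjection {
    /// Decode everything (the current behavior of `decode_metadata`).
    All,
    /// Only these leaf column indices; ColumnChunk metadata for all
    /// other columns would be skipped during Thrift decoding.
    Leaves(Vec<usize>),
}

/// Like `decode_metadata`, but only materializes the projected columns.
pub fn decode_metadata_with_projection(
    buf: &[u8],
    projection: &MetadataProjection,
) -> Result<ParquetMetaData> {
    // A real implementation would walk the Thrift compact encoding as in
    // the skipping sketch above, calling `skip` for column chunks whose
    // leaf index is not in `projection`.
    let _ = (buf, projection);
    unimplemented!("sketch only")
}
```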

Describe alternatives you've considered

Additional context

Labels: enhancement, parquet
