Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of #5853 we are considering ways to improve Parquet metadata decoding speed for files with wide schemas (e.g. 10,000 columns).
One common observation is that many queries require only a small subset of the columns, but because of how standard Thrift decoders are implemented, they must decode the entire metadata even when only that subset is needed.
Due to Apache Thrift's variable length encoding, the decoder likely still needs to scan the entire metadata, but there is no need to create Rust structs for fields that will not be read.
I think simply skipping such fields would likely result in substantial savings. Some evidence for this is @jhorstmann's prototype, which avoids copying structs off the stack and yields a 2x performance improvement. See #5775 (comment)
Thus we could likely optimize metadata decoding for large schemas even further by selectively decoding only the fields that are needed. This idea is also described at a high level here: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
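To make the idea concrete, here is a minimal sketch of the field-skipping pattern. It assumes the `thrift` crate's `TInputProtocol` trait and its default `skip` method; the helper name and the `wanted` projection set are hypothetical and not existing parquet crate APIs:

```rust
use std::collections::HashSet;
use thrift::protocol::{TInputProtocol, TType};

/// Hypothetical helper: walk the fields of a single Thrift struct and only
/// decode the fields whose ids are in `wanted`; everything else is advanced
/// past with `skip`, so no Rust structs are materialized for it.
fn scan_struct_skipping<P: TInputProtocol>(
    prot: &mut P,
    wanted: &HashSet<i16>,
) -> thrift::Result<()> {
    prot.read_struct_begin()?;
    loop {
        let field = prot.read_field_begin()?;
        if field.field_type == TType::Stop {
            break;
        }
        if wanted.contains(&field.id.unwrap_or(-1)) {
            // A real decoder would deserialize the field into its Rust
            // representation here; this sketch just consumes its bytes.
            prot.skip(field.field_type)?;
        } else {
            // Field not requested: scan past its bytes without building
            // any Rust value for it.
            prot.skip(field.field_type)?;
        }
        prot.read_field_end()?;
    }
    prot.read_struct_end()?;
    Ok(())
}
```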
Describe the solution you'd like
Implement some form of projection pushdown when decoding metadata. Perhaps we could add a projection argument to this API: https://docs.rs/parquet/latest/parquet/file/footer/fn.decode_metadata.html (a possible signature is sketched below).
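As a sketch only (the function name, the projection type, and the behavior below are hypothetical, not part of the existing crate), the extended entry point might look something like:

```rust
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical projection-aware variant of
/// `parquet::file::footer::decode_metadata`. `projection` would list the
/// leaf column indices whose ColumnChunkMetaData the caller needs; Thrift
/// fields for all other columns would be skipped rather than materialized.
pub fn decode_metadata_with_projection(
    buf: &[u8],
    projection: Option<&[usize]>,
) -> Result<ParquetMetaData> {
    // Sketch only: a real implementation would walk the Thrift-encoded
    // footer and skip ColumnChunk entries not listed in `projection`.
    unimplemented!()
}
```

A call site could then pass e.g. `Some(&[0, 5, 17])` when only three columns are required, and `None` to preserve today's behavior of decoding everything.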
Describe alternatives you've considered
Additional context