
Conversation

@etseidl (Contributor) commented Aug 11, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

What changes are included in this PR?

This PR completes reading of the FileMetaData and RowGroupMetaData pieces of the ParquetMetaData. Column indexes and encryption will be follow-on work.

This replaces the macro for generating structs with a more general one that can take visibility and lifetime specifiers. Also (temporarily) adds a new function ParquetMetaDataReader::decode_file_metadata which should be a drop-in replacement for ParquetMetaDataReader::decode_metadata.
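As a loose illustration (hypothetical names, not the PR's actual macro), a declarative Rust macro can be generalized to accept visibility and lifetime specifiers via the `vis` and `lifetime` fragment specifiers:

```rust
// Hypothetical sketch of a struct-generating macro that accepts a
// visibility and a lifetime (this is NOT the PR's real thrift_struct!).
macro_rules! thrift_struct_sketch {
    ($vis:vis struct $name:ident <$lt:lifetime> { $($fvis:vis $field:ident : $ty:ty),* $(,)? }) => {
        #[derive(Debug)]
        $vis struct $name<$lt> {
            $($fvis $field: $ty,)*
        }
    };
}

thrift_struct_sketch!(
    pub struct KeyValue<'a> {
        pub key: &'a str,
        pub value: Option<&'a str>,
    }
);

fn main() {
    let kv = KeyValue { key: "created_by", value: Some("parquet-rs") };
    assert_eq!(kv.key, "created_by");
    println!("{:?}", kv);
}
```

The `$vis:vis` fragment matches an optional visibility (including none at all), which is what lets a single macro generate both public and private structs.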

Still todo:

  1. Add some tests that verify this produces the same output as ParquetMetaDataReader::decode_metadata
  2. Read column indexes with new decoder
  3. Read page headers with new decoder
  4. Integrate with @alamb's push decoder work [Parquet] Add ParquetMetadataPushDecoder #8080

Are these changes tested?

Not yet

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 11, 2025
@alamb (Contributor) left a comment

I won't pretend to really understand this entire PR, but the basics look good to me

In general, it would be really nice to consolidate the thrift decoding somewhere separate rather than sprinkling it throughout the crate

I wonder if you could implement a Trait or something

```rust
pub trait FromThrift: Sized {
    /// reads an instance of `Self` from the input `prot`
    fn from_thrift(prot: &mut ThriftCompactInputProtocol) -> Result<Self>;
}
```

And then you could implement that trait for all the various structures that are read from the header 🤔
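For illustration, a minimal sketch of what implementing such a trait might look like, with a stand-in protocol type and a toy struct (none of this is the crate's real API):

```rust
use std::io::{Error, ErrorKind, Result};

/// Stand-in for the crate's compact-protocol reader (assumption, not the real type).
struct ThriftCompactInputProtocol<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ThriftCompactInputProtocol<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Read one unsigned LEB128 varint, as the thrift compact protocol does.
    fn read_varint(&mut self) -> Result<u64> {
        let mut v = 0u64;
        let mut shift = 0;
        loop {
            let b = *self
                .buf
                .get(self.pos)
                .ok_or_else(|| Error::new(ErrorKind::UnexpectedEof, "eof"))?;
            self.pos += 1;
            v |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Ok(v);
            }
            shift += 7;
        }
    }
}

/// The trait sketched in the review comment.
trait FromThrift: Sized {
    fn from_thrift(prot: &mut ThriftCompactInputProtocol<'_>) -> Result<Self>;
}

/// Toy struct standing in for one of the metadata structures.
#[derive(Debug, PartialEq)]
struct RowCount {
    rows: u64,
}

impl FromThrift for RowCount {
    fn from_thrift(prot: &mut ThriftCompactInputProtocol<'_>) -> Result<Self> {
        Ok(RowCount { rows: prot.read_varint()? })
    }
}

fn main() -> Result<()> {
    // 300 encoded as LEB128: 0xAC 0x02
    let mut prot = ThriftCompactInputProtocol::new(&[0xAC, 0x02]);
    let rc = RowCount::from_thrift(&mut prot)?;
    assert_eq!(rc, RowCount { rows: 300 });
    println!("{:?}", rc);
    Ok(())
}
```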

```rust
}
);

thrift_struct!(
```
@jhorstmann (Contributor) commented:

this is a pretty sweet way of generating structs BTW

@etseidl (Contributor, Author) replied:

Thanks @jhorstmann! I agree! 😄


```diff
 #[cfg(not(feature = "encryption"))]
-let base_expected_size = 2312;
+let base_expected_size = 2280;
```
@alamb (Contributor) commented:

do you know why this changes? It isn't clear to me why the metadata size gets smaller

@etseidl (Contributor, Author) replied:

I haven't tracked it down yet, but I suspect it might be some of the enums shrinking in size (they're back to rust enums rather than structs). I need to add some prints in the HeapSize calls to see exactly what's different.
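For context on that suspicion, a small illustration of why the representation matters (the type names here are made up; the crate's real types differ):

```rust
// A fieldless Rust enum needs only a discriminant byte, while thrift
// codegen often wraps the raw i32 wire value in a newtype struct.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum CompressionEnum {
    Uncompressed,
    Snappy,
    Gzip,
}

#[allow(dead_code)]
#[derive(Clone, Copy)]
struct CompressionStruct(i32);

fn main() {
    // 1 byte vs 4 bytes per instance; multiplied across every column
    // chunk, this kind of difference shows up in HeapSize totals.
    assert_eq!(std::mem::size_of::<CompressionEnum>(), 1);
    assert_eq!(std::mem::size_of::<CompressionStruct>(), 4);
}
```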

```rust
pub fn decode_file_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
    let mut prot = ThriftCompactInputProtocol::new(buf);

    // components of the FileMetaData
```
@alamb (Contributor) commented:

I wonder if this would be better / easier to find if it were a method on ParquetMetaData ?

```rust
impl ParquetMetaData {
    fn from_thrift_bytes(bytes: &[u8]) -> Result<Self> {
        ...
    }
}
```

🤔

@etseidl (Contributor, Author) replied:

I've added this for benchmarking purposes...I think eventually it would simply replace the existing decode_metadata.

```rust
let prot = &mut prot;

match field_ident.id {
    1 => {
```
@alamb (Contributor) commented:

Is there some way to use the magic thrift_struct! macro here rather than hard-coding the field ids? It isn't clear to me why this code is different

@etseidl (Contributor, Author) replied:

I've been punting on that for now...I have tried to simplify where I can (such as hiding the complexity of reading lists). The issue here is that the thrift FileMetaData contains the row group metadata, while in ParquetMetaData the crate's FileMetaData has the schema and the row group metadata is held separately. Similarly, thrift has a ColumnChunk that contains a ColumnMetaData, while we collapse those two structures into a single ColumnChunkMetaData.

I could go back to decoding to a private FileMetaData that is then pulled apart (as I've wound up doing for RowGroupMetaData), but I was trying to skip that step, thinking it would be faster. (For instance, processing the schema is quite expensive, so rather than allocating a vector of schema elements, parsing them, and then translating to TypePtr, here I pull the schema elements one at a time. That did cut down on the processing time, but by enough to justify the complexity? I'll have to revisit that.)

Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D
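A toy sketch of the two strategies described above, with a made-up element type standing in for the real SchemaElement/TypePtr machinery:

```rust
/// Toy "schema element" parsed from the input (illustrative only).
struct Element(u8);

/// Buffered strategy: first collect all elements into a Vec, then
/// translate them into the final representation in a second pass.
fn decode_buffered(buf: &[u8]) -> Vec<u32> {
    let elements: Vec<Element> = buf.iter().map(|&b| Element(b)).collect();
    elements.into_iter().map(|e| u32::from(e.0) * 2).collect()
}

/// Streaming strategy: pull each element and translate it immediately,
/// skipping the intermediate Vec<Element> allocation.
fn decode_streaming(buf: &[u8]) -> Vec<u32> {
    buf.iter().map(|&b| u32::from(Element(b).0) * 2).collect()
}

fn main() {
    let buf = [1u8, 2, 3];
    // Both strategies produce identical results; only the allocation
    // behavior (and thus the speed/complexity trade-off) differs.
    assert_eq!(decode_buffered(&buf), decode_streaming(&buf));
}
```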

@alamb (Contributor) replied:

> Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D

I think the key thing in my mind is that if/when something new is added to the spec, there is some straightforward pattern to follow to add support for the new parser

@etseidl (Contributor, Author) replied:

This is now moot...I wound up rolling back to using a temp FileMetaData as well. Since there are no string allocations done, the cost of decoding to the temp structures and then pretty much just taking ownership of their members during conversion winds up being acceptably fast.
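A minimal sketch of that pattern, with hypothetical struct names: decoding into a temp struct and then moving its members into the final one costs no extra allocation.

```rust
/// Hypothetical temp struct mirroring the thrift layout (not the real one).
struct TempFileMetaData {
    schema: Vec<u8>,
    num_rows: i64,
}

/// Hypothetical final struct in the crate's preferred layout.
struct FileMetaData {
    schema: Vec<u8>,
    num_rows: i64,
}

impl From<TempFileMetaData> for FileMetaData {
    fn from(t: TempFileMetaData) -> Self {
        // The Vec is moved, not copied: its heap buffer changes owner,
        // so the conversion is cheap as long as nothing is cloned.
        FileMetaData { schema: t.schema, num_rows: t.num_rows }
    }
}

fn main() {
    let tmp = TempFileMetaData { schema: vec![1, 2, 3], num_rows: 42 };
    let fm: FileMetaData = tmp.into();
    assert_eq!(fm.schema, vec![1, 2, 3]);
    assert_eq!(fm.num_rows, 42);
}
```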

```rust
}
}

// temp structs used to construct RowGroupMetaData
```
@alamb (Contributor) commented:

Is there some way to avoid reading into a temp structure and directly construct a RowGroupMetaData?

@etseidl (Contributor, Author) replied:

See above about FileMetaData 😄

I actually went down that road and it was slower. :( That may have been my incompetence, or it may be that the code got too bloated and the optimizer gave up.

```rust
}
);

#[cfg(feature = "encryption")]
```
@alamb (Contributor) commented:

Also, I suggest moving the code that does the thrift decoding into a different module from the ParquetMetaDataReader (which is basically a state machine for reading parts of the file into local buffers and decoding it)

Maybe parquet/src/file/metadata/from_thrift.rs 🤔

@etseidl (Contributor, Author) replied:

Yes, I want to consolidate at some point...I'm just not sure where it should live. There's `basic` already for the low-level enums and unions, with structs scattered about where they're used. Pulling them all into a single module (or maybe two) is a good idea.

@etseidl (Contributor, Author) added:

Now (largely) done. There are some exceptions that I've left as TODOs.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 14, 2025
@github-actions github-actions bot removed the arrow Changes to the arrow crate label Aug 14, 2025
@etseidl (Contributor, Author) commented Aug 14, 2025

> I wonder if you could implement a Trait or something

Yes, that's the path @jhorstmann took originally, and I'm now seeing the wisdom of it. I'll try that in the next iteration.

@etseidl (Contributor, Author) commented Aug 15, 2025

Latest run of the metadata bench, still comparing the old thrift-generated code to the new:

```text
open(default)           time:   [35.073 µs 35.124 µs 35.179 µs]
open(page index)        time:   [1.7446 ms 1.7524 ms 1.7639 ms]
decode parquet metadata time:   [34.361 µs 34.505 µs 34.735 µs]
decode thrift file metadata
                        time:   [22.239 µs 22.420 µs 22.644 µs]
decode parquet metadata new
                        time:   [19.438 µs 19.486 µs 19.542 µs]
decode parquet metadata (wide)
                        time:   [215.90 ms 216.69 ms 217.55 ms]
decode thrift file metadata (wide)
                        time:   [109.33 ms 109.73 ms 110.14 ms]
decode parquet metadata new (wide)
                        time:   [80.491 ms 80.788 ms 81.093 ms]
page headers            time:   [8.6857 µs 8.7251 µs 8.7688 µs]
```

No major changes since my last round (#8072 (comment)).

The new code is some 40% faster for the smaller schema, and 63% faster for the wide (1000 columns) schema.
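For the record, those percentages follow from the median times in the run above (simple arithmetic, shown here as a sketch):

```rust
/// Percentage speedup of `new` relative to `old` (same time unit for both).
fn pct_faster(old: f64, new: f64) -> f64 {
    (old - new) / old * 100.0
}

fn main() {
    // Median "decode parquet metadata" vs "decode parquet metadata new" (µs).
    let small = pct_faster(34.505, 19.486);
    // Median wide-schema (1000 columns) times (ms).
    let wide = pct_faster(216.69, 80.788);
    assert!((small - 43.5).abs() < 1.0); // "some 40% faster"
    assert!((wide - 62.7).abs() < 1.0);  // "63% faster"
    println!("small: {:.1}%, wide: {:.1}%", small, wide);
}
```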

On a heftier machine the speedup isn't quite as dramatic, but still around a 2X improvement.

```text
open(default)           time:   [14.366 µs 14.397 µs 14.432 µs]
open(page index)        time:   [641.58 µs 643.12 µs 644.71 µs]
decode parquet metadata time:   [13.787 µs 13.824 µs 13.862 µs]
decode thrift file metadata
                        time:   [9.6280 µs 9.6709 µs 9.7208 µs]
decode parquet metadata new
                        time:   [6.6013 µs 6.6182 µs 6.6364 µs]
decode parquet metadata (wide)
                        time:   [59.198 ms 59.337 ms 59.494 ms]
decode thrift file metadata (wide)
                        time:   [46.278 ms 46.377 ms 46.481 ms]
decode parquet metadata new (wide)
                        time:   [30.504 ms 30.572 ms 30.646 ms]
page headers            time:   [4.0325 µs 4.0426 µs 4.0539 µs]
```

I'm going to merge this now and move on to the page indexes. I did a quick test last night and was able to get the 1.75ms for open(page index) down to 850µs. There's some inefficiency in the way we decode the page indexes, so I'm going to see if I can cut that time down some more. (Assuming the compiler isn't doing something clever and reusing some vec memory.)

@etseidl etseidl merged commit f315dbe into apache:gh5854_thrift_remodel Aug 15, 2025
16 checks passed
@alamb (Contributor) commented Aug 15, 2025

> I did a quick test last night and was able to get the 1.75ms for open(page index) down to 850µs.

2x -- not bad

