
Conversation

@etseidl (Contributor) commented Aug 11, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

What changes are included in this PR?

This PR completes reading of the FileMetaData and RowGroupMetaData pieces of the ParquetMetaData. Column indexes and encryption will be follow-on work.

This replaces the macro for generating structs with a more general one that can take visibility and lifetime specifiers. Also (temporarily) adds a new function ParquetMetaDataReader::decode_file_metadata which should be a drop-in replacement for ParquetMetaDataReader::decode_metadata.
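As a loose illustration (hypothetical names, not the PR's actual macro), a declarative Rust macro can be generalized to accept visibility and lifetime specifiers via the `vis` and `lifetime` fragment specifiers:

```rust
// Hypothetical sketch of a struct-generating macro that accepts a
// visibility and a lifetime (this is NOT the PR's real thrift_struct!).
macro_rules! thrift_struct_sketch {
    ($vis:vis struct $name:ident <$lt:lifetime> { $($fvis:vis $field:ident : $ty:ty),* $(,)? }) => {
        #[derive(Debug)]
        $vis struct $name<$lt> {
            $($fvis $field: $ty,)*
        }
    };
}

thrift_struct_sketch!(
    pub struct KeyValue<'a> {
        pub key: &'a str,
        pub value: Option<&'a str>,
    }
);

fn main() {
    let kv = KeyValue { key: "created_by", value: Some("parquet-rs") };
    assert_eq!(kv.key, "created_by");
    println!("{:?}", kv);
}
```

The `$vis:vis` fragment matches an optional visibility (including none at all), which is what lets a single macro generate both public and private structs.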

Still todo:

  1. Add some tests that verify this produces the same output as ParquetMetaDataReader::decode_metadata
  2. Read column indexes with new decoder
  3. Read page headers with new decoder
  4. Integrate with @alamb's push decoder work [Parquet] Add ParquetMetadataPushDecoder #8080

Are these changes tested?

Not yet

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 11, 2025
@alamb (Contributor) left a comment

I won't pretend to really understand this entire PR, but the basics look good to me

In general, it would be really nice to consolidate the thrift decoding somewhere separate rather than sprinkling it throughout the crate

I wonder if you could implement a Trait or something

```rust
pub trait FromThrift: Sized {
    /// reads an instance of `Self` from the input `prot`
    fn from_thrift(prot: &mut ThriftCompactInputProtocol) -> Result<Self>;
}
```

And then you could implement that trait for all the various structures that are read from the header 🤔
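For illustration, a minimal sketch of what implementing such a trait might look like, with a stand-in protocol type and a toy struct (none of this is the crate's real API):

```rust
use std::io::{Error, ErrorKind, Result};

/// Stand-in for the crate's compact-protocol reader (assumption, not the real type).
struct ThriftCompactInputProtocol<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ThriftCompactInputProtocol<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Read one unsigned LEB128 varint, as the thrift compact protocol does.
    fn read_varint(&mut self) -> Result<u64> {
        let mut v = 0u64;
        let mut shift = 0;
        loop {
            let b = *self
                .buf
                .get(self.pos)
                .ok_or_else(|| Error::new(ErrorKind::UnexpectedEof, "eof"))?;
            self.pos += 1;
            v |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Ok(v);
            }
            shift += 7;
        }
    }
}

/// The trait sketched in the review comment.
trait FromThrift: Sized {
    fn from_thrift(prot: &mut ThriftCompactInputProtocol<'_>) -> Result<Self>;
}

/// Toy struct standing in for one of the metadata structures.
#[derive(Debug, PartialEq)]
struct RowCount {
    rows: u64,
}

impl FromThrift for RowCount {
    fn from_thrift(prot: &mut ThriftCompactInputProtocol<'_>) -> Result<Self> {
        Ok(RowCount { rows: prot.read_varint()? })
    }
}

fn main() -> Result<()> {
    // 300 encoded as LEB128: 0xAC 0x02
    let mut prot = ThriftCompactInputProtocol::new(&[0xAC, 0x02]);
    let rc = RowCount::from_thrift(&mut prot)?;
    assert_eq!(rc, RowCount { rows: 300 });
    println!("{:?}", rc);
    Ok(())
}
```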

```rust
}
);

thrift_struct!(
```
@jhorstmann (Contributor) commented:

this is a pretty sweet way of generating structs BTW

@etseidl (Contributor, Author) replied:

Thanks @jhorstmann! I agree! 😄


```diff
 #[cfg(not(feature = "encryption"))]
-let base_expected_size = 2312;
+let base_expected_size = 2280;
```
@alamb (Contributor) commented:

do you know why this changes? It isn't clear to me why the metadata size gets smaller

@etseidl (Contributor, Author) replied:

I haven't tracked it down yet, but I suspect it might be some of the enums shrinking in size (they're back to rust enums rather than structs). I need to add some prints in the HeapSize calls to see exactly what's different.
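For context on that suspicion, a small illustration of why the representation matters (the type names here are made up; the crate's real types differ):

```rust
// A fieldless Rust enum needs only a discriminant byte, while thrift
// codegen often wraps the raw i32 wire value in a newtype struct.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum CompressionEnum {
    Uncompressed,
    Snappy,
    Gzip,
}

#[allow(dead_code)]
#[derive(Clone, Copy)]
struct CompressionStruct(i32);

fn main() {
    // 1 byte vs 4 bytes per instance; multiplied across every column
    // chunk, this kind of difference shows up in HeapSize totals.
    assert_eq!(std::mem::size_of::<CompressionEnum>(), 1);
    assert_eq!(std::mem::size_of::<CompressionStruct>(), 4);
}
```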

```rust
pub fn decode_file_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
    let mut prot = ThriftCompactInputProtocol::new(buf);

    // components of the FileMetaData
```
@alamb (Contributor) commented:

I wonder if this would be better / easier to find if it were a method on ParquetMetaData ?

```rust
impl ParquetMetaData {
    fn from_thrift_bytes(bytes: &[u8]) -> Result<Self> {
        ...
    }
}
```

🤔

@etseidl (Contributor, Author) replied:

I've added this for benchmarking purposes...I think eventually it would simply replace the existing decode_metadata.

```rust
let prot = &mut prot;

match field_ident.id {
    1 => {
```
@alamb (Contributor) commented:

Is there some way to use the magic thrift_struct! macro here rather than hard-coding the field ids? It isn't clear to me why this code is different

@etseidl (Contributor, Author) replied:

I've been punting on that for now...I have tried to simplify where I can (such as hiding the complexity of reading lists). The issue here is that the thrift FileMetaData contains the row group metadata, while in ParquetMetaData the crate's FileMetaData has the schema and the row group metadata is held separately. Similarly, thrift has a ColumnChunk that contains a ColumnMetaData, while we collapse those two structures into a single ColumnChunkMetaData.

I could go back to decoding to a private FileMetaData that is then pulled apart (as I've wound up doing for RowGroupMetaData), but I was trying to skip that step, thinking it would be faster. (For instance, processing the schema is quite expensive, so rather than allocating a vector of schema elements, parsing them, and then translating to TypePtr, here I pull the schema elements one at a time. That did cut down on the processing time, but by enough to justify the complexity? I'll have to revisit that.)

Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D
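A toy sketch of the two strategies described above, with a made-up element type standing in for the real SchemaElement/TypePtr machinery:

```rust
/// Toy "schema element" parsed from the input (illustrative only).
struct Element(u8);

/// Buffered strategy: first collect all elements into a Vec, then
/// translate them into the final representation in a second pass.
fn decode_buffered(buf: &[u8]) -> Vec<u32> {
    let elements: Vec<Element> = buf.iter().map(|&b| Element(b)).collect();
    elements.into_iter().map(|e| u32::from(e.0) * 2).collect()
}

/// Streaming strategy: pull each element and translate it immediately,
/// skipping the intermediate Vec<Element> allocation.
fn decode_streaming(buf: &[u8]) -> Vec<u32> {
    buf.iter().map(|&b| u32::from(Element(b).0) * 2).collect()
}

fn main() {
    let buf = [1u8, 2, 3];
    // Both strategies produce identical results; only the allocation
    // behavior (and thus the speed/complexity trade-off) differs.
    assert_eq!(decode_buffered(&buf), decode_streaming(&buf));
}
```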

@alamb (Contributor) replied:

> Back to the original question...hand coding is going to have some warts that can't be avoided. There may be a way to pretty it up some where we need custom parsers. Suggestions welcome :D

I think the key thing in my mind is that if/when something new is added to the spec, there is some straightforward pattern to follow to add support for the new parser

@etseidl (Contributor, Author) replied:

This is now moot...I wound up rolling back to using a temp FileMetaData as well. Since there are no string allocations done, the cost of decoding to the temp structures and then pretty much just taking ownership of their members during conversion winds up being acceptably fast.
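A minimal sketch of that pattern, with hypothetical struct names: decoding into a temp struct and then moving its members into the final one costs no extra allocation.

```rust
/// Hypothetical temp struct mirroring the thrift layout (not the real one).
struct TempFileMetaData {
    schema: Vec<u8>,
    num_rows: i64,
}

/// Hypothetical final struct in the crate's preferred layout.
struct FileMetaData {
    schema: Vec<u8>,
    num_rows: i64,
}

impl From<TempFileMetaData> for FileMetaData {
    fn from(t: TempFileMetaData) -> Self {
        // The Vec is moved, not copied: its heap buffer changes owner,
        // so the conversion is cheap as long as nothing is cloned.
        FileMetaData { schema: t.schema, num_rows: t.num_rows }
    }
}

fn main() {
    let tmp = TempFileMetaData { schema: vec![1, 2, 3], num_rows: 42 };
    let fm: FileMetaData = tmp.into();
    assert_eq!(fm.schema, vec![1, 2, 3]);
    assert_eq!(fm.num_rows, 42);
}
```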

```rust
}
}

// temp structs used to construct RowGroupMetaData
```
@alamb (Contributor) commented:

Is there some way to avoid reading into a temp structure and directly construct a RowGroupMetaData?

@etseidl (Contributor, Author) replied:

See above about FileMetaData 😄

I actually went down that road and it was slower. :( That may have been my incompetence, or it may be that the code got too bloated and the optimizer gave up.

```rust
}
);

#[cfg(feature = "encryption")]
```
@alamb (Contributor) commented:

Also, I suggest moving the code that does the thrift decoding into a different module from the ParquetMetaDataReader (which is basically a state machine for reading parts of the file into local buffers and decoding it)

Maybe parquet/src/file/metadata/from_thrift.rs 🤔

@etseidl (Contributor, Author) replied:

Yes, I want to consolidate at some point...I'm just not sure where it should live. There's `basic` already for the low-level enums and unions, with structs scattered about where they're used. Pulling them all into a single module (or maybe two) is a good idea.

@etseidl (Contributor, Author) added:

Now (largely) done. There are some exceptions that I've left as TODOs.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 14, 2025
@github-actions github-actions bot removed the arrow Changes to the arrow crate label Aug 14, 2025
@etseidl (Contributor, Author) commented Aug 14, 2025

> I wonder if you could implement a Trait or something

Yes, that's the path @jhorstmann took originally, and I'm now seeing the wisdom of it. I'll try that in the next iteration.

@etseidl (Contributor, Author) commented Aug 15, 2025

Latest run of the metadata bench, still comparing the old thrift-generated code to the new:

```text
open(default)           time:   [35.073 µs 35.124 µs 35.179 µs]
open(page index)        time:   [1.7446 ms 1.7524 ms 1.7639 ms]
decode parquet metadata time:   [34.361 µs 34.505 µs 34.735 µs]
decode thrift file metadata
                        time:   [22.239 µs 22.420 µs 22.644 µs]
decode parquet metadata new
                        time:   [19.438 µs 19.486 µs 19.542 µs]
decode parquet metadata (wide)
                        time:   [215.90 ms 216.69 ms 217.55 ms]
decode thrift file metadata (wide)
                        time:   [109.33 ms 109.73 ms 110.14 ms]
decode parquet metadata new (wide)
                        time:   [80.491 ms 80.788 ms 81.093 ms]
page headers            time:   [8.6857 µs 8.7251 µs 8.7688 µs]
```

No major changes since my last round (#8072 (comment)).

The new code is some 40% faster for the smaller schema, and 63% faster for the wide (1000 columns) schema.
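For the record, those percentages follow from the median times in the run above (simple arithmetic, shown here as a sketch):

```rust
/// Percentage speedup of `new` relative to `old` (same time unit for both).
fn pct_faster(old: f64, new: f64) -> f64 {
    (old - new) / old * 100.0
}

fn main() {
    // Median "decode parquet metadata" vs "decode parquet metadata new" (µs).
    let small = pct_faster(34.505, 19.486);
    // Median wide-schema (1000 columns) times (ms).
    let wide = pct_faster(216.69, 80.788);
    assert!((small - 43.5).abs() < 1.0); // "some 40% faster"
    assert!((wide - 62.7).abs() < 1.0);  // "63% faster"
    println!("small: {:.1}%, wide: {:.1}%", small, wide);
}
```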

On a heftier machine the speedup isn't quite as dramatic, but still around a 2X improvement.

```text
open(default)           time:   [14.366 µs 14.397 µs 14.432 µs]
open(page index)        time:   [641.58 µs 643.12 µs 644.71 µs]
decode parquet metadata time:   [13.787 µs 13.824 µs 13.862 µs]
decode thrift file metadata
                        time:   [9.6280 µs 9.6709 µs 9.7208 µs]
decode parquet metadata new
                        time:   [6.6013 µs 6.6182 µs 6.6364 µs]
decode parquet metadata (wide)
                        time:   [59.198 ms 59.337 ms 59.494 ms]
decode thrift file metadata (wide)
                        time:   [46.278 ms 46.377 ms 46.481 ms]
decode parquet metadata new (wide)
                        time:   [30.504 ms 30.572 ms 30.646 ms]
page headers            time:   [4.0325 µs 4.0426 µs 4.0539 µs]
```

I'm going to merge this now and move on to the page indexes. I did a quick test last night and was able to get the 1.75ms for open(page index) down to 850µs. There's some inefficiency in the way we decode the page indexes, so I'm going to see if I can cut that time down some more. (Assuming the compiler isn't doing something clever and reusing some vec memory.)

@etseidl etseidl merged commit f315dbe into apache:gh5854_thrift_remodel Aug 15, 2025
16 checks passed
@alamb (Contributor) commented Aug 15, 2025

> I did a quick test last night and was able to get the 1.75ms for open(page index) down to 850µs.

2x -- not bad

