Complete phase 1 of the Thrift remodel for the parquet crate #8530
base: main
Conversation
…8048)

**Note: this targets a feature branch, not main**

# Which issue does this PR close?

- Part of #5854.
- Related to #6129

# Rationale for this change

This is the first step towards a rework of Parquet metadata handling. This PR attempts to remove as many references as possible to the `parquet::format` module in the public API. This is done by creating new enums and structs that mirror their `format` counterparts and using them in publicly exposed structures like `FileMetaData`.

# What changes are included in this PR?

# Are these changes tested?

Current tests should suffice for now. More thorough tests will be added as needed.

# Are there any user-facing changes?

Yes, public-facing interfaces should no longer expose `format`.
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.
- Related to #6129

# Rationale for this change

Next step.

# What changes are included in this PR?

Attempt to make use of macros originally developed by @jhorstmann to implement some thrift enums and unions in `basic.rs`. Also adds yet another spin on a thrift decoder based heavily on `TCompactSliceInputProtocol`. The current approach is to use this new decoder with `TryFrom` impls on the data structures in question. This PR does not yet complete the process far enough to parse a parquet footer, but I'm putting it out here to get early feedback on the design.

Some structures already deviate enough from their thrift counterparts that the macro-based parsing is not practical. Likewise, the macro approach doesn't work for structs that need lifetime annotations (necessary for reading binary fields as slices).

# Are these changes tested?

Not yet.

# Are there any user-facing changes?

Yes.
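To illustrate the macro-based approach described above, here is a minimal sketch of a declarative macro that generates a thrift-style enum together with a `TryFrom<i32>` impl for decoding the wire value. The macro name (`thrift_enum!`) and the variant set are illustrative only, not the crate's actual macro or definitions:

```rust
// Hypothetical sketch: generate an enum plus a TryFrom<i32> decoder
// from one declaration, in the spirit of the macros described above.
macro_rules! thrift_enum {
    ($name:ident { $($variant:ident = $val:literal),* $(,)? }) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub enum $name {
            $($variant = $val),*
        }

        impl TryFrom<i32> for $name {
            type Error = String;
            fn try_from(v: i32) -> Result<Self, Self::Error> {
                match v {
                    // Each known wire value maps to its variant.
                    $($val => Ok($name::$variant),)*
                    _ => Err(format!("unknown {} value: {}", stringify!($name), v)),
                }
            }
        }
    };
}

// Illustrative values only; not the crate's actual Compression codes.
thrift_enum!(Compression {
    Uncompressed = 0,
    Snappy = 1,
    Gzip = 2,
});

fn main() {
    assert_eq!(Compression::try_from(1), Ok(Compression::Snappy));
    assert!(Compression::try_from(42).is_err());
}
```

One declaration yields both the Rust type and its decoder, which keeps the enum definitions in `basic.rs` close to their thrift counterparts.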
…aData` (#8111)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

# What changes are included in this PR?

This PR completes reading of the `FileMetaData` and `RowGroupMetaData` pieces of the `ParquetMetaData`. Column indexes and encryption will be follow-on work. This replaces the macro for generating structs with a more general one that can take visibility and lifetime specifiers. Also (temporarily) adds a new function `ParquetMetaDataReader::decode_file_metadata`, which should be a drop-in replacement for `ParquetMetaDataReader::decode_metadata`.

Still todo:
1. Add some tests that verify this produces the same output as `ParquetMetaDataReader::decode_metadata`
2. Read column indexes with new decoder
3. Read page headers with new decoder
4. Integrate with @alamb's push decoder work #8080

# Are these changes tested?

Not yet

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Speed

# What changes are included in this PR?

Still a work in progress, but begins the process of converting page index parsing to the new thrift decoder.

# Are these changes tested?

This PR actually uses the new decoder when parsing the page indexes using the existing machinery. As such, all tests involving the page indexes should apply to this code.

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Parsing the column index is _very_ slow. The largest part of the cost is taking the thrift structure (which is a struct of arrays) and converting it to an array of structs. This results in a large number of allocations when dealing with binary columns. This is an experiment in creating a new structure to hold the column index info that is a little friendlier to parse. It may also be easier to consume on the DataFusion side.

# What changes are included in this PR?

A new `ColumnIndexMetaData` enum is added along with a type-parameterized `NativeColumnIndex` struct.

# Are these changes tested?

No, this is an experiment only. If this work can be honed into an acceptable `Index` replacement, then tests will be added at that time.

# Are there any user-facing changes?

Yes, this would be a radical change to the column indexes in `ParquetMetaData`.
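The struct-of-arrays idea above can be sketched as follows. Keeping the page-index data in per-field vectors mirrors the thrift layout, so decoding fills one vector per field instead of allocating one struct per page. The field names and variants here are illustrative, not the crate's actual definitions:

```rust
// Hypothetical sketch of a struct-of-arrays column index,
// type-parameterized over the column's native type.
#[derive(Debug)]
pub struct NativeColumnIndex<T> {
    // One entry per page, kept in parallel vectors (struct of arrays),
    // matching the shape of the thrift ColumnIndex structure.
    pub null_pages: Vec<bool>,
    pub min_values: Vec<T>,
    pub max_values: Vec<T>,
    pub null_counts: Option<Vec<i64>>,
}

// Illustrative enum over per-type indexes; binary columns (the
// expensive case mentioned above) would hold byte ranges instead of
// one allocation per page.
#[derive(Debug)]
pub enum ColumnIndexMetaData {
    Int32(NativeColumnIndex<i32>),
    Int64(NativeColumnIndex<i64>),
}

fn main() {
    let idx = ColumnIndexMetaData::Int32(NativeColumnIndex {
        null_pages: vec![false, false],
        min_values: vec![1, 10],
        max_values: vec![9, 20],
        null_counts: None,
    });
    if let ColumnIndexMetaData::Int32(n) = &idx {
        assert_eq!(n.min_values, vec![1, 10]);
        assert_eq!(n.max_values, vec![9, 20]);
    }
}
```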
…ng of page indexes (#8190)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Add a custom parser for `PageLocation`, as the decoding of this struct is one of several hot spots.

# What changes are included in this PR?

This adds a faster means of obtaining the struct field ids to `ThriftCompactInputProtocol`. For a small struct (3 fields) with all of them required, we can save a good bit of time bypassing `ThriftCompactInputProtocol::read_field_begin`, which is very general and can handle out-of-order fields, among other things. By adding a new function `read_field_header`, we can avoid the costly branching that occurs when calculating the new field id (as well as the special handling needed for boolean fields). Field validation is then handled on the consuming side while decoding the `PageLocation` struct.

Note that to obtain the speedup seen, we need to assume the fields will always be in order and the field ids will all be encoded as field deltas. This is probably a fairly safe assumption, but there does exist the possibility of custom thrift writers that use absolute field ids. If we encounter such a writer in the wild, this change will need to be reverted.

# Are these changes tested?

These changes should be covered by existing tests.

# Are there any user-facing changes?

None beyond the changes in this branch.
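For context on the fast path described above: in the Thrift compact protocol, a short-form field header is a single byte whose high nibble is the field-id delta (1–15) and whose low nibble is the compact type id; a zero byte marks the end of a struct, and a zero delta with a nonzero type signals the long form (an absolute zigzag-varint field id follows). A minimal sketch of the idea, with the function name taken from the PR but the signature and error handling purely illustrative:

```rust
// Sketch: decode one compact-protocol field header byte, assuming
// in-order, delta-encoded fields. Validation of the (delta, type)
// pair is left to the caller, as described in the PR.
fn read_field_header(buf: &[u8], pos: &mut usize) -> Result<(u8, u8), String> {
    let b = *buf.get(*pos).ok_or("unexpected EOF")?;
    *pos += 1;
    if b == 0 {
        // STOP byte: end of struct.
        return Ok((0, 0));
    }
    let delta = b >> 4; // high nibble: field-id delta (1-15)
    let field_type = b & 0x0f; // low nibble: compact type id
    if delta == 0 {
        // Long form (absolute field id) needs the general reader.
        return Err("absolute field ids require the general reader".to_string());
    }
    Ok((delta, field_type))
}

fn main() {
    // 0x15 = delta 1, type 5 (i32); 0x16 = delta 1, type 6 (i64);
    // 0x00 = STOP.
    let buf = [0x15u8, 0x16, 0x00];
    let mut pos = 0;
    assert_eq!(read_field_header(&buf, &mut pos), Ok((1, 5)));
    assert_eq!(read_field_header(&buf, &mut pos), Ok((1, 6)));
    assert_eq!(read_field_header(&buf, &mut pos), Ok((0, 0)));
}
```

Compared with a general `read_field_begin`, this drops the branches for absolute field ids and for booleans (whose value is packed into the type nibble), which is where the savings come from.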
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Begins adding custom thrift write support.

# What changes are included in this PR?

Adds traits to aid in writing of thrift and modifies the thrift macros to generate writing code.

# Are these changes tested?

Yes, adds some round-trip tests to validate that encoded data can be decoded to the same state.

# Are there any user-facing changes?

No
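The core primitive a compact-protocol writer must produce is the integer encoding: signed values are zigzag-encoded and then written as ULEB128 varints. A self-contained sketch of that encode/decode pair, with a round-trip check in the spirit of the tests this PR adds (function names are illustrative):

```rust
// Compact-protocol integer encoding: zigzag + ULEB128 varint.
fn write_zigzag_varint(v: i64, out: &mut Vec<u8>) {
    // Zigzag maps small-magnitude signed values to small unsigned ones.
    let mut u = ((v << 1) ^ (v >> 63)) as u64;
    loop {
        let byte = (u & 0x7f) as u8;
        u >>= 7;
        if u == 0 {
            out.push(byte);
            break;
        }
        // High bit set means "more bytes follow".
        out.push(byte | 0x80);
    }
}

fn read_zigzag_varint(buf: &[u8], pos: &mut usize) -> i64 {
    let mut u: u64 = 0;
    let mut shift = 0;
    loop {
        let b = buf[*pos];
        *pos += 1;
        u |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo the zigzag mapping.
    ((u >> 1) as i64) ^ -((u & 1) as i64)
}

fn main() {
    // Round-trip a spread of values, as the PR's round-trip tests do
    // for whole structures.
    for v in [0i64, -1, 1, 150, -150, i64::MAX, i64::MIN] {
        let mut buf = Vec::new();
        write_zigzag_varint(v, &mut buf);
        let mut pos = 0;
        assert_eq!(read_zigzag_varint(&buf, &mut pos), v);
        assert_eq!(pos, buf.len());
    }
}
```

Because decode(encode(x)) == x holds for every field encoder, round-tripping whole structs through the macro-generated write and read paths is a natural way to validate the new write support.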
…ter decryption code (#8313)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Continuing the remodel

# What changes are included in this PR?

This begins the process of replacing the current footer parsing code with the new version. As part of this, much of the decryption machinery also needed to be changed.

# Are these changes tested?

Should be covered by existing tests

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

As I started on decoding thrift page headers, I found that the way I had been going was no longer going to work. This PR begins the process of abstracting the thrift reader to allow for other implementations.

# What changes are included in this PR?

In addition to reworking the reader itself, this PR moves away from the previous `TryFrom` approach and instead adds a `ReadThrift` trait.

# Are these changes tested?

Should be covered by existing tests

# Are there any user-facing changes?

Yes
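The `ReadThrift` approach can be sketched as a trait implemented by each decodable type, generic over a reader trait so that slice-based and stream-based protocol implementations are interchangeable. Everything below is an illustrative sketch, not the crate's actual API; the `PageLocation` field layout follows the Parquet thrift definition, but the decode logic is simplified (no field headers):

```rust
// Sketch: a reader abstraction plus a ReadThrift trait, replacing
// per-type TryFrom impls.
trait ThriftReader {
    fn read_i32(&mut self) -> Result<i32, String>;
    fn read_i64(&mut self) -> Result<i64, String>;
}

trait ReadThrift: Sized {
    fn read_thrift<R: ThriftReader>(reader: &mut R) -> Result<Self, String>;
}

// A trivial reader over pre-decoded values, standing in for the
// slice-based and io::Read-based implementations.
struct VecReader {
    values: Vec<i64>,
    pos: usize,
}

impl VecReader {
    fn next(&mut self) -> Result<i64, String> {
        let v = *self.values.get(self.pos).ok_or("unexpected EOF")?;
        self.pos += 1;
        Ok(v)
    }
}

impl ThriftReader for VecReader {
    fn read_i32(&mut self) -> Result<i32, String> {
        self.next().map(|v| v as i32)
    }
    fn read_i64(&mut self) -> Result<i64, String> {
        self.next()
    }
}

// Field layout mirrors the Parquet PageLocation thrift struct.
#[derive(Debug, PartialEq)]
struct PageLocation {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}

impl ReadThrift for PageLocation {
    fn read_thrift<R: ThriftReader>(reader: &mut R) -> Result<Self, String> {
        Ok(PageLocation {
            offset: reader.read_i64()?,
            compressed_page_size: reader.read_i32()?,
            first_row_index: reader.read_i64()?,
        })
    }
}

fn main() {
    let mut r = VecReader { values: vec![4096, 512, 1000], pos: 0 };
    let loc = PageLocation::read_thrift(&mut r).unwrap();
    assert_eq!(loc.offset, 4096);
    assert_eq!(loc.first_row_index, 1000);
}
```

Unlike `TryFrom<&[u8]>`, the trait keeps the decodable type decoupled from any one input representation, which is what lets page headers later be read from a stream.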
…ers (#8376)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This continues the remodel, moving on to `PageHeader` support.

# What changes are included in this PR?

Swaps out the old `format` page header structs for new ones. This also adds a `Read`-based implementation of the thrift compact protocol reader (the sizes of the Thrift-encoded page headers are not knowable in advance, so we need a way to read them from the thrift input stream used by the page decoder).

This PR also makes decoding of the `Statistics` in the page header optional (defaults to `false`). We do not use them, and the decode takes a good chunk of time.

# Are these changes tested?

These changes should be covered by existing tests

# Are there any user-facing changes?

Yes, page-level stats are no longer decoded by default
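Why a `Read`-based reader is needed here: the encoded size of a page header is not known until it has been decoded, so the bytes must be pulled incrementally from the page stream rather than sliced up front. A minimal sketch of decoding a varint directly from any `io::Read`, leaving the following page bytes unread (the function name is illustrative):

```rust
use std::io::Read;

// Sketch: decode a ULEB128 varint from a stream of unknown length,
// consuming only the bytes that belong to the varint.
fn read_varint<R: Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let mut byte = [0u8; 1];
        reader.read_exact(&mut byte)?;
        result |= ((byte[0] & 0x7f) as u64) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

fn main() {
    // 300 encodes as [0xAC, 0x02]; the trailing 0xFF belongs to the
    // next item in the stream and must be left unread.
    let data: &[u8] = &[0xac, 0x02, 0xff];
    let mut cursor = std::io::Cursor::new(data);
    assert_eq!(read_varint(&mut cursor).unwrap(), 300);
    assert_eq!(cursor.position(), 2);
}
```

A slice-based reader can validate lengths and borrow binary fields directly, but only a stream-based one can stop exactly at the end of a header whose size is unknown, which is why both implementations sit behind the same reader abstraction.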
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Closes #5854.

# Rationale for this change

Continues the remodel by implementing writing of the page index structures.

# What changes are included in this PR?

This PR removes the old `parquet::file::page_index::Index` enum and replaces it with the new `ColumnIndexMetaData` struct.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

Yes.
…8476)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This should complete the major work of the remodel. Follow-on tasks include performance tweaks, documentation, and adding the ability to skip unneeded structures when decoding. None of the latter should involve breaking changes.

# What changes are included in this PR?

The major change is removing all of the code to convert to `parquet::format` structures. The only places those are still used are in the benchmark and `bin` code, which are not strictly in the parquet crate. Once those are cleaned up we can deprecate the `format` module.

This also adds a markdown file documenting the use of the new Thrift macros.

# Are these changes tested?

Should be covered by existing tests.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <[email protected]>
Most recent benchmark (on Core i7 MacBook)

Benchmark on Intel i7-12700K
…8528)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This brings over changes to the handling of geo-spatial statistics introduced by @paleolimbot in #8520.

# What changes are included in this PR?

Primarily adds documentation and tests to changes already made. The only significant change is adding a `Default` implementation for `EdgeInterpolationAlgorithm`.

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <[email protected]>
FYI I am starting to test this branch with DataFusion, but it will take me a while as the arrow upgrade is quite substantial due to how DataTypes are represented.
I got far enough to start testing DataFusion with this actual branch. I am not done yet but am making good progress: #8513 (comment)
Updating this branch to get the latest fixes (so I can get the DataFusion tests to pass)
Update on testing: I haven't found any API problems yet in DataFusion (I had to make some changes, but it was all manageable). I am still reviewing / updating tests, but so far it looks good to me. I hope to be finished tomorrow.
Update: I have completed my testing (see #8513 (comment) for details). My conclusion is that the thrift-remodel branch is ready to merge to main.
Thanks for all the testing @alamb! I've run through my stash of parquet files and have found no issues as well. I'm marking this ready for review.
Which issue does this PR close?
Rationale for this change
See issue, but this change is needed to allow greater control over Parquet metadata handling. These changes speed up Thrift decoding about 2X, and enable further optimizations down the line, such as allowing selective decoding of only that metadata that is needed.
What changes are included in this PR?
In short, the `format` module is no longer necessary. This crate now decodes from Thrift-encoded bytes straight to the metadata objects defined by this crate (`ParquetMetaData` and children).

Are these changes tested?
Yes, but more should be added by follow-on PRs
Are there any user-facing changes?
Yes, many.
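For readers unfamiliar with where those Thrift-encoded bytes come from: a Parquet file ends with the serialized footer metadata, followed by a 4-byte little-endian length and the magic bytes `PAR1`. A minimal sketch of slicing out the byte range that a footer decoder consumes (the function name is illustrative, and real code must also handle the encrypted-footer magic `PARE`):

```rust
// Sketch: locate the Thrift-encoded footer metadata in a Parquet file.
// Layout at the end of the file: [metadata][u32 LE length]["PAR1"].
fn footer_metadata(file: &[u8]) -> Result<&[u8], String> {
    if file.len() < 8 || &file[file.len() - 4..] != b"PAR1" {
        return Err("not a parquet file".to_string());
    }
    let len_bytes: [u8; 4] = file[file.len() - 8..file.len() - 4]
        .try_into()
        .unwrap();
    let meta_len = u32::from_le_bytes(len_bytes) as usize;
    let end = file.len() - 8; // metadata ends where the length word begins
    let start = end
        .checked_sub(meta_len)
        .ok_or("metadata length out of range")?;
    Ok(&file[start..end])
}

fn main() {
    // A fake file: 3 metadata bytes, length = 3, then the magic.
    let mut file = vec![0xDEu8, 0xAD, 0xBE];
    file.extend_from_slice(&3u32.to_le_bytes());
    file.extend_from_slice(b"PAR1");
    assert_eq!(footer_metadata(&file).unwrap(), &[0xDE, 0xAD, 0xBE]);
}
```

The slice returned here is what the new decoder turns directly into `ParquetMetaData`, with no intermediate `format` structures.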