Complete phase 1 of the Thrift remodel for the parquet crate #8530
base: main
Conversation
…8048)

**Note: this targets a feature branch, not main**

# Which issue does this PR close?

- Part of #5854.
- Related to #6129

# Rationale for this change

This is the first step towards a rework of Parquet metadata handling. This PR attempts to remove as many references as possible to the `parquet::format` module in the public API. This is done by creating new enums and structs that mirror their `format` counterparts and using them in publicly exposed structures like `FileMetaData`.

# What changes are included in this PR?

# Are these changes tested?

Current tests should suffice for now. More thorough tests will be added as needed.

# Are there any user-facing changes?

Yes, public-facing interfaces should no longer expose `format`.
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.
- Related to #6129

# Rationale for this change

Next step.

# What changes are included in this PR?

Attempt to make use of macros originally developed by @jhorstmann to implement some thrift enums and unions in `basic.rs`. Also adds yet another spin on a thrift decoder based heavily on `TCompactSliceInputProtocol`. The current approach is to use this new decoder with `TryFrom` impls on the data structures in question. This PR does not yet complete the process far enough to parse a parquet footer, but I'm putting it out here to get early feedback on the design.

Some structures already deviate enough from their thrift counterparts that the macro-based parsing is not practical. Likewise, the macro approach doesn't work for structs that need lifetime annotations (necessary for reading binary fields as slices).

# Are these changes tested?

Not yet.

# Are there any user-facing changes?

Yes.
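To illustrate the macro-based approach described above, here is a minimal sketch of a declarative macro that generates a thrift-style enum together with a `TryFrom<i32>` impl for decoding the wire value. The macro name (`thrift_enum!`) and the variant set are illustrative only, not the crate's actual macro or definitions:

```rust
// Hypothetical sketch: generate an enum plus a TryFrom<i32> decoder
// from one declaration, in the spirit of the macros described above.
macro_rules! thrift_enum {
    ($name:ident { $($variant:ident = $val:literal),* $(,)? }) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub enum $name {
            $($variant = $val),*
        }

        impl TryFrom<i32> for $name {
            type Error = String;
            fn try_from(v: i32) -> Result<Self, Self::Error> {
                match v {
                    // Each known wire value maps to its variant.
                    $($val => Ok($name::$variant),)*
                    _ => Err(format!("unknown {} value: {}", stringify!($name), v)),
                }
            }
        }
    };
}

// Illustrative values only; not the crate's actual Compression codes.
thrift_enum!(Compression {
    Uncompressed = 0,
    Snappy = 1,
    Gzip = 2,
});

fn main() {
    assert_eq!(Compression::try_from(1), Ok(Compression::Snappy));
    assert!(Compression::try_from(42).is_err());
}
```

One declaration yields both the Rust type and its decoder, which keeps the enum definitions in `basic.rs` close to their thrift counterparts.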
…aData` (#8111)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

# What changes are included in this PR?

This PR completes reading of the `FileMetaData` and `RowGroupMetaData` pieces of the `ParquetMetaData`. Column indexes and encryption will be follow-on work. This replaces the macro for generating structs with a more general one that can take visibility and lifetime specifiers. Also (temporarily) adds a new function `ParquetMetaDataReader::decode_file_metadata`, which should be a drop-in replacement for `ParquetMetaDataReader::decode_metadata`.

Still todo:
1. Add some tests that verify this produces the same output as `ParquetMetaDataReader::decode_metadata`
2. Read column indexes with new decoder
3. Read page headers with new decoder
4. Integrate with @alamb's push decoder work #8080

# Are these changes tested?

Not yet

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Speed

# What changes are included in this PR?

Still a work in progress, but begins the process of converting page index parsing to the new thrift decoder.

# Are these changes tested?

This PR actually uses the new decoder when parsing the page indexes using the existing machinery. As such, all tests involving the page indexes should apply to this code.

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Parsing the column index is _very_ slow. The largest part of the cost is taking the thrift structure (which is a struct of arrays) and converting it to an array of structs. This results in a large number of allocations when dealing with binary columns. This is an experiment in creating a new structure to hold the column index info that is a little friendlier to parse. It may also be easier to consume on the DataFusion side.

# What changes are included in this PR?

A new `ColumnIndexMetaData` enum is added along with a type-parameterized `NativeColumnIndex` struct.

# Are these changes tested?

No, this is an experiment only. If this work can be honed into an acceptable `Index` replacement, then tests will be added at that time.

# Are there any user-facing changes?

Yes, this would be a radical change to the column indexes in `ParquetMetaData`.
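The struct-of-arrays idea above can be sketched as follows. Keeping the page-index data in per-field vectors mirrors the thrift layout, so decoding fills one vector per field instead of allocating one struct per page. The field names and variants here are illustrative, not the crate's actual definitions:

```rust
// Hypothetical sketch of a struct-of-arrays column index,
// type-parameterized over the column's native type.
#[derive(Debug)]
pub struct NativeColumnIndex<T> {
    // One entry per page, kept in parallel vectors (struct of arrays),
    // matching the shape of the thrift ColumnIndex structure.
    pub null_pages: Vec<bool>,
    pub min_values: Vec<T>,
    pub max_values: Vec<T>,
    pub null_counts: Option<Vec<i64>>,
}

// Illustrative enum over per-type indexes; binary columns (the
// expensive case mentioned above) would hold byte ranges instead of
// one allocation per page.
#[derive(Debug)]
pub enum ColumnIndexMetaData {
    Int32(NativeColumnIndex<i32>),
    Int64(NativeColumnIndex<i64>),
}

fn main() {
    let idx = ColumnIndexMetaData::Int32(NativeColumnIndex {
        null_pages: vec![false, false],
        min_values: vec![1, 10],
        max_values: vec![9, 20],
        null_counts: None,
    });
    if let ColumnIndexMetaData::Int32(n) = &idx {
        assert_eq!(n.min_values, vec![1, 10]);
        assert_eq!(n.max_values, vec![9, 20]);
    }
}
```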
…ng of page indexes (#8190)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Add a custom parser for `PageLocation`, as the decoding of this struct is one of several hot spots.

# What changes are included in this PR?

This adds a faster means of obtaining the struct field ids to `ThriftCompactInputProtocol`. For a small struct (3 fields) with all of them required, we can save a good bit of time bypassing `ThriftCompactInputProtocol::read_field_begin`, which is very general and can handle out-of-order fields, among other things. By adding a new function `read_field_header`, we can avoid the costly branching that occurs when calculating the new field id (as well as the special handling needed for boolean fields). Field validation is then handled on the consuming side while decoding the `PageLocation` struct.

Note that to obtain the speedup seen, we need to assume the fields will always be in order and the field ids will all be encoded as field deltas. This is probably a fairly safe assumption, but there does exist the possibility of custom thrift writers that use absolute field ids. If we encounter such a writer in the wild, this change will need to be reverted.

# Are these changes tested?

These changes should be covered by existing tests.

# Are there any user-facing changes?

None beyond the changes in this branch.
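For context on the fast path described above: in the Thrift compact protocol, a short-form field header is a single byte whose high nibble is the field-id delta (1–15) and whose low nibble is the compact type id; a zero byte marks the end of a struct, and a zero delta with a nonzero type signals the long form (an absolute zigzag-varint field id follows). A minimal sketch of the idea, with the function name taken from the PR but the signature and error handling purely illustrative:

```rust
// Sketch: decode one compact-protocol field header byte, assuming
// in-order, delta-encoded fields. Validation of the (delta, type)
// pair is left to the caller, as described in the PR.
fn read_field_header(buf: &[u8], pos: &mut usize) -> Result<(u8, u8), String> {
    let b = *buf.get(*pos).ok_or("unexpected EOF")?;
    *pos += 1;
    if b == 0 {
        // STOP byte: end of struct.
        return Ok((0, 0));
    }
    let delta = b >> 4; // high nibble: field-id delta (1-15)
    let field_type = b & 0x0f; // low nibble: compact type id
    if delta == 0 {
        // Long form (absolute field id) needs the general reader.
        return Err("absolute field ids require the general reader".to_string());
    }
    Ok((delta, field_type))
}

fn main() {
    // 0x15 = delta 1, type 5 (i32); 0x16 = delta 1, type 6 (i64);
    // 0x00 = STOP.
    let buf = [0x15u8, 0x16, 0x00];
    let mut pos = 0;
    assert_eq!(read_field_header(&buf, &mut pos), Ok((1, 5)));
    assert_eq!(read_field_header(&buf, &mut pos), Ok((1, 6)));
    assert_eq!(read_field_header(&buf, &mut pos), Ok((0, 0)));
}
```

Compared with a general `read_field_begin`, this drops the branches for absolute field ids and for booleans (whose value is packed into the type nibble), which is where the savings come from.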
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Begins adding custom thrift write support.

# What changes are included in this PR?

Adds traits to aid in writing of thrift and modifies the thrift macros to generate writing code.

# Are these changes tested?

Yes, adds some round-trip tests to validate that encoded data can be decoded to the same state.

# Are there any user-facing changes?

No
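The core primitive a compact-protocol writer must produce is the integer encoding: signed values are zigzag-encoded and then written as ULEB128 varints. A self-contained sketch of that encode/decode pair, with a round-trip check in the spirit of the tests this PR adds (function names are illustrative):

```rust
// Compact-protocol integer encoding: zigzag + ULEB128 varint.
fn write_zigzag_varint(v: i64, out: &mut Vec<u8>) {
    // Zigzag maps small-magnitude signed values to small unsigned ones.
    let mut u = ((v << 1) ^ (v >> 63)) as u64;
    loop {
        let byte = (u & 0x7f) as u8;
        u >>= 7;
        if u == 0 {
            out.push(byte);
            break;
        }
        // High bit set means "more bytes follow".
        out.push(byte | 0x80);
    }
}

fn read_zigzag_varint(buf: &[u8], pos: &mut usize) -> i64 {
    let mut u: u64 = 0;
    let mut shift = 0;
    loop {
        let b = buf[*pos];
        *pos += 1;
        u |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo the zigzag mapping.
    ((u >> 1) as i64) ^ -((u & 1) as i64)
}

fn main() {
    // Round-trip a spread of values, as the PR's round-trip tests do
    // for whole structures.
    for v in [0i64, -1, 1, 150, -150, i64::MAX, i64::MIN] {
        let mut buf = Vec::new();
        write_zigzag_varint(v, &mut buf);
        let mut pos = 0;
        assert_eq!(read_zigzag_varint(&buf, &mut pos), v);
        assert_eq!(pos, buf.len());
    }
}
```

Because decode(encode(x)) == x holds for every field encoder, round-tripping whole structs through the macro-generated write and read paths is a natural way to validate the new write support.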
…ter decryption code (#8313)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

Continuing the remodel

# What changes are included in this PR?

This begins the process of replacing the current footer parsing code with the new version. As part of this, much of the decryption machinery also needed to be changed.

# Are these changes tested?

Should be covered by existing tests

# Are there any user-facing changes?

Yes
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

As I started on decoding thrift page headers, I found that the way I had been going was no longer going to work. This PR begins the process of abstracting the thrift reader to allow for other implementations.

# What changes are included in this PR?

In addition to reworking the reader itself, this PR moves away from the previous `TryFrom` approach and instead adds a `ReadThrift` trait.

# Are these changes tested?

Should be covered by existing tests

# Are there any user-facing changes?

Yes
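The `ReadThrift` approach can be sketched as a trait implemented by each decodable type, generic over a reader trait so that slice-based and stream-based protocol implementations are interchangeable. Everything below is an illustrative sketch, not the crate's actual API; the `PageLocation` field layout follows the Parquet thrift definition, but the decode logic is simplified (no field headers):

```rust
// Sketch: a reader abstraction plus a ReadThrift trait, replacing
// per-type TryFrom impls.
trait ThriftReader {
    fn read_i32(&mut self) -> Result<i32, String>;
    fn read_i64(&mut self) -> Result<i64, String>;
}

trait ReadThrift: Sized {
    fn read_thrift<R: ThriftReader>(reader: &mut R) -> Result<Self, String>;
}

// A trivial reader over pre-decoded values, standing in for the
// slice-based and io::Read-based implementations.
struct VecReader {
    values: Vec<i64>,
    pos: usize,
}

impl VecReader {
    fn next(&mut self) -> Result<i64, String> {
        let v = *self.values.get(self.pos).ok_or("unexpected EOF")?;
        self.pos += 1;
        Ok(v)
    }
}

impl ThriftReader for VecReader {
    fn read_i32(&mut self) -> Result<i32, String> {
        self.next().map(|v| v as i32)
    }
    fn read_i64(&mut self) -> Result<i64, String> {
        self.next()
    }
}

// Field layout mirrors the Parquet PageLocation thrift struct.
#[derive(Debug, PartialEq)]
struct PageLocation {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}

impl ReadThrift for PageLocation {
    fn read_thrift<R: ThriftReader>(reader: &mut R) -> Result<Self, String> {
        Ok(PageLocation {
            offset: reader.read_i64()?,
            compressed_page_size: reader.read_i32()?,
            first_row_index: reader.read_i64()?,
        })
    }
}

fn main() {
    let mut r = VecReader { values: vec![4096, 512, 1000], pos: 0 };
    let loc = PageLocation::read_thrift(&mut r).unwrap();
    assert_eq!(loc.offset, 4096);
    assert_eq!(loc.first_row_index, 1000);
}
```

Unlike `TryFrom<&[u8]>`, the trait keeps the decodable type decoupled from any one input representation, which is what lets page headers later be read from a stream.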
…ers (#8376)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This continues the remodel, moving on to `PageHeader` support.

# What changes are included in this PR?

Swaps out the old `format` page header structs for new ones. This also adds a `Read`-based implementation of the thrift compact protocol reader (the sizes of the Thrift-encoded page headers are not knowable in advance, so we need a way to read them from the thrift input stream used by the page decoder).

This PR also makes decoding of the `Statistics` in the page header optional (defaults to `false`). We do not use them, and the decode takes a good chunk of time.

# Are these changes tested?

These changes should be covered by existing tests

# Are there any user-facing changes?

Yes, page-level stats are no longer decoded by default
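Why a `Read`-based reader is needed here: the encoded size of a page header is not known until it has been decoded, so the bytes must be pulled incrementally from the page stream rather than sliced up front. A minimal sketch of decoding a varint directly from any `io::Read`, leaving the following page bytes unread (the function name is illustrative):

```rust
use std::io::Read;

// Sketch: decode a ULEB128 varint from a stream of unknown length,
// consuming only the bytes that belong to the varint.
fn read_varint<R: Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let mut byte = [0u8; 1];
        reader.read_exact(&mut byte)?;
        result |= ((byte[0] & 0x7f) as u64) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

fn main() {
    // 300 encodes as [0xAC, 0x02]; the trailing 0xFF belongs to the
    // next item in the stream and must be left unread.
    let data: &[u8] = &[0xac, 0x02, 0xff];
    let mut cursor = std::io::Cursor::new(data);
    assert_eq!(read_varint(&mut cursor).unwrap(), 300);
    assert_eq!(cursor.position(), 2);
}
```

A slice-based reader can validate lengths and borrow binary fields directly, but only a stream-based one can stop exactly at the end of a header whose size is unknown, which is why both implementations sit behind the same reader abstraction.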
# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Closes #5854.

# Rationale for this change

Continues the remodel by implementing writing of the page index structures.

# What changes are included in this PR?

This PR removes the old `parquet::file::page_index::Index` enum and replaces it with the new `ColumnIndexMetaData` struct.

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

Yes.
…8476)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This should complete the major work of the remodel. Follow-on tasks include performance tweaks, documentation, and adding the ability to skip unneeded structures when decoding. None of the latter should involve breaking changes.

# What changes are included in this PR?

The major change is removing all of the code to convert to `parquet::format` structures. The only places those are still used are in the benchmark and `bin` code, which are not strictly in the parquet crate. Once those are cleaned up we can deprecate the `format` module.

This also adds a markdown file documenting the use of the new Thrift macros.

# Are these changes tested?

Should be covered by existing tests.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <[email protected]>
Most recent benchmark (on Core i7 MacBook)

Benchmark on Intel i7-12700K
…8528)

# Which issue does this PR close?

**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

This brings over changes to the handling of geo-spatial statistics introduced by @paleolimbot in #8520.

# What changes are included in this PR?

Primarily adds documentation and tests to changes already made. The only significant change is adding a `Default` implementation for `EdgeInterpolationAlgorithm`.

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Matthijs Brobbel <[email protected]>
FYI I am starting to test this branch with DataFusion, but it will take me a while as the arrow upgrade is quite substantial due to how DataTypes are represented.
I got far enough to start testing DataFusion with this actual branch. I am not done yet but am making good progress: #8513 (comment)
Updating this branch to get the latest fixes (so I can get the DataFusion tests to pass)
Update on testing: I haven't found any API problems yet in DataFusion (I had to make some changes, but it was all manageable). I am still reviewing / updating tests, but so far it looks good to me. I hope to be finished tomorrow.
Update: I have completed my testing (see #8513 (comment) for details). My conclusion is that the thrift-remodel branch is ready to merge to main.
Thanks for all the testing @alamb! I've run through my stash of parquet files and have found no issues as well. I'm marking this ready for review.
Which issue does this PR close?
Rationale for this change
See issue, but this change is needed to allow greater control over Parquet metadata handling. These changes speed up Thrift decoding about 2X, and enable further optimizations down the line, such as allowing selective decoding of only that metadata that is needed.
What changes are included in this PR?
In short, the `format` module is no longer necessary. This crate now decodes from Thrift-encoded bytes straight to the metadata objects defined by this crate (`ParquetMetaData` and children).

Are these changes tested?
Yes, but more should be added by follow-on PRs
Are there any user-facing changes?
Yes, many.
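For readers unfamiliar with where those Thrift-encoded bytes come from: a Parquet file ends with the serialized footer metadata, followed by a 4-byte little-endian length and the magic bytes `PAR1`. A minimal sketch of slicing out the byte range that a footer decoder consumes (the function name is illustrative, and real code must also handle the encrypted-footer magic `PARE`):

```rust
// Sketch: locate the Thrift-encoded footer metadata in a Parquet file.
// Layout at the end of the file: [metadata][u32 LE length]["PAR1"].
fn footer_metadata(file: &[u8]) -> Result<&[u8], String> {
    if file.len() < 8 || &file[file.len() - 4..] != b"PAR1" {
        return Err("not a parquet file".to_string());
    }
    let len_bytes: [u8; 4] = file[file.len() - 8..file.len() - 4]
        .try_into()
        .unwrap();
    let meta_len = u32::from_le_bytes(len_bytes) as usize;
    let end = file.len() - 8; // metadata ends where the length word begins
    let start = end
        .checked_sub(meta_len)
        .ok_or("metadata length out of range")?;
    Ok(&file[start..end])
}

fn main() {
    // A fake file: 3 metadata bytes, length = 3, then the magic.
    let mut file = vec![0xDEu8, 0xAD, 0xBE];
    file.extend_from_slice(&3u32.to_le_bytes());
    file.extend_from_slice(b"PAR1");
    assert_eq!(footer_metadata(&file).unwrap(), &[0xDE, 0xAD, 0xBE]);
}
```

The slice returned here is what the new decoder turns directly into `ParquetMetaData`, with no intermediate `format` structures.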