-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Change default behavior for Parquet PageEncodingStats #8859 #8860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Decode page encoding stats to a compact EncodingMask by default
- Add PageEncodingStatsMode { Mask, Full, Skip } and ParquetPageEncodingStats
- Keep on-disk Parquet metadata format unchanged
- Update tests, benchmarks and docs for the new default
|
Thanks @etseidl for the clarification, that makes sense! 😭 I misunderstood the scope of #8859 and thought it needed more structural changes on top of what was already in main. If #8797 already implements most of this, I’m happy to close this PR and (once #8797 is merged) open a small follow-up that just flips the default for Appreciate the quick feedback 🙏 |
|
Thanks for your enthusiasm 😄 You implemented in an hour what it took me several weeks to arrive at 😅. Please take a look at our many open issues and see if there are any others that pique your interest. Given what you've submitted here I don't think you need to limit yourself to good first issues. Thanks again! |
| - feat: Improve DataType display for `RunEndEncoded` [\#8596](https://github.com/apache/arrow-rs/pull/8596) [[arrow](https://github.com/apache/arrow-rs/labels/arrow)] ([Weijun-H](https://github.com/Weijun-H)) | ||
| - Add `ArrowError::AvroError`, remaining types and roundtrip tests to `arrow-avro`, [\#8595](https://github.com/apache/arrow-rs/pull/8595) [[arrow](https://github.com/apache/arrow-rs/labels/arrow)] ([jecsand838](https://github.com/jecsand838)) | ||
| - \[thrift-remodel\] Refactor Thrift encryption and store encodings as bitmask [\#8587](https://github.com/apache/arrow-rs/pull/8587) [[parquet](https://github.com/apache/arrow-rs/labels/parquet)] ([etseidl](https://github.com/etseidl)) | ||
| - Change default page encoding stats decoding mode from `Full` to `Mask` in `ParquetMetaDataOptions` [\#8859](https://github.com/apache/arrow-rs/pull/8859) [[parquet](https://github.com/apache/arrow-rs/labels/parquet)] ([user](https://github.com/user)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PRs shouldn't be changing the changelog for an already released version
@joe-ucp I suspect this PR is created by claude/codex or a similar tool. This is totally fine but we need your help to review the PR first before asking us to review it
It is not yet an arrow project guideline, but I personally think the guidelines from https://datafusion.apache.org/contributor-guide/index.html#ai-assisted-contributions helps undertand the background and what we generally expect from contrubutors
Thanks again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Thanks for calling this out, and for the link to the AI-assisted contribution guidelines.
You’re right: I used an LLM/Copilot-style workflow to help draft this change. I did run cargo fmt, cargo clippy, and cargo test -p parquet, but I’m still pretty new to Git/GitHub flow and self-taught, so I missed that I shouldn’t touch the changelog for already released versions and that I was overlaping with the work in #8797.
For this PR I’m happy to let it be closed in favor of #8797.
I’ll make sure future contributions are better self-reviewed before I open a PR and that I follow the AI guidelines you linked. Thanks again for the feedback (and the patience)!
EncodingMaskby defaultPageEncodingStatsMode { Mask, Full, Skip }andParquetPageEncodingStatsWhich issue does this PR close?
PageEncodingStats#8859.Rationale for this change
Decoding page encoding statistics into a full
Vec<PageEncodingStats>for every column can allocate a lot of memory and is rarely needed by callers. A compact mask representation keeps enough information for most use cases (which encodings appear) while reducing allocations and metadata parsing cost.At the same time, we still want to support users who rely on the full stats, and also allow completely skipping stats when they are not needed.
What changes are included in this PR?
PageEncodingStatsModewith three modes:Mask(default): decode stats to a compactEncodingMaskFull: decode fullVec<PageEncodingStats>as beforeSkip: do not decode page encoding statsParquetPageEncodingStatsto store stats in memory as either:Mask(EncodingMask)orFull(Vec<PageEncodingStats>)ParquetMetaDataOptionswithpage_encoding_stats_mode, defaulting toMaskPageEncodingStatsModeinto the Thrift metadata decoder to:ColumnChunkMetaDatato useParquetPageEncodingStatsand add helpers:page_encoding_stats()page_encoding_stats_mask()page_encoding_stats_full()CHANGELOG.mdto describe the new default and how to opt into full statsAre these changes tested?
Yes:
cargo fmt -p parquetcargo clippy -p parquet -- -D warningscargo test -p parquetcargo bench -p parquet -- benches::metadataAre there any user-facing changes?
Yes:
Maskmode instead of always decoding fullVec<PageEncodingStats>.ParquetMetaDataOptions::with_page_encoding_stats_mode(PageEncodingStatsMode::Full)PageEncodingStatsMode::Skip.This is an in-memory behavior change intended for the next major release; the Parquet on-disk format is unchanged.