-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add options to control various aspects of Parquet metadata decoding #8763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Here's an excerpt from a run of the new benchmark that shows the schema is actually skipped. This should get even faster with the metadata index (#8714) |
| .with_column_index_policy(self.column_index) | ||
| .with_metadata_options(self.metadata_options.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point I could see moving the page index policy into the MetadataOptions and then deprecating a bunch of setters.
parquet/src/file/metadata/parser.rs
Outdated
| // the credentials and keys needed to decrypt metadata | ||
| file_decryption_properties: Option<Arc<FileDecryptionProperties>>, | ||
| // metadata parsing options | ||
| metadata_options: Option<MetadataOptions>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if this should be Option<Arc<MetadataOptions>> everywhere.
|
This may help with #5999 |
| /// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder | ||
| #[derive(Default, Debug, Clone)] | ||
| pub struct MetadataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this means (1) User provided schema or (2) only (min, max, etc) columns in schema_descr be decoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's (1). Say you have a large number of files that share the same schema, there's no need to decode them all. Just grab the schema from the first file and use it for all the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a ticket that explains the use case a bit more;
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this API looks good to me (and actually closes an existing ticket)
| /// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder | ||
| #[derive(Default, Debug, Clone)] | ||
| pub struct MetadataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a ticket that explains the use case a bit more;
|
(I didn't approve it b/c it is still marked as a draft) |
Thanks @alamb. I'm still messing around with the API. I think I like hiding the new options object in the file reader APIs, and just exposing it for the metadata readers. The last wrinkle is figuring out a good way to share across the |
|
Ok, I think this is ready now. Right now I'm mildly against pulling in the page index policies. They are used at a higher level and I don't think it's worth the thrash to move them. Instead I want to focus on options that impact the |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl - this looks great to me
| supplied_schema: Option<SchemaRef>, | ||
| /// Policy for reading offset and column indexes. | ||
| pub(crate) page_index_policy: PageIndexPolicy, | ||
| /// Options to control reading of Parquet metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the ArrowReaderOptions and ArrowReaderMetadata structures and their use, and I agree this is the appropriate structure to add metadata parsing to.
Do you think it eventually makes sense to move the other fields from ArrowReaderOptions to ParquetMetaDataOptions? (e.g. supplied_schema)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking perhaps the page_index_policy, but the other things in ArrowReaderOptions are more Arrow specific rather than Parquet. That might get confusing.
| /// [Parquet Spec]: https://github.com/apache/parquet-format#metadata | ||
| pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> { | ||
| decode_metadata(buf) | ||
| decode_metadata(buf, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should start directing people to the push metadata decoder (the metadata reader is getting pretty complicated...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be nice. Maintaining two public APIs that do pretty much the same thing is a bit too much.
Co-authored-by: Andrew Lamb <[email protected]>
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @alamb. There's one last thing I have a question on that maybe you could opine on. 🙏
…erMetadata` (#8798) # Which issue does this PR close? - Related to #8763 - Related to #7973 # Rationale for this change While reviewing and hunting around for structs I got confused about the difference between `ArrowReaderOptions` and `ArrowReaderMetadata` # What changes are included in this PR? Add some doc strings to try and make it easier for people to navigate the documentation # Are these changes tested? by CI # Are there any user-facing changes? New documentation. No functional changes
|
If there are no objections I'll merge this tomorrow. A follow-on is already in the works to deal with the page encoding statistics (#8797), with other statistics objects to be handled after that. I've also got some speed ups for the thrift skipping code on tap (replace the current version that uses recursion with one that maintains state on the stack). |
Which issue does this PR close?
SchemaDescriptorPtracrossParquetMetadataobjects #5999Rationale for this change
This is a first attempt at an object to help control the parsing of the Parquet metadata.
What changes are included in this PR?
Adds a new
MetadataOptionsstruct, and plumbs it down into the Thrift decoder code. The only option for now is to pass in a schema, which then causes the decoder to skip decoding the schema contained in the footer.Also adds to the metadata bench to demonstrate the time savings from reusing the schema.
Are these changes tested?
Yes, adds a new test.
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.