-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Clean up ArrowReaderMetadata::load_async
#7369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| let parquet_metadata = metadata_reader.decode_footer_metadata(&buf, &footer)?; | ||
|
|
||
| let parquet_metadata = metadata_reader.load_via_suffix_and_finish(self).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there was some pushback prior to defaulting to using suffix requests for loading metadata. (Although this looks like it's the impl on AsyncRead and AsyncSeek, and not the ParquetObjectReader, so it may be fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I recall. But the code that's replaced already did seeks from the end of the file. I assume it's ok here because it's not really an object store, just async access to a file in most instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember the discussion about suffix requests, but I went over this logic carefully and it makes a lot of sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to @Xuanwo 's comment here #6157 (comment). But I think that was specifically referring to Azure object storage, and here since we're working with the generic tokio traits, it should be fine.
adamreeve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me although I'm not very familiar with this part of the code. I just have one minor comment
| reader.load_page_index(input).await?; | ||
| metadata = Arc::new(reader.finish()?) | ||
| } | ||
| let metadata = input.get_metadata(Some(&options)).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the note in the doc comment above be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes it should. Thank you 🙏
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @etseidl -- this looks so much nicer to me.
Also thanks to @adamreeve and @kylebarron for the reviews 🙏
|
|
||
| let metadata_reader = ParquetMetaDataReader::new(); | ||
| let metadata_reader = ParquetMetaDataReader::new() | ||
| .with_page_indexes(options.is_some_and(|o| o.page_index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL: is_some_and 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank clippy 😅
|
|
||
| let parquet_metadata = metadata_reader.decode_footer_metadata(&buf, &footer)?; | ||
|
|
||
| let parquet_metadata = metadata_reader.load_via_suffix_and_finish(self).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember the discussion about suffix requests, but I went over this logic carefully and it makes a lot of sense to me
|
I also found myself with the urge to improve the documentation: |
|
Let's do it! |
|
Thanks again @etseidl @adamreeve and @kylebarron |
Which issue does this PR close?
Rationale for this change
ArrowReaderMetadata::load_asyncsometimes had to do multiple passes to fully load Parquet metadata when page indexes were requested. This is becauseAsyncFileReader::get_metadatafunction has no way of knowing if page indexes are desired. Recent API changes have allowed for passing this information toAsyncFileReader, so the extra page index logic inload_asyncshould no longer be necessary.This version will still do multiple fetches since no prefetch hint is passed to the metadata reader. A follow on PR could add this hint to
ArrowReaderOptions, but that would be a breaking change.What changes are included in this PR?
Convert
AsyncFileReader::get_metadatato use the newParquetMetaDataReader::load_via_suffix_and_finishAPI to reduce code duplication and add aMetadataSuffixFetchimplementation to allow its use inArrowReaderMetadata::load_async.Are there any user-facing changes?
No