Skip to content

Conversation

@etseidl
Copy link
Contributor

@etseidl etseidl commented Apr 1, 2025

Which issue does this PR close?

Rationale for this change

ArrowReaderMetadata::load_async sometimes had to do multiple passes to fully load Parquet metadata when page indexes were requested. This is because AsyncFileReader::get_metadata function has no way of knowing if page indexes are desired. Recent API changes have allowed for passing this information to AsyncFileReader, so the extra page index logic in load_async should no longer be necessary.

This version will still do multiple fetches since no prefetch hint is passed to the metadata reader. A follow on PR could add this hint to ArrowReaderOptions, but that would be a breaking change.

What changes are included in this PR?

Convert AsyncFileReader::get_metadata to use the new ParquetMetaDataReader::load_via_suffix_and_finish API to reduce code duplication and add a MetadataSuffixFetch implementation to allow its use in ArrowReaderMetadata::load_async.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 1, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Apr 1, 2025

cc @adamreeve @kylebarron


let parquet_metadata = metadata_reader.decode_footer_metadata(&buf, &footer)?;

let parquet_metadata = metadata_reader.load_via_suffix_and_finish(self).await?;
Copy link
Member

@kylebarron kylebarron Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was some pushback prior to defaulting to using suffix requests for loading metadata. (Although this looks like it's the impl on AsyncRead and AsyncSeek, and not the ParquetObjectReader, so it may be fine)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I recall. But the code that's replaced already did seeks from the end of the file. I assume it's ok here because it's not really an object store, just async access to a file in most instances.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember the discussion about suffix requests, but I went over this logic carefully and it makes a lot of sense to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to @Xuanwo 's comment here #6157 (comment). But I think that was specifically referring to Azure object storage, and here since we're working with the generic tokio traits, it should be fine.

Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me although I'm not very familiar with this part of the code. I just have one minor comment

reader.load_page_index(input).await?;
metadata = Arc::new(reader.finish()?)
}
let metadata = input.get_metadata(Some(&options)).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the note in the doc comment above be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes it should. Thank you 🙏

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @etseidl -- this looks so much nicer to me.

Also thanks to @adamreeve and @kylebarron for the reviews 🙏


let metadata_reader = ParquetMetaDataReader::new();
let metadata_reader = ParquetMetaDataReader::new()
.with_page_indexes(options.is_some_and(|o| o.page_index));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL: is_some_and 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank clippy 😅


let parquet_metadata = metadata_reader.decode_footer_metadata(&buf, &footer)?;

let parquet_metadata = metadata_reader.load_via_suffix_and_finish(self).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember the discussion about suffix requests, but I went over this logic carefully and it makes a lot of sense to me

@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

I also found myself with the urge to improve the documentation:

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

Let's do it!

@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

Thanks again @etseidl @adamreeve and @kylebarron

@alamb alamb merged commit d1d2c95 into apache:main Apr 4, 2025
17 checks passed
@etseidl etseidl deleted the cleanup_async_reader branch May 28, 2025 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ArrowReaderMetadata API makes it too easy to (accidentally) make an additional object store request

4 participants