Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Add a method
impl ArrowReaderBuilder<T> {
    pub fn with_rowid(self, field_name: impl Into<String>) -> Self {...}
}
that projects a column named field_name into the reader's output, containing the offset of each row within the parquet file.
Describe the solution you'd like
A prototype implementation can be found here: coralogix@3d4a09f
If this seems like something we can merge upstream, I can open a PR against master in the upstream repo.
Describe alternatives you've considered
Not do it :)
Additional context
I'm trying to implement something like https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization in a way that does not require re-scanning metadata or re-scanning fields that have already been read and decoded.
The basic idea is that you have a parquet file with some projections and a TopK sort on some (ideally small) subset of those projections. So you can:
1. Read the columns required for the topk sort along with their row offsets
2. Build the topk and discard everything else
3. Use the rowids from the topk rows to build a RowSelection to read the remaining columns
4. Read the remaining columns using the row selection
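Turning the topk rowids into a selection could look roughly like the sketch below. It is std-only, so `Run` and `runs_from_row_ids` are hypothetical stand-ins, not part of the parquet crate; in real code I would expect the runs to map onto `RowSelector::skip(n)` / `RowSelector::select(n)` and then `RowSelection::from(Vec<RowSelector>)`. It assumes rowids are 0-based offsets into the file and that `total_rows` is known from the metadata.

```rust
/// Hypothetical stand-in for a skip/select run, mirroring the shape of
/// the parquet crate's `RowSelector` without depending on the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert a set of row offsets (in any order) into alternating
/// skip/select runs that cover exactly `total_rows` rows.
/// Offsets at or beyond `total_rows` are ignored.
pub fn runs_from_row_ids(mut row_ids: Vec<usize>, total_rows: usize) -> Vec<Run> {
    // TopK output is ordered by the sort key, not by file position,
    // so sort and de-duplicate before building runs.
    row_ids.sort_unstable();
    row_ids.dedup();

    let mut runs: Vec<Run> = Vec::new();
    let mut next = 0usize; // first row not yet covered by any run

    for id in row_ids {
        if id >= total_rows {
            break; // sorted, so everything after this is out of range too
        }
        if id > next {
            // Gap between the last selected row and this one.
            runs.push(Run::Skip(id - next));
        }
        // Extend the previous select run if this row is adjacent to it,
        // otherwise start a new one.
        let extended = match runs.last_mut() {
            Some(Run::Select(n)) => {
                *n += 1;
                true
            }
            _ => false,
        };
        if !extended {
            runs.push(Run::Select(1));
        }
        next = id + 1;
    }

    // Skip whatever is left after the last selected row.
    if next < total_rows {
        runs.push(Run::Skip(total_rows - next));
    }
    runs
}
```

For example, rowids 7, 2, 3, 8 in a 10-row file become skip 2, select 2, skip 3, select 2, skip 1.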
The current implementation of the parquet reader can't support this if you are pushing row filters down to the scan, because the positions of the rows returned by the scan in step 1 will not align with the offsets of those rows in the file.
But it is relatively straightforward to keep track of the offsets during the scan and just return them.
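To illustrate the mismatch, here is a minimal std-only sketch (not the prototype's actual code). `surviving_row_offsets` is a hypothetical helper; it assumes the scan knows the file offset of the first row in the current batch (`first_row`) and has a boolean mask saying which rows passed the filter.

```rust
/// Hypothetical helper: given the file offset of the first row in a batch
/// and a filter mask over that batch, return the file offsets of the rows
/// that survive the filter.
pub fn surviving_row_offsets(first_row: u64, keep: &[bool]) -> Vec<u64> {
    keep.iter()
        .enumerate()
        // Keep only the rows the filter let through.
        .filter(|(_, kept)| **kept)
        // Position in the batch plus the batch's starting offset
        // gives the row's offset in the file.
        .map(|(i, _)| first_row + i as u64)
        .collect()
}
```

With `first_row = 100` and a mask of `[true, false, false, true, true]`, the second surviving row sits at output position 1 but at file offset 103, which is why output positions alone cannot be used to build the selection.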