Parquet: Add ability to project rowid in parquet reader #7444

@thinkharderdev

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Add a method

impl<T> ArrowReaderBuilder<T> {
    /// Projects an extra column named `field_name` holding each row's offset in the file.
    pub fn with_rowid(self, field_name: impl Into<String>) -> Self { ... }
}

that projects a column named field_name into the reader's output, containing each row's offset within the parquet file.

Describe the solution you'd like

A prototype implementation can be found here: coralogix@3d4a09f

If this seems like something we can merge upstream, I can create a PR against master in the upstream repo.

Describe alternatives you've considered

Not do it :)

Additional context

I'm trying to implement something like https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization in a way that does not require re-scanning metadata or re-scanning fields that have already been read and decoded.

The basic idea is that you have a parquet file with some projections and a TopK sort on some (ideally small) subset of those projections. So you can:

  1. Read the columns required for the TopK sort along with their row offsets.
  2. Build the TopK and discard everything else.
  3. Use the rowids from the TopK rows to build a RowSelection for the remaining columns.
  4. Read the remaining columns using that RowSelection.
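
Step 3 can be sketched roughly as follows. This is only an illustration that uses the standard library: `Run` and `row_ids_to_runs` are hypothetical names, not part of the parquet crate. `Run` is a stand-in for a skip/select run of rows, which would then be mapped onto the reader's real row selectors when building the RowSelection.

```rust
/// Hypothetical stand-in for a run-length row selector: skip or select
/// `count` consecutive rows of the file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Run {
    Skip(usize),
    Select(usize),
}

/// Converts file-level row offsets (e.g. the rowids of the TopK rows) into
/// alternating skip/select runs. The input may be unsorted; duplicates are
/// dropped.
pub fn row_ids_to_runs(mut row_ids: Vec<u64>) -> Vec<Run> {
    row_ids.sort_unstable();
    row_ids.dedup();

    let mut runs = Vec::new();
    let mut next_row: u64 = 0; // first file row not yet covered by a run

    for id in row_ids {
        // Rows between the previous selected row and this one are skipped.
        let gap = (id - next_row) as usize;
        if gap > 0 {
            runs.push(Run::Skip(gap));
        }
        // Extend the current select run if this row is adjacent to it,
        // otherwise start a new one.
        match runs.last_mut() {
            Some(Run::Select(n)) => *n += 1,
            _ => runs.push(Run::Select(1)),
        }
        next_row = id + 1;
    }
    runs
}
```

For example, rowids 7, 2, 3 and 9 become skip 2, select 2, skip 3, select 1, skip 1, select 1. Rows after the last selected rowid are simply not covered by any run.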

The current parquet reader can't support this when row filters are pushed down to the scan, because the position of a row in the output of step 1 no longer matches the offset of that row in the file.

But it is relatively straightforward to keep track of the offsets during the scan and return them alongside the data.
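
To illustrate the idea (this is not the prototype's actual code), here is a minimal sketch of such bookkeeping. `RowIdTracker` and its methods are hypothetical names, and the filter result is modelled as a plain `&[bool]` mask rather than an Arrow array.

```rust
/// Hypothetical helper that remembers the file-level offset of the next
/// row the reader will see.
pub struct RowIdTracker {
    next_offset: u64,
}

impl RowIdTracker {
    pub fn new() -> Self {
        Self { next_offset: 0 }
    }

    /// Accounts for rows that were skipped without being decoded
    /// (for example, by an existing row selection).
    pub fn skip_rows(&mut self, n: u64) {
        self.next_offset += n;
    }

    /// `keep[i]` says whether the i-th decoded row of this batch passed the
    /// pushed-down filter. Returns the file offsets of the surviving rows
    /// and advances past the whole batch.
    pub fn advance(&mut self, keep: &[bool]) -> Vec<u64> {
        let base = self.next_offset;
        self.next_offset += keep.len() as u64;
        keep.iter()
            .enumerate()
            .filter_map(|(i, &k)| k.then_some(base + i as u64))
            .collect()
    }
}
```

The returned offsets refer to positions in the file, so they stay valid even though the filtered output batch is shorter than the batch that was decoded.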

Labels: enhancement
