Add support for file row numbers in Parquet readers #7307

jkylling · 2025-03-18T18:37:35Z

Which issue does this PR close?

Closes #7299.

What changes are included in this PR?

In this PR we:

Add configuration to the ArrowReaderBuilder to set a row_number_column used to extend the read RecordBatches with an additional column with file row numbers.
Keep track of the first row number in each row group in the file. This is computed from the file metadata.
Add an ArrayReader to the vector of ArrayReaders reading columns from the Parquet file, if the row_number_column is set in the reader configuration. This is a RowNumberReader, which is a special ArrayReader. It reads no data from the Parquet pages, but uses the first row numbers in the RowGroupMetaData to keep track of progress.
Add some basic tests and fuzz tests of the functionality.

The RowGroupMetaData::first_row_number is Option<i64>, since it is possible that the row number is unknown (I encountered an instance of this when trying to integrate this PR in delta-rs), and it's better if None is used instead of some special integer value.

The performance impact of this PR should be negligible when the row number column is not set. The only additional overhead would be the tracking of the first_row_number of each row group.

Are there any user-facing changes?

We add an additional public method:

ArrowReaderBuilder::with_row_number_column

There are a few breaking changes as we touch a few public interfaces:

RowGroupMetaData::from_thrift and RowGroupMetaData::from_thrift_encrypted takes an additional parameter first_row_number: Optional<i64>.
The trait RowGroups has an additional method RowGroups::row_groups. Potentially this method could replace the RowGroups::num_rows method or provide a default implementation for it.
An additional error variant ParquetError::RowGroupMetaDataMissingRowNumber.

I'm very open to suggestions on how to reduce the amount of breaking changes.

etseidl · 2025-03-25T22:29:41Z

Thanks for you submission @jkylling, I'll try to get a first pass review done this week. In the meantime please add the Apache license to row_number.rs and correct the other lint errors. 🙏

jkylling · 2025-03-26T07:28:46Z

Thanks for you submission @jkylling, I'll try to get a first pass review done this week. In the meantime please add the Apache license to row_number.rs and correct the other lint errors. 🙏

Updated. Looking forward to the first review!

I was very confused as to why cargo format did not work properly, but looks like you are already aware of this (#6179) :)

etseidl

Partial review, just a few nits for now.

parquet/src/arrow/array_reader/builder.rs

parquet/src/arrow/arrow_reader/mod.rs

parquet/src/errors.rs

parquet/src/arrow/array_reader/builder.rs

etseidl

Thanks again @jkylling for taking this on. I've finished my first pass and have only one reservation. Otherwise it looks good and meets the criteria set forth in #7299 (comment).

etseidl · 2025-03-27T22:58:16Z

parquet/src/arrow/array_reader/row_number.rs

+            row_groups: VecDeque::from(
+                row_groups
+                    .into_iter()
+                    .map(TryInto::try_into)
+                    .collect::<Result<Vec<_>>>()?,
+            ),
+        })


I'm finding myself a bit uneasy with adding the first row number to the RowGroupMetaData. Rather than that, could this bit here instead be changed to keep track of the first row number while populating the deque? Is there some wrinkle I'm missing? Might the row groups be filtered before instantiating the RowNumberReader?

Answered my own question...it seems there's some complexity here at least when using the async reader.

Yes, I believe we don't have access to all row groups when creating the array readers.

I took a quick look at the corresponding Parquet reader implementations for Trino and parquet-java.

Trino:

Has a boolean to include a row number column, https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L112

Includes this column when the boolean is set: https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L337

Has a special block reader for reading row indexes https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L385-L393 I believe the positions play a similar role to our RowSelectors.

Gets row indexes from RowGroupInfo, a pruned version of https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L456

Populates the fileRowOffset by iterating through the row groups: https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ParquetMetadata.java#L107-L111

parquet-java:

Has a method for tracking the current row index: https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L150-L155

This row index is based on an iterator which starts form a row group row index, https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L311-L339

This row group row index is initialized by iterating through the row groups: https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1654-L1656 (mapping obtained here: https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1496-L1506)

Their approaches are rather similar to ours.

One take away is that the above implementations do not be keep the full RowGroupMetaDatas around as we do by requiring an iterator over RowGroupMetadata in the RowGroups trait. This is likely a good idea as this struct can be quite large. What do you think about changing the RowGroups trait to something like below?

/// A collection of row groups pub trait RowGroups { /// Get the number of rows in this collection fn num_rows(&self) -> usize { self.row_group_infos.iter().map(|info| info.num_rows).sum() } /// Returns a [`PageIterator`] for the column chunks with the given leaf column index fn column_chunks(&self, i: usize) -> Result<Box<dyn PageIterator>>; /// Returns an iterator over the row groups in this collection fn row_group_infos(&self) -> Box<dyn Iterator<Item = &RowGroupInfo> + '_>; } struct RowGroupInfo { num_rows: usize, row_index: i64, }

I don't think this is necessary...the full ParquetMetaData is already available everywhere this trait is implemented, so I don't see a need to worry about adding another metadata structure here.

etseidl · 2025-03-27T23:05:36Z

parquet/src/file/metadata/mod.rs

        self.num_rows
    }

+    /// Returns the first row number in this row group.


Suggested change

/// Returns the first row number in this row group.

/// Returns the global index number for the first row in this row group.

And perhaps use first_row_index instead? That may be clearer.

Agree. Updated.

alamb · 2025-03-28T15:47:10Z

Thanks @jkylling and @etseidl -- I think we need to be very careful to balance adding new features in the parquet reader with keeping it fast and maintainable. I haven't had a chance to look at this PR yet, but I do worry about performance and complexity

jkylling · 2025-04-09T16:05:05Z

@etseidl @alamb what would it take to move this PR forward? We could add benchmarks to settle concerns about performance, or refactor to try to reduce complexity?

etseidl · 2025-04-09T19:31:10Z

@etseidl @alamb what would it take to move this PR forward? We could add benchmarks to settle concerns about performance, or refactor to try to reduce complexity?

Sorry @jkylling, things have been rather hectic lately. I'll try to give it another look this week, along with some benchmarking (but I don't expect any perf hit). I'll just note that since this is a breaking change, it won't be able to be merged until the next major release (July-ish IIRC), so there's plenty of time to get this right. Also, I'll be deferring to those with more project history (e.g. @alamb @tustvold) as to whether the approach here is the best way to achieve the goal. Thank you for your contribution and your patience! 😄

alamb · 2025-04-10T16:18:58Z

Yeah, sorry I also have been slammed with many other projects. I'll try and find time to look but I suspect it may be a while

jkylling · 2025-04-12T15:08:07Z

Thank you for the update, and totally understand other responsibilities are taking up your time. I'll keep on being patient, and maybe do some minor improvements to this PR (use a smaller struct than the full RowGroupMetadata, and add some benchmarks for the RowNumberReader). Just want to make sure we have this PR ready before the next major release approaches.

etseidl · 2025-04-14T23:19:25Z

Yes, a benchmark that shows minimal impact with no row numbers would be nice (and hopefully adding row numbers won't be bad either 😄).

etseidl · 2025-04-14T23:26:15Z

parquet/src/file/metadata/reader.rs


+        let mut first_row_number = 0;
        let mut row_groups = Vec::new();
+        t_file_metadata.row_groups.sort_by_key(|rg| rg.ordinal);


Are these sorts necessary? Would the ordinal ever be out of order? They shouldn't be if I understand the encryption spec correctly.

scovich · 2025-04-16T03:25:35Z

I think we need to be very careful to balance adding new features in the parquet reader with keeping it fast and maintainable. I haven't had a chance to look at this PR yet, but I do worry about performance and complexity

100% agreed that simplicity and maintainability are paramount... but row numbers are a pretty fundamental feature that's very hard to emulate in higher layers if the parquet reader doesn't support them. Back when https://github.com/delta-io/delta first took a dependency on row numbers, spark's parquet reader did not yet support them; we had to disable row group pruning and other optimizations in order to make it (mostly) safe to manually compute row numbers in the query engine. It was really painful.

AFAIK, most parquet readers now support row numbers. We can add DuckDB and Iceberg to the ones already mentioned above. I was actually surprised to trip over this PR and learn that arrow-parquet does not yet support row numbers.

parquet/src/arrow/array_reader/row_number.rs

scovich · 2025-05-03T21:38:34Z

parquet/src/arrow/array_reader/builder.rs

    field: &ParquetField,
    mask: &ProjectionMask,
    row_groups: &dyn RowGroups,
+    row_number_column: Option<&str>,


Maybe a crazy idea, but wouldn't the implementation be simpler (and more flexible) with a RowNumber extension type? Then users could do e.g.

Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber))

and build_primitive_reader could just check for it, no matter where in the schema it hides, instead of implicitly adding an extra column to the schema?

Update: I don't think raw parquet types support metadata, so this may not be an option.

This would simplify usage of the feature. Having to keep track of the additional row number column is quite cumbersome in clients of this API. One option could be to extend ParquetFieldType with an additional row number type and add it based on the extension type in ArrowReaderMetadata::with_supplied_metadata? @etseidl @alamb what do you think about this approach?

What do other parquet readers do to represent row numbers in their output schema?

What do other parquet readers do to represent row numbers in their output schema?

#7307 (comment), posted Apr 15, might be a starting point?

AFAIK, most parquet readers now support row numbers. We can add DuckDB and Iceberg to the ones already mentioned above.

Duckdb uses a column schema type approach. Interestingly, that's new -- last time I looked (nearly a year go) it required the reader to pass options along with the schema, and one of the options was to request row numbers (which then became an extra unnamed column at the end of the regular schema). I think that approach didn't scale as they started needing more and more special column types. I see geometry, variant, and non-materialiaed expressions, for example.

Iceberg's parquet reader works almost exclusively from field ids, and row index has a baked in field id from the range of metadata row ids.

Spark uses a metadata column approach, identified by a special name (_metadata._rowid); I don't remember how precisely that maps to the underlying parquet reader.

🤔 maybe we could add an arrow extension type, similarly to what we are doing with Variant and Geometry -- so someone would request a column "foo: int64" with an arrow extension type of "row_number" or something 🤔

That could potentially work, but the problem is a row number column is only meaningful in a read request schema. Once row numbers hit the output, they're just normal int64 values from then on. Things get a lot harder to reason about if the extension type persists. For example, in a join of multiple tables, where each scan is producing row numbers for its respective files, one could easily end up with two row number columns in the join's output. And the parquet writer would definitely need to block writing such columns, or at least strip away the metadata?

Any chance we could also return this column with a field ID specified by the user? For Iceberg, we would like the field ID to be according to the spec, and ideally the record batches come already with such schema, rather than having to post-process them.

scovich · 2025-05-04T00:13:02Z

parquet/src/arrow/array_reader/row_number.rs

+struct RowGroupSizeIterator {
+    row_groups: VecDeque<RowGroupSize>,
+}
+
+impl RowGroupSizeIterator {
+    fn try_new<I>(row_groups: impl IntoIterator<Item = I>) -> Result<Self>
+    where
+        I: TryInto<RowGroupSize, Error = ParquetError>,


It seems like this whole RowGroupSizeIterator thing is a complicated and error-prone way of chaining several Range<i64>? Can we use standard iterator machinery instead?

pub(crate) struct RowNumberReader { buffered_row_numbers: Vec<i64>, remaining_row_numbers: std::iter::Flatten<std::vec::IntoIter<std::ops::Range<i64>>>, } impl RowNumberReader { pub(crate) fn try_new<'a>( row_groups: impl Iterator<Item = &'a RowGroupMetaData>, ) -> Result<Self> { let ranges = row_groups .map(|rg| { let first_row_number = rg.first_row_index().ok_or(ParquetError::General( "Row group missing row number".to_string(), ))?; Ok(first_row_number..first_row_number + rg.num_rows()) }) .collect::<Result<Vec<_>>>()?; Ok(Self { buffered_row_numbers: Vec::new(), remaining_row_numbers: ranges.into_iter().flatten(), }) } // Use `take` on a `&mut Iterator` to consume a number of elements without consuming the iterator. fn take(&mut self, batch_size: usize) -> impl Iterator<Item = i64> { (&mut self.remaining_row_numbers).take(batch_size) } } impl ArrayReader for RowNumberReader { fn read_records(&mut self, batch_size: usize) -> Result<usize> { let starting_len = self.buffered_row_numbers.len(); self.buffered_row_numbers.extend(self.take(batch_size)); Ok(self.buffered_row_numbers.len() - starting_len) } fn skip_records(&mut self, num_records: usize) -> Result<usize> { Ok(self.take(num_records).count()) }

This is much simpler. Thank you! I suspect we are missing out on some performance in skip_records with this, but the bulk of the data pruning will likely have happened by pruning Parquet row groups already.

@scovich I see you are involved in the maintenance of delta-kernel-rs. If you are interested, I've started on an implementation of deletion vector read support in delta-rs in this branch, based on a back port of an early version of this PR to arrow-54.2.1. The PR is still very rough, but the read path has got okay test coverage and it's able to read tables with deletion vectors produced by Spark correctly. The write support for deletion vectors is rudimentary (deletion vectors are only used for deletes when configured, and deleting from the same file twice is unsupported), and is mostly there to be able to unit test the read support. Unfortunately, I've not had time to wokr on this lately.

FWIW @zhuqi-lucas and I are working on improvements to the filter application here, which may result in some additional API churn:

Poc for adaptive parquet predicate pushdown(bitmap/range) with page cache(3 data pages) #7454

Co-authored-by: scovich <[email protected]>

16pierre · 2025-09-18T14:21:07Z

row numbers are a pretty fundamental feature that's very hard to emulate in higher layers if the parquet reader doesn't support them

+1 on this painpoint, working around this lack of capability from a client perspective is very challenging and comes with a bunch of correctness risks (e.g. we can write some rowIndex column client-side, but then we have to be 100% sure that the emitted rowIndex will perfectly match the Parquet files, which can get quite tricky especially in multi-threaded executions etc.)

Would love to this feature landing in Arrow-rs + Datafusion

alamb · 2025-09-19T15:15:19Z

row numbers are a pretty fundamental feature that's very hard to emulate in higher layers if the parquet reader doesn't support them

+1 on this painpoint, working around this lack of capability from a client perspective is very challenging and comes with a bunch of correctness risks (e.g. we can write some rowIndex column client-side, but then we have to be 100% sure that the emitted rowIndex will perfectly match the Parquet files, which can get quite tricky especially in multi-threaded executions etc.)

Would love to this feature landing in Arrow-rs + Datafusion

Thanks @16pierre -- I agree there is no good workaround for adding row numbers to the output of the parquet reader

I think the biggest thing we need to do is to sort out the API for "how does a user request the (virtual) row number column" as todays ProjectionMask is insufficient

@scovich and @etseidl 's idea to use some sort of Arrow metadata is interesting, but I am not quite sure how it would look

scovich · 2025-09-19T21:23:49Z

I think the biggest thing we need to do is to sort out the API for "how does a user request the (virtual) row number column" as todays ProjectionMask is insufficient

@scovich and @etseidl 's idea to use some sort of Arrow metadata is interesting, but I am not quite sure how it would look

The "standard" way in most engines I've seen would (in arrow-rs) include an extension type, that the parquet reader recognizes, in the parquet reader's read schema. Nice, because other readers could choose to honor the same extension type and produce row indexes as well. But it does open the question of whether the parquet reader's output should strip away the metadata -- since arguably the row indexes are just normal data once they've been produced -- and if not, how to prevent e.g. writing the values of a "row index" field back to parquet.

Is that a bearable approach? Of should we keep thinking of other ways?

## What changes are proposed in this pull request? This PR follows up on #1266 and adds support for reading the row index metadata column to the default engine. The implementation directly follows the approach proposed in #920 and slightly modifies it to match the new metadata column API. Quoting from #920 > Deletion vectors (and row tracking, eventually) rely on accurate file-level row indexes. But they're not implemented in the kernel's default parquet reader. That means we must rely on the position of rows in data batches returned by each read, and we cannot apply optimizations such as stats-based row group skipping (see #860). > > Add row index support to the default Parquet reader, in the form of a new RowIndex variant of ReorderIndexTransform. [...] The default parquet reader recognizes (the RowIndex metadata) column and injects a transform to generate row indexes (with appropriate adjustments for any row group skipping that might occur). > > Fixes #919 > > NOTE: If/when arrow-rs parquet reader gains native support for row indexes, e.g. apache/arrow-rs#7307, we should switch to using that. Our solution here is not robust to advanced parquet reader features like page-level skipping. row-level predicate pushdown, etc. ### This PR affects the following public APIs None - the breaking changes were introduced in #1266. ## How was this change tested? New UT. Co-authored-by: Zach Schuermann <[email protected]>

vustef · 2025-10-20T10:12:31Z

parquet/src/file/metadata/mod.rs

+    pub fn from_thrift(
+        schema_descr: SchemaDescPtr,
+        mut rg: RowGroup,
+        first_row_number: Option<i64>,


@etseidl seems to have refactored quite a lot of metadata / thrift functionality, but I'm wondering, since this function was used directly by delta-rs, and is thus considered an API, shouldn't we avoid changing it, and rather factor out the code that does the core of conversion, and add another entry-point that has first_row_number?

If this lands in arrow-57, it can be a breaking change. Otherwise... agree we need to be careful.

I merged main now and resolved these conflicts (next step is addressing the API questions). @jkylling could we make first_row_number non-optional now? Also now I think that it's not a breaking change, when it comes to metadata.

vustef · 2025-10-20T10:38:14Z

parquet/src/arrow/array_reader/builder.rs

Thrilled to see this!
Please let me know if I can help in any way. I can make it my top priority to work on this, as we need to make use of it in the next few weeks.
Our use-case is to leverage this from iceberg-rust, which uses ParquetRecordBatchStreamBuilder. The API seems to work for that, but I understand from other comments that it may not be the most desirable one - happy to help either with research/proposal or with the implementation of the chosen option.

Hey @vustef! I'd be very happy if you want to help get row number support into the Parquet reader, either with this PR or through other alternatives. If you want to pick up this PR I can give you commit rights to the branch? Sadly, I don't have capacity to work on this PR at the moment.

Hey @jkylling, yes, please do that if you can, happy to continue where you left.
I'd also need some guidance from @scovich and @alamb on the preferred path forward. And potentially help from @etseidl if I hit a wall with merging metadata changes that happened in the meanwhile (but more on that once I try it out).

vustef · 2025-10-20T14:13:47Z

parquet/src/arrow/arrow_reader/mod.rs

            selection: None,
            limit: None,
            offset: None,
+            row_number_column: None,


Should we also change schema and/or parquet_schema functions to include the row_number_column?

…r-row-numbers

vustef · 2025-10-23T13:10:03Z

@jkylling what tests did you use to run on each push? Wondering about the minimal dev loop I can do before pushing.

jkylling · 2025-10-23T13:23:10Z

@jkylling what tests did you use to run on each push? Wondering about the minimal dev loop I can do before pushing.

It's been a while since I worked on this, but this is what showed up in my shell history:

# variations of the below for running tests
cargo test -p parquet --lib --features async
cargo test -p parquet --lib --features async -- test_read_only_row_numbers
# Also this to format the crate https://github.com/apache/arrow-rs/issues/6179
cargo fmt -p parquet -- --config skip_children=true `find ./parquet -name "*.rs" \! -name format.rs`

…eature tests pass

vustef · 2025-10-24T15:15:21Z

@jkylling at some point I'm going to need to change the description of the PR, not sure how best to do that. Also perhaps it'd be better fit for the current stage to be a draft, until the API is agreed upon (although that is not too important).

etseidl · 2025-10-24T15:24:50Z

I agree that before putting too much effort into this PR we agree on the correct way to implement row numbers (I defer to @alamb and others for the arrow side of this).

One concern I have with the approach here is how to provide exact row numbers if we start selectively reading row group metadata. If we don't have metadata for all preceding row groups, we can't know the starting row number. This at least argues for reverting back to using an Option for the start index.

alamb · 2025-10-25T11:46:10Z

One concern I have with the approach here is how to provide exact row numbers if we start selectively reading row group metadata. If we don't have metadata for all preceding row groups, we can't know the starting row number. This at least argues for reverting back to using an Option for the start index.

I don't think we will be able to provide row numbers if we don't have all the preceding row group metadata

Given the main usecase I have heard so far is indexing and delete vectors, which require exact and accurate row numbers, I think it would be better if the reader simply returned an error if it was configured to read row numbers but didn't have enough information to do so

vustef · 2025-10-25T15:18:45Z

One concern I have with the approach here is how to provide exact row numbers if we start selectively reading row group metadata. If we don't have metadata for all preceding row groups, we can't know the starting row number. This at least argues for reverting back to using an Option for the start index.

I don't think we will be able to provide row numbers if we don't have all the preceding row group metadata

Given the main usecase I have heard so far is indexing and delete vectors, which require exact and accurate row numbers, I think it would be better if the reader simply returned an error if it was configured to read row numbers but didn't have enough information to do so

Sounds good, I'll bring back the Option, and make row number array reader error out if it lacks that information.

I'll hold off with pushing changes here until we finish discussion on the GH issue. And most likely will create a new PR, because I'm not able to update description of this one nor move it between draft and active. (If jkylling doesn't object. Will add due credits to him in there ofc, and keep his commits).

Add support for file row numbers in Parquet readers

f93d36e

github-actions bot added the parquet Changes to the parquet crate label Mar 18, 2025

jkylling mentioned this pull request Mar 18, 2025

Support file row number in Parquet reader #7299

Open

etseidl added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Mar 25, 2025

jkylling added 2 commits March 26, 2025 08:21

Add Apache license header to row_number.rs

e485c0b

Run cargo format

2a62009

etseidl reviewed Mar 26, 2025

View reviewed changes

parquet/src/arrow/array_reader/builder.rs Outdated Show resolved Hide resolved

parquet/src/arrow/arrow_reader/mod.rs Outdated Show resolved Hide resolved

parquet/src/errors.rs Outdated Show resolved Hide resolved

parquet/src/arrow/array_reader/builder.rs Outdated Show resolved Hide resolved

jkylling added 4 commits March 27, 2025 18:02

Change with_row_number_column to take impl Into<String>

fb5126f

Change Option<String> -> Option<&str> in build_array_reader

5350728

Replace ParquetError::RowGroupMetaDataMissingRowNumber with General

188f350

Split test_create_array_reader test into two

37a9d83

etseidl reviewed Mar 27, 2025

View reviewed changes

first_row_number -> first_row_index

41e38fe

jkylling requested a review from etseidl April 9, 2025 15:47

etseidl reviewed Apr 14, 2025

View reviewed changes

etseidl mentioned this pull request Apr 28, 2025

Parquet: Add ability to project rowid in parquet reader #7444

Closed

alamb mentioned this pull request Apr 29, 2025

Support file row index / row id for each file in a ListingTableProvider apache/datafusion#15892

Open

scovich reviewed May 6, 2025

View reviewed changes

scovich mentioned this pull request May 6, 2025

feat: Add row index support to the default parquet reader delta-io/delta-kernel-rs#920

Closed

Simplify RowNumberReader with iterators

1a1e6b6

Co-authored-by: scovich <[email protected]>

dpxcc mentioned this pull request Aug 12, 2025

[datafusion] Deletion filter Mooncake-Labs/moonlink#1369

Open

lbhm mentioned this pull request Sep 10, 2025

feat: Introduce row index metadata column delta-io/delta-kernel-rs#1272

Merged

vustef reviewed Oct 20, 2025

View reviewed changes

vustef added 4 commits October 23, 2025 13:14

Merge remote-tracking branch 'origin/main' into feature/parquet-reade…

bcad87f

…r-row-numbers

add parquet-testing change from the merge

89c1fd1

Fix test_arrow_reader_all_columns

b0d53d0

Fix first_row_number

094ae81

Rename to first_row_index consistently, remove Option.

a5858df

vustef added 3 commits October 23, 2025 15:25

revert parquet-testing update

5e7d9a1

Fix baselines in file::metadata::tests::test_memory_size

54c22c6

Fix encryption metadata and async tests. Those features and default f…

f05d470

…eature tests pass

etseidl marked this pull request as draft October 25, 2025 20:17

vustef mentioned this pull request Oct 27, 2025

General virtual columns support + row numbers as a first use-case #8715

Draft

	/// Returns the first row number in this row group.
	/// Returns the global index number for the first row in this row group.

Add support for file row numbers in Parquet readers #7307

Are you sure you want to change the base?

Add support for file row numbers in Parquet readers #7307

Uh oh!

Conversation

jkylling commented Mar 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

etseidl commented Mar 25, 2025

Uh oh!

jkylling commented Mar 26, 2025

Uh oh!

etseidl left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

etseidl left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkylling Mar 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Mar 28, 2025

Uh oh!

jkylling commented Apr 9, 2025

Uh oh!

etseidl commented Apr 9, 2025

Uh oh!

alamb commented Apr 10, 2025

Uh oh!

jkylling commented Apr 12, 2025

Uh oh!

etseidl commented Apr 14, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scovich commented Apr 16, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkylling commented Mar 18, 2025 •

edited

Loading

jkylling Mar 28, 2025 •

edited

Loading