arrow-row: Add support for REE #7649

brancz · 2025-06-12T14:43:45Z

Which issue does this PR close?

Part 2 of apache/datafusion#16011

Are there any user-facing changes?

No user facing changes, just extending functionality of existing APIs to support extracting rows from REE arrays.

@alamb

alamb

Thank you for this contribution @brancz -- this is looking like a great start.

I think the only thing this PR needs prior to merge is

More tests (I listed a bunch of suggestions)
Documentation of the format used for REE arrays alongside the documentation for other types (e.g. here)

I left some other suggestions but I don't think they are needed in this PR (we can do them in a follow on PR or never)

alamb · 2025-06-12T21:11:32Z

arrow-row/src/lib.rs


 mod fixed;
 mod list;
+mod run;


I double checked and run is consistent with the naming of REEArray elsewhere in the crate 👍

alamb · 2025-06-12T21:13:15Z

arrow-row/src/lib.rs

                Ok(Self::Dictionary(converter, owned))
            }
+            DataType::RunEndEncoded(_, values) => {
+                // Similar to List implementation


Maybe we can pull the transformation into a documented helper function (not needed, I was just confused for a bit until I read the comments in the List/LargeList implementation)

alamb · 2025-06-12T21:17:13Z

arrow-row/src/run_test.rs

@@ -0,0 +1,55 @@
+// Licensed to the Apache Software Foundation (ASF) under one


I think it is more standard in this repo to put unit tests like this in the same module (aka I would expect this to be in arrow-row/src/run.rs)

Is there any reason it is in a different module?

no good reason, moved them into run.rs

done in 4f9b8f3

alamb · 2025-06-12T21:18:16Z

arrow-row/src/run.rs

+
+/// Encodes the provided `RunEndEncodedArray` to `out` with the provided `SortOptions`
+///
+/// `rows` should contain the encoded values


it would be really helpful if the format that this was creating was documented somewhere (so I don't have to reverse engineer it from the code to try and double check it)

alamb · 2025-06-12T21:23:36Z

arrow-row/src/run.rs

+    array: &RunArray<R>,
+) {
+    for (idx, offset) in offsets.iter_mut().skip(1).enumerate() {
+        let physical_idx = array.get_physical_index(idx);


This code is effectively going to decode the REE array (though I don't really see any way around that)

I do think you could make it more efficient by iterating over each run and then copying the value in a loop (rather than this, which does a binary search on the run ends for each idx)

good catch, I don't know why I wasn't thinking of that

turns out it's actually also more readable that way

done in 9c61fe9
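
For readers following the thread, here is a minimal sketch of the difference between the two approaches discussed above. It is not the code from 9c61fe9: it models the run ends as a plain `&[i32]` slice, and the function names `physical_index` and `physical_indices_by_run` are hypothetical, chosen only for illustration.

```rust
/// Per-index lookup (hypothetical helper): a binary search over the run
/// ends for every logical row, i.e. O(log runs) work per row.
fn physical_index(run_ends: &[i32], logical_idx: usize) -> usize {
    // Run ends are exclusive, so we want the first run whose end lies
    // beyond the logical index.
    run_ends.partition_point(|&end| (end as usize) <= logical_idx)
}

/// Per-run iteration (hypothetical helper): visit each run once and emit
/// its physical index for every logical row the run covers.
fn physical_indices_by_run(run_ends: &[i32]) -> Vec<usize> {
    let total = run_ends.last().map_or(0, |&end| end as usize);
    let mut out = Vec::with_capacity(total);
    let mut start = 0usize;
    for (physical_idx, &end) in run_ends.iter().enumerate() {
        let end = end as usize;
        // Every logical row in start..end maps to the same physical value.
        out.extend(std::iter::repeat(physical_idx).take(end - start));
        start = end;
    }
    out
}
```

For the array `["b", "b", "a"]` the run ends are `[2, 3]`, and both approaches map the logical rows to the same physical indices; the per-run version just avoids the repeated searches.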

alamb · 2025-06-12T21:24:36Z

arrow-row/src/run.rs

+    let opts = field.options;
+
+    // Track null values and collect row data to avoid borrow issues
+    let mut valid_flags = Vec::with_capacity(rows.len());


I think you could use BooleanBufferBuilder here which might be more efficient

Makes sense!

done in e39428f

alamb · 2025-06-12T21:25:38Z

arrow-row/src/run.rs

+    }
+
+    // Convert collected values to arrays
+    let mut values_rows = values_data.clone();


is this clone necessary? It seems like values_data isn't used afterwards

Walking borrow-checker, you're right!

done in 895598e

alamb · 2025-06-12T21:26:24Z

arrow-row/src/run.rs

+    // Get the count of elements before we move the vector
+    let element_count = run_ends.len();
+    let buffer = Buffer::from_vec(run_ends);
+    let run_ends_array = arrow_array::PrimitiveArray::<R>::new(


I think you could just do PrimitiveArray::<R>::from(run_ends) (which will internally do the buffer dance you are doing)

I feel like I tried something like that, but then couldn't because of type foo. Let me try again.

Maybe I'm still missing something but I managed to still simplify it by using ScalarBuffer::from instead of the raw buffer handling.

done in 3ad1e91

alamb · 2025-06-12T21:28:28Z

arrow-row/src/run.rs

+    );
+
+    // Update rows to consume the data we've processed
+    for i in 0..rows.len() {


I don't fully understand the consumption code, but it seems consistent with the ListArray implementation

In hindsight it made no sense whatsoever, I shouldn't rely on my >1 month old code being correct anymore 🙃

alamb · 2025-06-12T21:31:12Z

arrow-row/src/run_test.rs

+        assert!(rows2.row(0) < rows1.row(0));
+        assert!(rows2.row(1) < rows1.row(0));
+        assert!(rows1.row(2) < rows2.row(2));
+    }


I think we should add a few more tests:

Round tripping (e.g. that converting rows1 to an Array results in an array that is equal to run_array1 -- same for rows2)

Encoding / Decoding REE Arrays that have nulls

Encoding / Decoding REE arrays of some other value type (Int64Array for example)

Descending / nulls first/last sort orders

Good call, there were a lot of problems. Refactoring a bit then pushing the latest changes.

brancz · 2025-06-13T15:10:48Z

@alamb you can ignore all intermediate commits, the last commit pretty much re-writes everything and I think is also way more readable and understandable (and most of all correct).

Please take a look!

alamb

Thank you @brancz -- I spent some time reviewing and testing this PR and it looks good to me. I have some suggestions on reducing the duplication in tests, but I don't think that is needed (but would appreciate a follow on PR to do it!)

I think we need test coverage for the following cases (I'll make a PR)

Empty arrays
REE arrays with Int16 and Int64 types (that code is not covered I don't think)

alamb · 2025-06-16T19:58:07Z

arrow-row/src/run.rs

+            let out = &mut data[*offset..];
+
+            // Use variable-length encoding to make the data self-describing
+            let row = rows.row(physical_idx);


Some random performance optimization thoughts (for some future PR):

You could hoist this out of the inner loop so it was executed once per physical value rather than once per logical value

You could potentially encode row once and then simply copy the encoded bytes for all remaining rows. This is probably significantly faster than re-encoding the same value over and over again.

Tracking with

Improve performance of RunArray --> Row conversion #7693
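
As a rough illustration of the second idea above (encode once per run, then copy the bytes), here is a minimal sketch. It does not use the arrow-row encoder: `encode_value` is a hypothetical stand-in that writes a sign-flipped big-endian `i32` without the validity byte the real row format would carry, and `encode_runs` is likewise a made-up name.

```rust
/// Hypothetical stand-in for a row encoder: big-endian bytes with the sign
/// bit flipped, so that comparing the bytes matches comparing the numbers.
fn encode_value(v: i32) -> Vec<u8> {
    ((v as u32) ^ 0x8000_0000).to_be_bytes().to_vec()
}

/// Encode every logical row of a run-end encoded column, doing the actual
/// encoding only once per physical value.
fn encode_runs(run_ends: &[usize], values: &[i32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut start = 0;
    for (&end, &value) in run_ends.iter().zip(values) {
        // Encode once per run ...
        let encoded = encode_value(value);
        // ... then copy the same bytes for each logical row in the run.
        for _ in start..end {
            out.extend_from_slice(&encoded);
        }
        start = end;
    }
    out
}
```

With run ends `[2, 3]` and values `[7, -1]`, `encode_value` is called twice while three rows are written. Whether this is a measurable win in arrow-row itself is a question for the follow-on issue.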

alamb · 2025-06-16T20:07:14Z

arrow-row/src/run.rs

+            .downcast_ref::<RunArray<Int32Type>>()
+            .unwrap();
+
+        assert_eq!(array.run_ends().values(), result.run_ends().values());


I found it strange that this test didn't just test that array and result were equal

assert_eq!(array, result);

So I tried it locally, and it seems that it doesn't implement PartialEq

Maybe something we can add in a follow on PR

error[E0369]: binary operation `==` cannot be applied to type `RunArray<arrow_array::types::Int32Type>`
   --> arrow-row/src/run.rs:203:9
    |
203 |         assert_eq!(array, result);
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^
    |         |
    |         RunArray<arrow_array::types::Int32Type>
    |         &RunArray<arrow_array::types::Int32Type>
    |
note: the foreign item type `RunArray<arrow_array::types::Int32Type>` doesn't implement `PartialEq<&RunArray<arrow_array::types::Int32Type>>`
   --> /Users/andrewlamb/Software/arrow-rs/arrow-array/src/array/run_array.rs:63:1
    |
63  | pub struct RunArray<R: RunEndIndexType> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ not implement `PartialEq<&RunArray<arrow_array::types::Int32Type>>`
    = note: this error originates in the macro `assert_eq` (in Nightly builds, run with -Z macro-backtrace for more info)

Implement PartialEq for RunArray #7691

alamb · 2025-06-16T20:08:59Z

arrow-row/src/run.rs

+
+        let array: RunArray<Int32Type> = vec!["b", "b", "a"].into_iter().collect();
+
+        let converter = RowConverter::new(vec![SortField::new(DataType::RunEndEncoded(


These tests have a lot of boilerplate. Maybe we could make a function like assert_roundtrip(array: RunArray<..>) that captures the common pattern

Not necessary, just something I noticed while reviewing

Reduce repetition in tests for arrow-row/src/run.rs #7692

alamb · 2025-06-16T20:11:24Z

arrow-row/src/run.rs

+            .unwrap();
+
+        // Convert back to verify both configurations work
+        let result_test_asc = converter_asc.convert_rows(&rows_test_asc).unwrap();


I do think having an "assert_roundtrip" type function would make it clearer what was being tested here and would also make it easier to verify that the values were the same as well

alamb · 2025-06-16T20:16:07Z

arrow-row/src/run.rs

+        let result = arrays[0]
+            .as_any()
+            .downcast_ref::<RunArray<Int32Type>>()
+            .unwrap();


You can achieve the same thing with a little less code like this if you want

Suggested change

-        let result = arrays[0]
-            .as_any()
-            .downcast_ref::<RunArray<Int32Type>>()
-            .unwrap();
+        let result = arrays[0].as_run::<Int32Type>();

alamb · 2025-06-16T20:43:37Z

I think we need test coverage for the following cases (I'll make a PR)

Here is a PR:

Document REE row format and add some more tests #7680

alamb · 2025-06-17T19:05:46Z

I plan to merge this PR in and then file follow on issues for items found during review

adding partial eq to REE
refactor the round trip tests a bit to deduplicate code
only decode each physical value once

alamb · 2025-06-17T19:05:59Z

Thanks again @brancz

@brancz

~Draft until apache#7649 is merged~

# Which issue does this PR close?

- Follow on to apache#7649 from @brancz

# Rationale for this change

I noticed some extra testing and docs I would like to see so I made a PR to add them

# What changes are included in this PR?

1. Add docs + additional tests

# Are there any user-facing changes?

No code changes, only some docs (and more tests)

arrow-row: Add support for REE

3c71fab

github-actions bot added the arrow Changes to the arrow crate label Jun 12, 2025

brancz mentioned this pull request May 12, 2025

Support Aggregating by RunArrays apache/datafusion#16011

Open

4 tasks

alamb reviewed Jun 12, 2025


brancz added 6 commits June 13, 2025 10:48

arrow-row: Decode REE by decoding each run fully instead of one by one

9c61fe9

arrow-row: Use BooleanBufferBuilder instead of raw Vec when decoding

e39428f

arrow-row: Remove unnecessary clone

895598e

arrow-row: Use ScalarBuffer::from instead of raw buffers

3ad1e91

arrow-row: Move REE tests from separate file next to implementation

4f9b8f3

arrow-row: Add more REE tests and fix various bugs

ccfbd7b

alamb approved these changes Jun 16, 2025


alamb mentioned this pull request Jun 16, 2025

Document REE row format and add some more tests #7680

Merged

alamb merged commit 3837ac0 into apache:main Jun 17, 2025
26 checks passed

brancz deleted the arrow-row-ree branch June 17, 2025 19:28

brancz mentioned this pull request Jul 24, 2025

Release arrow-rs / parquet Major version 56.0.0 (July 2025) #7395

Closed

14 tasks
