Skip to content

Conversation

@brancz
Copy link
Contributor

@brancz brancz commented Jun 12, 2025

Which issue does this PR close?

Part 2 of apache/datafusion#16011

Are there any user-facing changes?

No user facing changes, just extending functionality of existing APIs to support extracting rows from REE arrays.

@alamb

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 12, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @brancz -- this is looking like a great start.

I think the only thing this PR needs prior to merge is

  1. More tests (I listed a bunch of suggestions)
  2. Documentation of the format used for REE arrays along side the documentation for other types (e.g. here)

I left some other suggestions but I don't think they are needed in this PR (we can do them in a follow on PR or never)


mod fixed;
mod list;
mod run;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked and run is consistent with the naming of REEArray elsewhere in the crate 👍

Ok(Self::Dictionary(converter, owned))
}
DataType::RunEndEncoded(_, values) => {
// Similar to List implementation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can pull the transformation into a documented helper function (not needed, I just was confused for a bit until I read the comments in the List/LargeList implementation

@@ -0,0 +1,55 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is more standard in this repo to put unit tests like this in the same module (aka I would expect this to be in arrow-row/src/run.rs

Is there any reason it is in a different module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no good reason, moved them into run.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 4f9b8f3


/// Encodes the provided `RunEndEncodedArray` to `out` with the provided `SortOptions`
///
/// `rows` should contain the encoded values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be really helpful if the format that this was creating was documented somewhere (so I don't have to reverse engineer it from the code to try and double check it)

array: &RunArray<R>,
) {
for (idx, offset) in offsets.iter_mut().skip(1).enumerate() {
let physical_idx = array.get_physical_index(idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is effectively going to deocde the REE array (though I don't really see any way around that)

I do think you could make it more efficient by iterating over each run and then copying the value in a loop (rather than this which does a binary search on the run ends for each idx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, I don't know why I wasn't thinking of that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out it's actually also more readable that way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 9c61fe9

let opts = field.options;

// Track null values and collect row data to avoid borrow issues
let mut valid_flags = Vec::with_capacity(rows.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could use BooleanBufferBuilder here which might be more efficient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in e39428f

}

// Convert collected values to arrays
let mut values_rows = values_data.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this clone necesssary? It seems like values_data isn't used afterwards

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walking borrow-checker, you're right!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 895598e

// Get the count of elements before we move the vector
let element_count = run_ends.len();
let buffer = Buffer::from_vec(run_ends);
let run_ends_array = arrow_array::PrimitiveArray::<R>::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could just do PrimitiveArray::<R>::from(run_ends (which will internally do the buffer dance you are doing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I tried something like that, but then couldn't because of type foo. Let me try again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm still missing something but I managed to still simplify it by using ScalarBuffer::from instead of the raw buffer handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 3ad1e91

);

// Update rows to consume the data we've processed
for i in 0..rows.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the consumption code, but it seems consistent with the ListArray implemenation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In hindsight it made no sense whatsoever, I shouldn't rely on my >1 month old code being correct anymore 🙃

assert!(rows2.row(0) < rows1.row(0));
assert!(rows2.row(1) < rows1.row(0));
assert!(rows1.row(2) < rows2.row(2));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a few more tests;

  1. Round tripping (e.g that converting rows1 to an Array results in an array that is equal to run_array1 -- same for rows2)
  2. Encoding / Decoding REE Arrays that have nulls
  3. Encoding/Decding REE arrays of some other value type (Int64Array for example)
  4. Descending / nulls first/last sort orders

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, there were a lot of problems. Refactoring a bit then pushing the latest changes.

@brancz
Copy link
Contributor Author

brancz commented Jun 13, 2025

@alamb you can ignore all intermediate commits, the last commit pretty much re-writes everything and I think is also way more readable and understandable (and most of all correct).

Please take a look!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @brancz -- I spent some time reviewing and testing this PR and it looks good to me. I have some suggestions on reducing the duplication in tests, but I don't think that is needed (but would appreciate a follow on PR to do it!)

I think we need test coverage for the following cases (I'll make a PR)

  1. Empty arrays
  2. REE arrays with Int16 and Int64 types (that code is not covered I don't think)

let out = &mut data[*offset..];

// Use variable-length encoding to make the data self-describing
let row = rows.row(physical_idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some random performance optimization thoughts (for some future PR):

  1. You could hoist this out of the inner loop so it was executed once per physical value rather than once per logical value
  2. You could potentially encode row once and then simply copy the encoded bytes for all remaining rows. This is probably significantly faster than re-encoding the same value over and over again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.downcast_ref::<RunArray<Int32Type>>()
.unwrap();

assert_eq!(array.run_ends().values(), result.run_ends().values());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it strange that this test didn't just test that array and result were equal

        assert_eq!(array, result);

So I tried it locally, and it seems that it doesn't implement PartialEq

Maybe something we can add in a follow on PR

error[E0369]: binary operation `==` cannot be applied to type `RunArray<arrow_array::types::Int32Type>`
   --> arrow-row/src/run.rs:203:9
    |
203 |         assert_eq!(array, result);
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^
    |         |
    |         RunArray<arrow_array::types::Int32Type>
    |         &RunArray<arrow_array::types::Int32Type>
    |
note: the foreign item type `RunArray<arrow_array::types::Int32Type>` doesn't implement `PartialEq<&RunArray<arrow_array::types::Int32Type>>`
   --> /Users/andrewlamb/Software/arrow-rs/arrow-array/src/array/run_array.rs:63:1
    |
63  | pub struct RunArray<R: RunEndIndexType> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ not implement `PartialEq<&RunArray<arrow_array::types::Int32Type>>`
    = note: this error originates in the macro `assert_eq` (in Nightly builds, run with -Z macro-backtrace for more info)

Copy link
Contributor

@alamb alamb Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


let array: RunArray<Int32Type> = vec!["b", "b", "a"].into_iter().collect();

let converter = RowConverter::new(vec![SortField::new(DataType::RunEndEncoded(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test have a lot of boiler plate. Maybe we could make a function like assert_roundtrip(array: RunArray<..>) that captures the common pattern

Not necessary, just something I noticed while reviewing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.unwrap();

// Convert back to verify both configurations work
let result_test_asc = converter_asc.convert_rows(&rows_test_asc).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think having a "assert_roundtrip" type function would make it clearer what was being tested here and would also make it easier to verify that the values were the same as well)

Comment on lines +196 to +199
let result = arrays[0]
.as_any()
.downcast_ref::<RunArray<Int32Type>>()
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can achieve the same thing with a little less code like this if you want

Suggested change
let result = arrays[0]
.as_any()
.downcast_ref::<RunArray<Int32Type>>()
.unwrap();
let result = arrays[0].as_run::<Int32Type>();

@alamb
Copy link
Contributor

alamb commented Jun 16, 2025

I think we need test coverage for the following cases (I'll make a PR)

Here is a PR:

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

I plan to merge this PR in and then file follow on issues for items found during review

  1. adding partial eq to REE
  2. refactor the round trip tests a bit to deduplicate code
  3. only decode each physical value once

@alamb alamb merged commit 3837ac0 into apache:main Jun 17, 2025
26 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

Thanks again @brancz

@brancz brancz deleted the arrow-row-ree branch June 17, 2025 19:28
alamb added a commit to PinkCrow007/arrow-rs that referenced this pull request Jun 17, 2025
~Draft until apache#7649 is merged~

# Which issue does this PR close?
- Follow on to apache#7649 from @brancz

# Rationale for this change
I noticed some extra testing and docs I would like to see so I made a PR
to add them

# What changes are included in this PR?

1. Add docs + additional tests
# Are there any user-facing changes?
No code changes, only some docs (and more tests)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants