-
Notifications
You must be signed in to change notification settings - Fork 1.1k
arrow-row: Add support for REE #7649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @brancz -- this is looking like a great start.
I think the only thing this PR needs prior to merge is
- More tests (I listed a bunch of suggestions)
- Documentation of the format used for REE arrays along side the documentation for other types (e.g. here)
I left some other suggestions but I don't think they are needed in this PR (we can do them in a follow on PR or never)
|
|
||
| mod fixed; | ||
| mod list; | ||
| mod run; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked and run is consistent with the naming of REEArray elsewhere in the crate 👍
| Ok(Self::Dictionary(converter, owned)) | ||
| } | ||
| DataType::RunEndEncoded(_, values) => { | ||
| // Similar to List implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can pull the transformation into a documented helper function (not needed, I just was confused for a bit until I read the comments in the List/LargeList implementation
arrow-row/src/run_test.rs
Outdated
| @@ -0,0 +1,55 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is more standard in this repo to put unit tests like this in the same module (aka I would expect this to be in arrow-row/src/run.rs
Is there any reason it is in a different module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no good reason, moved them into run.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 4f9b8f3
|
|
||
| /// Encodes the provided `RunEndEncodedArray` to `out` with the provided `SortOptions` | ||
| /// | ||
| /// `rows` should contain the encoded values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be really helpful if the format that this was creating was documented somewhere (so I don't have to reverse engineer it from the code to try and double check it)
arrow-row/src/run.rs
Outdated
| array: &RunArray<R>, | ||
| ) { | ||
| for (idx, offset) in offsets.iter_mut().skip(1).enumerate() { | ||
| let physical_idx = array.get_physical_index(idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is effectively going to deocde the REE array (though I don't really see any way around that)
I do think you could make it more efficient by iterating over each run and then copying the value in a loop (rather than this which does a binary search on the run ends for each idx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, I don't know why I wasn't thinking of that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out it's actually also more readable that way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 9c61fe9
arrow-row/src/run.rs
Outdated
| let opts = field.options; | ||
|
|
||
| // Track null values and collect row data to avoid borrow issues | ||
| let mut valid_flags = Vec::with_capacity(rows.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could use BooleanBufferBuilder here which might be more efficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in e39428f
arrow-row/src/run.rs
Outdated
| } | ||
|
|
||
| // Convert collected values to arrays | ||
| let mut values_rows = values_data.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this clone necesssary? It seems like values_data isn't used afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Walking borrow-checker, you're right!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 895598e
arrow-row/src/run.rs
Outdated
| // Get the count of elements before we move the vector | ||
| let element_count = run_ends.len(); | ||
| let buffer = Buffer::from_vec(run_ends); | ||
| let run_ends_array = arrow_array::PrimitiveArray::<R>::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could just do PrimitiveArray::<R>::from(run_ends (which will internally do the buffer dance you are doing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I tried something like that, but then couldn't because of type foo. Let me try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm still missing something but I managed to still simplify it by using ScalarBuffer::from instead of the raw buffer handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 3ad1e91
arrow-row/src/run.rs
Outdated
| ); | ||
|
|
||
| // Update rows to consume the data we've processed | ||
| for i in 0..rows.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the consumption code, but it seems consistent with the ListArray implemenation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In hindsight it made no sense whatsoever, I shouldn't rely on my >1 month old code being correct anymore 🙃
arrow-row/src/run_test.rs
Outdated
| assert!(rows2.row(0) < rows1.row(0)); | ||
| assert!(rows2.row(1) < rows1.row(0)); | ||
| assert!(rows1.row(2) < rows2.row(2)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a few more tests;
- Round tripping (e.g that converting rows1 to an
Arrayresults in an array that is equal torun_array1-- same forrows2) - Encoding / Decoding REE Arrays that have nulls
- Encoding/Decding REE arrays of some other value type (Int64Array for example)
- Descending / nulls first/last sort orders
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, there were a lot of problems. Refactoring a bit then pushing the latest changes.
|
@alamb you can ignore all intermediate commits, the last commit pretty much re-writes everything and I think is also way more readable and understandable (and most of all correct). Please take a look! |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @brancz -- I spent some time reviewing and testing this PR and it looks good to me. I have some suggestions on reducing the duplication in tests, but I don't think that is needed (but would appreciate a follow on PR to do it!)
I think we need test coverage for the following cases (I'll make a PR)
- Empty arrays
- REE arrays with Int16 and Int64 types (that code is not covered I don't think)
| let out = &mut data[*offset..]; | ||
|
|
||
| // Use variable-length encoding to make the data self-describing | ||
| let row = rows.row(physical_idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some random performance optimization thoughts (for some future PR):
- You could hoist this out of the inner loop so it was executed once per physical value rather than once per logical value
- You could potentially encode row once and then simply copy the encoded bytes for all remaining rows. This is probably significantly faster than re-encoding the same value over and over again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracking with
| .downcast_ref::<RunArray<Int32Type>>() | ||
| .unwrap(); | ||
|
|
||
| assert_eq!(array.run_ends().values(), result.run_ends().values()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it strange that this test didn't just test that array and result were equal
assert_eq!(array, result);So I tried it locally, and it seems that it doesn't implement PartialEq
Maybe something we can add in a follow on PR
error[E0369]: binary operation `==` cannot be applied to type `RunArray<arrow_array::types::Int32Type>`
--> arrow-row/src/run.rs:203:9
|
203 | assert_eq!(array, result);
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| |
| RunArray<arrow_array::types::Int32Type>
| &RunArray<arrow_array::types::Int32Type>
|
note: the foreign item type `RunArray<arrow_array::types::Int32Type>` doesn't implement `PartialEq<&RunArray<arrow_array::types::Int32Type>>`
--> /Users/andrewlamb/Software/arrow-rs/arrow-array/src/array/run_array.rs:63:1
|
63 | pub struct RunArray<R: RunEndIndexType> {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ not implement `PartialEq<&RunArray<arrow_array::types::Int32Type>>`
= note: this error originates in the macro `assert_eq` (in Nightly builds, run with -Z macro-backtrace for more info)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| let array: RunArray<Int32Type> = vec!["b", "b", "a"].into_iter().collect(); | ||
|
|
||
| let converter = RowConverter::new(vec![SortField::new(DataType::RunEndEncoded( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test have a lot of boiler plate. Maybe we could make a function like assert_roundtrip(array: RunArray<..>) that captures the common pattern
Not necessary, just something I noticed while reviewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .unwrap(); | ||
|
|
||
| // Convert back to verify both configurations work | ||
| let result_test_asc = converter_asc.convert_rows(&rows_test_asc).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think having a "assert_roundtrip" type function would make it clearer what was being tested here and would also make it easier to verify that the values were the same as well)
| let result = arrays[0] | ||
| .as_any() | ||
| .downcast_ref::<RunArray<Int32Type>>() | ||
| .unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can achieve the same thing with a little less code like this if you want
| let result = arrays[0] | |
| .as_any() | |
| .downcast_ref::<RunArray<Int32Type>>() | |
| .unwrap(); | |
| let result = arrays[0].as_run::<Int32Type>(); |
Here is a PR: |
|
I plan to merge this PR in and then file follow on issues for items found during review
|
|
Thanks again @brancz |
~Draft until apache#7649 is merged~ # Which issue does this PR close? - Follow on to apache#7649 from @brancz # Rationale for this change I noticed some extra testing and docs I would like to see so I made a PR to add them # What changes are included in this PR? 1. Add docs + additional tests # Are there any user-facing changes? No code changes, only some docs (and more tests)
Which issue does this PR close?
Part 2 of apache/datafusion#16011
Are there any user-facing changes?
No user facing changes, just extending functionality of existing APIs to support extracting rows from REE arrays.
@alamb