-
Couldn't load subscription status.
- Fork 1.7k
feat: support LargeList in make_array and array_length
#8121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| fn return_large_array() -> ArrayRef { | ||
| // Returns: [1, 2, 3, 4] | ||
| let capacity = i32::MAX as usize + 10; | ||
| let args = vec![Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef; capacity]; | ||
|
|
||
| println!("args.len() = {}", args.len()); | ||
|
|
||
| make_array(&args).expect("failed to initialize function array") | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to efficiently create a LargeList using make_array as it runs out of memory on my end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you should use below constructor to construct LargeList
let list_array = Arc::new(LargeListArray::from_iter_primitive::<Int32Type, _, _>(
vec![
Some(vec![Some(0), Some(1), Some(2)]),
None,
Some(vec![Some(3), None, Some(5)]),
],
)) as ArrayRef91612a4 to
b5f9e68
Compare
| fn return_large_array() -> ArrayRef { | ||
| // Returns: [1, 2, 3, 4] | ||
| let args = [ | ||
| Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef, | ||
| Arc::new(Int64Array::from(vec![Some(2)])) as ArrayRef, | ||
| Arc::new(Int64Array::from(vec![Some(3)])) as ArrayRef, | ||
| Arc::new(Int64Array::from(vec![Some(4)])) as ArrayRef, | ||
| ]; | ||
| let data_type = DataType::Int64; | ||
| array_array::<i64>(&args, data_type).expect("failed to initialize function array") | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hack here to create a LargeList to avoid spending too much memory for testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the hack?
f7bdb8f to
a06fbf8
Compare
a06fbf8 to
d340ff7
Compare
|
@alamb @jayzhan211 @Veeupup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @Weijun-H -- the basic structure of most of this PR looks good to me -- thank you. I left some comments, but I think it is pretty close.
Perhaps @jayzhan211 has some additional thoughts to share
datafusion/common/src/scalar.rs
Outdated
| Self::iter_to_array(values.iter().cloned()).unwrap() | ||
| }; | ||
| Arc::new(array_into_list_array(values)) | ||
| if values.len() <= i32::MAX as usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the ScalarValue type needs to match the underyling array type -- so with this change, a ScalarValue::List might return LargeListArray or ListArray. This mismatch will likely cause issues downstream
I think the standard pattern would be to add a new function ScalarValue::new_large_list and ScalarValue::LargeList if they don't already exist
| fn return_large_array() -> ArrayRef { | ||
| // Returns: [1, 2, 3, 4] | ||
| let args = [ | ||
| Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef, | ||
| Arc::new(Int64Array::from(vec![Some(2)])) as ArrayRef, | ||
| Arc::new(Int64Array::from(vec![Some(3)])) as ArrayRef, | ||
| Arc::new(Int64Array::from(vec![Some(4)])) as ArrayRef, | ||
| ]; | ||
| let data_type = DataType::Int64; | ||
| array_array::<i64>(&args, data_type).expect("failed to initialize function array") | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the hack?
| as_uint64_array(&arr).expect("failed to initialize function array_length"); | ||
|
|
||
| assert_eq!(result, &UInt64Array::from(vec![None])); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than extending this module with more rust tests, can you perhaps add SQL level tests using sqllogictest?
That will ensure the functions are usable end to end and that all the connections are in place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to create an large list in sql file. I think it's why he done the test here? @Weijun-H
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I do not know how we could construct the LargeList in sqllogictest, instead of an array with more i32:MAX by make_array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current idea is to use arrow_cast() to create a LargeList instead of make_array. This pr is stalled by #8290 @alamb @jayzhan211
| offsets.push(0); | ||
|
|
||
| let mut offsets: Vec<O> = Vec::with_capacity(total_len); | ||
| offsets.push(O::from_usize(0).unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usize_as
| Ok(Arc::new(GenericListArray::<O>::try_new( | ||
| Arc::new(Field::new("item", data_type, true)), | ||
| OffsetBuffer::new(offsets.into()), | ||
| OffsetBuffer::new(ScalarBuffer::from(offsets)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need this for usize_as
| match &args[0].data_type() { | ||
| DataType::List(_) => _array_length_list::<i32>(args), | ||
| DataType::LargeList(_) => _array_length_list::<i64>(args), | ||
| _ => Err(DataFusionError::Internal(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolved
| } | ||
|
|
||
| /// array_length for List and LargeList | ||
| fn _array_length_list<O: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any other name without _? In rust _xxx usually means unused.
818685c to
6839aaa
Compare
| 3 2 | ||
|
|
||
| query II | ||
| select array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 1), array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can not test LargeList(LargeList) because arrow-rs does not support it yet. #8305
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be possible to add a test that tries to cast LargeList(List) as a follow on PR? It would, of course, error initially, but then when we upgraded arrow to a version that did support that cast, we would have the test coverage
6839aaa to
5d5d0da
Compare
| } | ||
| } | ||
| offsets.push(mutable.len() as i32); | ||
| offsets.push(O::from_usize(mutable.len()).unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we dont need unwrap if casting via usize_as
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM
Thank you for reviewing 👍 @jayzhan211 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
0e58ca8 to
5d678a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @Weijun-H -- I think it is very close
| DataType::Null => { | ||
| let array = new_null_array(&DataType::Null, arrays.len()); | ||
| Ok(Arc::new(array_into_list_array(array))) | ||
| if len <= i32::MAX as usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think that make_array should ever return NullArray (aka DataType::Null) and instead it should always return ListArray (possibly with a null value)
However, this PR doesn't make this situation worse, so 👍
array_length
95a8396 to
05e365a
Compare
0b0e475 to
6caf58f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @Weijun-H
| 3 2 | ||
|
|
||
| query II | ||
| select array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 1), array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be possible to add a test that tries to cast LargeList(List) as a follow on PR? It would, of course, error initially, but then when we upgraded arrow to a version that did support that cast, we would have the test coverage
|
THanks for sticking with these PRs -- I know it has taken time, but now that we have the patterns down I feel like the code is really improving at a nice rate. This is what I think successful software development and incremental improvement looks like! |
…8121) * feat: support LargeList in make_array and array_length * chore: add tests * fix: update tests for nested array * use usise_as * add new_large_list * refactor array_length * add comment * update test in sqllogictest * fix ci * fix macro * use usize_as * update comment * return based on data_type in make_array
Which issue does this PR close?
Parts #8084
Parts #8185
Rationale for this change
The current implementation of the
make_arrayandarray_lengthfunctions lack support forLargeList. Therefore, we need to enhance these functions by incorporating theOffsizeTraitto enable them to work seamlessly withLargeList. This will ensure that the functions can handle a large number of elements in the list, thus improving their overall efficiency.What changes are included in this PR?
make_arraycan beListandLargeListdepending on the array lengthLargeListinarray_lenghtarray_arrayandarray!can considerOffsizeTraitAre these changes tested?
Are there any user-facing changes?