-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat: convert_array_to_scalar_vec respects null elements #17891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: convert_array_to_scalar_vec respects null elements #17891
Conversation
9c6db36 to
b1bb29c
Compare
217287c to
35358d6
Compare
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vegarsti I would love to see if there any performance degradations, you can find benches in the project.
Maybe we can have a separate test for this issues?
Good call, I will do both. Thanks! |
This comment was marked as outdated.
This comment was marked as outdated.
35358d6 to
e232719
Compare
I added two test cases to show the change in this PR. On main these both return an empty vector, but with this PR the nullability is preserved, so in the first one (4) it's |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do plan to take a closer look at this, specifically around understanding the impact to array_agg, nth_value and array_has 👍
I'll also mark this as API change since convert_array_to_scalar_vec is technically part of the public API
| let list = scalar_values | ||
| .into_iter() | ||
| .flatten() | ||
| .flatten() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For here, it might be more correct to do it like so:
if let Ok(scalar_values) =
ScalarValue::convert_array_to_scalar_vec(&array)
{
assert_eq!(scalar_values.len(), 1);
match &scalar_values[0] {
// Haystack was a single list element as expected
Some(list) => {
let list = list
.iter()
.map(|v| Expr::Literal(v.clone(), None))
.collect();
return Ok(ExprSimplifyResult::Simplified(in_list(
std::mem::take(needle),
list,
false,
)));
}
// Haystack was a singular null, should be handled elsewhere
None => {
return Ok(ExprSimplifyResult::Original(args));
}
};
}- Instead of flattening
Noneinto an empty vec, we handle it separately.
Example of an accompanying test:
#[test]
fn test123() {
let haystack = ListArray::new_null(
Arc::new(Field::new_list_field(DataType::Int32, true)),
1,
);
let haystack = lit(ScalarValue::List(Arc::new(haystack)));
let needle = col("c");
let props = ExecutionProps::new();
let context = datafusion_expr::simplify::SimplifyContext::new(&props);
let Ok(ExprSimplifyResult::Original(args)) =
ArrayHas::new().simplify(vec![haystack.clone(), needle.clone()], &context)
else {
panic!("Expected non-simplified expression");
};
assert_eq!(args, vec![haystack, col("c")]);
}We can see it's essentially like doing array_has(null, 1) and handling this case in the simplifier; we don't try to simplify it as I don't think there's a way to pass the null as the haystack to in_list().
However, it's likely some other code path handles this null propagation already so that's why tests didn't fail. I prefer the robust (albeit more verbose) approach here (to me to reads easier), but we can stick with what we have here (preferably calling it out via comments?) either way.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, very good. I agree that the robust approach is better! Thanks for this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Included both - thanks a lot. I didn't change anything because this felt very clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like performance took a hit after introducing this 😭 #17891 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To get the performance back, I reverted back to adding a second flatten() call but with some comments (and the test is there and still passes), in d9aa726. Does that sound good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too bothered by this honestly, mainly because its not really on a hot path (I think) and not sure how often people might look at this part of code anyway; but feel free to take a stab at refactoring if you feel like it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, are you saying you don't think it's worth adding your suggested fix because it's code people don't look at much, or that it's fine to add it even if it is slightly less performant because it's not in the hot path? 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to add this ^ with the test in a patch 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry I wasn't clear; I meant the former, since I don't think the benchmarks actually tested that bit of code (coupled with your remark about benchmarks being a bit noisy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it!
e232719 to
b51c66f
Compare
This comment was marked as outdated.
This comment was marked as outdated.
|
Reverted back to using |
b022f1e to
bf60995
Compare
bf60995 to
633ab64
Compare
|
Thanks @vegarsti |
Which issue does this PR close?
ScalarValue::convert_array_to_scalar_vecdoesn't respect null list elements #17749.What changes are included in this PR?
Makes
convert_array_to_scalar_vecreturnNoneinstead of the empty list when list elements are null.Are these changes tested?
Covered by existing tests and adds new test cases.
Are there any user-facing changes?
Change to return type of public function
ScalarValue::convert_array_to_scalar_vec()