-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Fix nullability of return value of array_agg #11093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix nullability of return value of array_agg #11093
Conversation
| Expr::AggregateFunction(AggregateFunction { func_def, .. }) => { | ||
| match func_def { | ||
| AggregateFunctionDefinition::BuiltIn(fun) => fun.nullable(), | ||
| // TODO: UDF should be able to customize nullability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about whether we need to customize nullability, can we get the nullabiity based on expression in the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need it, we can add field to UDAFImpl, so they can define their own field. state_field is already customizable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about whether we need to customize nullability, can we get the nullabiity based on expression in the function?
I think probably ArrayAgg is the function that we need the nullability be false 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about whether we need to customize nullability, can we get the nullabiity based on expression in the function?
But that will need to get it from the specific UDAF implementation right?
If we need it, we can add field to UDAFImpl, so they can define their own field. state_field is already customizable
That would resolve it.
I think probably ArrayAgg is the function that we need the nullability be
false🤔
There is also count and the the count like e.g approx_distinct etc.
jayzhan211
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how this would look if we simply always made ScalarValue::List be nullable ?
Specifically is there a compelling usecase for specifying between nullable and non nullable lists? I am pretty sure none of the arrow kernels look at the nullability of the data type (they look at the actual null count of the actual data)
If we could make lists always nullable I suspect this PR could be made much simplar (and avoid the breaking API change)
I think it makes sense |
Feels like I might be missing what you mean by this. Because If you mean to make the field inside the list nullable, my understanding is that this is basically the state of the code today? And that makes it not possible/hard to correctly propagate the nullability of expressions like
To clarify it is not the list that this fixes the nullable for, but the elements inside the list.
I think this might be true. But my thinking was that the nullability of columns is leveraged in the optimizer. Here for example is a transformation that uses the nullability: datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs Lines 793 to 800 in c7ac8b8
so by maintaining the correct nullability we will allow for more simplifications.
If we want to avoid breaking the API it could be easily be done by adding new methods instead of changing the old ones. The reason I opted for breaking the API is that I suspect that each place we use the current API is a place where we are currently not correctly propagating the nullability. |
|
I think the datafusion/datafusion/expr/src/expr_schema.rs Lines 317 to 323 in d44c7f2
The |
This is something that is fixed/changed in this PR see: datafusion/datafusion/expr/src/expr_schema.rs Lines 325 to 331 in f4c7da3
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we are able to optimize logical plan based on whether the list (function return type in general) is null or not.
|
Thanks @eejbyfeldt @alamb . I think we could move on with this. Let's create a ticket to address UDAF nullability customization |
Which issue does this PR close?
Closes #11094.
Rationale for this change
Currently the nullability of the value of the return type for the array_agg aggregate is wrong in two ways. Firstly the field inside the returned list is always nullable regardless of the input type and the actual list is only nullable depending on the input value. The correct behavior should be that the field in the list should depend on the input values nullabillity and the list field should not be nullable as the aggregate never returns null (it returns empty list instead).
What changes are included in this PR?
In order to fix the nullability several changes are needed.
For the logical expr it is currently hard coded that all aggregates return nullable values, this is changed to be something that looked up based on the actual functions used. For UDAF we will need to add a new customization point in the future to handle this case.
The return_type method for builtin aggregates got an additional argument
&[bool]with the nullability of the the inputs. Without this is not possible to implement the function with correct nullability or array_agg.Helper methods ScalarValue::new_list and
array_into_list_arraycurrently always makes the field nullable. To make these more usable I added a nullable argument to them and renamed the existing version with a_nullableprefix.Are these changes tested?
Existing tests.
Are there any user-facing changes?
Yes, this will change the nullability of array_aggs. And it changes the interface for some public methods
ScalarValue::new_list,ScalarValue::new_list_from_iterandarray_into_list_arrayto take anullable: boolargument.Are these API breakages acceptable? Or should we leave the existing APIs as is and introduce new methods to fix the issue?