Skip to content

Conversation

@eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Jun 24, 2024

Which issue does this PR close?

Closes #11094.

Rationale for this change

Currently the nullability of the value of the return type for the array_agg aggregate is wrong in two ways. Firstly the field inside the returned list is always nullable regardless of the input type and the actual list is only nullable depending on the input value. The correct behavior should be that the field in the list should depend on the input values nullabillity and the list field should not be nullable as the aggregate never returns null (it returns empty list instead).

What changes are included in this PR?

In order to fix the nullability several changes are needed.

  1. For the logical expr it is currently hard coded that all aggregates return nullable values, this is changed to be something that looked up based on the actual functions used. For UDAF we will need to add a new customization point in the future to handle this case.

  2. The return_type method for builtin aggregates got an additional argument &[bool] with the nullability of the the inputs. Without this is not possible to implement the function with correct nullability or array_agg.

  3. Helper methods ScalarValue::new_list and array_into_list_array currently always makes the field nullable. To make these more usable I added a nullable argument to them and renamed the existing version with a _nullable prefix.

Are these changes tested?

Existing tests.

Are there any user-facing changes?

Yes, this will change the nullability of array_aggs. And it changes the interface for some public methods ScalarValue::new_list, ScalarValue::new_list_from_iter and array_into_list_array to take a nullable: bool argument.

Are these API breakages acceptable? Or should we leave the existing APIs as is and introduce new methods to fix the issue?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates core Core DataFusion crate substrait Changes to the substrait crate labels Jun 24, 2024
@eejbyfeldt eejbyfeldt marked this pull request as ready for review June 24, 2024 08:14
Expr::AggregateFunction(AggregateFunction { func_def, .. }) => {
match func_def {
AggregateFunctionDefinition::BuiltIn(fun) => fun.nullable(),
// TODO: UDF should be able to customize nullability
Copy link
Contributor

@jayzhan211 jayzhan211 Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about whether we need to customize nullability, can we get the nullabiity based on expression in the function?

Copy link
Contributor

@jayzhan211 jayzhan211 Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need it, we can add field to UDAFImpl, so they can define their own field. state_field is already customizable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about whether we need to customize nullability, can we get the nullabiity based on expression in the function?

I think probably ArrayAgg is the function that we need the nullability be false 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about whether we need to customize nullability, can we get the nullabiity based on expression in the function?

But that will need to get it from the specific UDAF implementation right?

If we need it, we can add field to UDAFImpl, so they can define their own field. state_field is already customizable

That would resolve it.

I think probably ArrayAgg is the function that we need the nullability be false 🤔

There is also count and the the count like e.g approx_distinct etc.

jayzhan211
jayzhan211 previously approved these changes Jun 24, 2024
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how this would look if we simply always made ScalarValue::List be nullable ?

Specifically is there a compelling usecase for specifying between nullable and non nullable lists? I am pretty sure none of the arrow kernels look at the nullability of the data type (they look at the actual null count of the actual data)

If we could make lists always nullable I suspect this PR could be made much simplar (and avoid the breaking API change)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 25, 2024

I wonder how this would look if we simply always made ScalarValue::List be nullable ?

Specifically is there a compelling usecase for specifying between nullable and non nullable lists? I am pretty sure none of the arrow kernels look at the nullability of the data type (they look at the actual null count of the actual data)

If we could make lists always nullable I suspect this PR could be made much simplar (and avoid the breaking API change)

I think it makes sense

@eejbyfeldt
Copy link
Contributor Author

eejbyfeldt commented Jun 25, 2024

I wonder how this would look if we simply always made ScalarValue::List be nullable ?

Feels like I might be missing what you mean by this. Because If you mean to make the field inside the list nullable, my understanding is that this is basically the state of the code today? And that makes it not possible/hard to correctly propagate the nullability of expressions like array_agg.

Specifically is there a compelling usecase for specifying between nullable and non nullable lists?

To clarify it is not the list that this fixes the nullable for, but the elements inside the list.

I am pretty sure none of the arrow kernels look at the nullability of the data type (they look at the actual null count of the actual data)

I think this might be true. But my thinking was that the nullability of columns is leveraged in the optimizer. Here for example is a transformation that uses the nullability:

// A OR !A ---> true (if A not nullable)
Expr::BinaryExpr(BinaryExpr {
left,
op: Or,
right,
}) if is_not_of(&right, &left) && !info.nullable(&left)? => {
Transformed::yes(Expr::Literal(ScalarValue::Boolean(Some(true))))
}

so by maintaining the correct nullability we will allow for more simplifications.

If we could make lists always nullable I suspect this PR could be made much simplar (and avoid the breaking API change)

If we want to avoid breaking the API it could be easily be done by adding new methods instead of changing the old ones. The reason I opted for breaking the API is that I suspect that each place we use the current API is a place where we are currently not correctly propagating the nullability.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 25, 2024

I think the nullability we get in the logical optimize rule you mentioned is actually always true for AggregateFuncion

Expr::ScalarVariable(_, _)
| Expr::TryCast { .. }
| Expr::ScalarFunction(..)
| Expr::WindowFunction { .. }
| Expr::AggregateFunction { .. }
| Expr::Unnest(_)
| Expr::Placeholder(_) => Ok(true),

The field we changed is in the physical plan layer. The tradeoff is only losing the possibility of physical plan optimization and maybe arrow kernel 🤔 .

@jayzhan211 jayzhan211 dismissed their stale review June 25, 2024 09:53

Need for discussion

@eejbyfeldt
Copy link
Contributor Author

I think the nullability we get in the logical optimize rule you mentioned is actually always true for AggregateFuncion

Expr::ScalarVariable(_, _)
| Expr::TryCast { .. }
| Expr::ScalarFunction(..)
| Expr::WindowFunction { .. }
| Expr::AggregateFunction { .. }
| Expr::Unnest(_)
| Expr::Placeholder(_) => Ok(true),

This is something that is fixed/changed in this PR see:

Expr::AggregateFunction(AggregateFunction { func_def, .. }) => {
match func_def {
AggregateFunctionDefinition::BuiltIn(fun) => fun.nullable(),
// TODO: UDF should be able to customize nullability
AggregateFunctionDefinition::UDF(_) => Ok(true),
}
}

@eejbyfeldt eejbyfeldt requested review from alamb and jayzhan211 June 25, 2024 13:02
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we are able to optimize logical plan based on whether the list (function return type in general) is null or not.

@jayzhan211
Copy link
Contributor

Thanks @eejbyfeldt @alamb . I think we could move on with this.

Let's create a ticket to address UDAF nullability customization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Nullability of array_agg

3 participants