@shehabgamin shehabgamin commented Jan 24, 2025

Which issue does this PR close?

Closes #14230

Rationale for this change

A bug was introduced in DataFusion v43.0.0 that affects type coercion for UDF arguments. Sail's tests uncovered several of these regressions, which required explicit casting in multiple areas as a workaround during the upgrade to DataFusion 43.0.0.

The regressions identified by Sail's tests include the following functions:

  1. ascii
  2. bit_length
  3. contains
  4. ends_with
  5. starts_with
  6. octet_length

Upon digging into the code, I discovered the following:

  1. The above functions didn't have Signature::coercible.
  2. Signature::coercible was incomplete. Coercion would only happen if logical_type == target_type, logical_type == NativeType::Null, or target_type.is_integer() && logical_type.is_integer().
    1. It seems like this was unintentional because the doc comments specified behavior that was not consistent with the logic implemented:

      For example, Coercible(vec![logical_float64()]) accepts arguments like vec![Int32] or vec![Float32] since i32 and f32 can be cast to f64.
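To make the mismatch concrete, here is a minimal, self-contained sketch of the check as described in item 2. The `Native` enum and the function name are simplified stand-ins invented for illustration; they are not DataFusion's actual `NativeType` or coercion code.

```rust
/// Simplified stand-in for a logical/native type (illustrative only).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Native {
    Null,
    Int32,
    Int64,
    Float32,
    Float64,
    String,
}

impl Native {
    fn is_integer(self) -> bool {
        matches!(self, Native::Int32 | Native::Int64)
    }
}

/// The rule as described above: coercion is attempted only for an exact
/// match, a Null argument, or an integer-to-integer pair.
fn coerces_under_described_rule(logical: Native, target: Native) -> bool {
    logical == target
        || logical == Native::Null
        || (target.is_integer() && logical.is_integer())
}
```

Under this rule, `Int32 -> Float64` and `Float32 -> Float64` are both rejected, even though the doc comment says `Coercible(vec![logical_float64()])` should accept them.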

What changes are included in this PR?

  1. Update functions from the list above to have Signature::user_defined

Are these changes tested?

Yes.

Are there any user-facing changes?

@github-actions bot added the logical-expr (Logical plan and expressions), physical-expr (Changes to the physical-expr crates), and functions (Changes to functions implementation) labels Jan 24, 2025
signature: Signature::one_of(
    vec![
        TypeSignature::String(1),
        TypeSignature::Coercible(vec![TypeSignatureClass::Native(

Contributor

If we use coercible(string), we don't need string as well, since it is a stricter rule.

Contributor Author

That's what I initially did, but after testing on Sail, I discovered new test failures related to coercing input that's all String (e.g. func(Utf8, Utf8View)).

The plan is to port all the relevant tests from Sail into this PR!

Contributor Author

It was not my intention to apply this pattern to single-argument functions, though. I'll get that fixed!

Contributor

@jayzhan211 jayzhan211 Jan 24, 2025

The current design for coercion may still have room for improvement. It would be beneficial to represent the function signature in a simpler and more concise manner, rather than relying on complex combinations of multiple, similar signatures.

Contributor Author

Agreed! I'll add that in.

@shehabgamin shehabgamin changed the title from "Fix DF 43 Regression: Coerce Various Scalar Func Args to String" to "Fix DF 43 Coercion Regression: Improve TypeSignature::Coercible" Jan 25, 2025
@github-actions bot added the common (Related to common crate) label Jan 25, 2025
@shehabgamin shehabgamin changed the title from "Fix DF 43 Coercion Regression: Improve TypeSignature::Coercible" to "Fix DF 43 Type Coercion Regression: Improve TypeSignature::Coercible" Jan 25, 2025
@shehabgamin shehabgamin changed the title from "Fix DF 43 Type Coercion Regression: Improve TypeSignature::Coercible" to "Fix Type Coercion for UDF Arguments" Jan 25, 2025
@github-actions bot added the optimizer (Optimizer rules) label Jan 25, 2025
}

#[test]
fn test_ascii_expr() -> Result<()> {

Contributor Author

I would have preferred to place the various UDF tests within their respective files, but I couldn't due to circular dependencies.

Contributor Author

Ended up putting the tests in the .slt file, but figured we can still leave this test here.

Contributor

We can remove this test if its purpose is already covered in the .slt file.

@shehabgamin shehabgamin changed the base branch from main to branch-45 February 4, 2025 04:36

shehabgamin commented Feb 4, 2025

> @shehabgamin #14440 I come out a flexible version of Signature::CoercibleV2 (temporary name), it can replace Signature::Coercible, Signature::String, Signature::Numeric, Signature::Exact, Signature::Uniform and Signature::Comparable.
>
> The most difference is that the implicit coercion is now defined by the user. This way any change for datafusion doesn't impact downstream projects unlike the existing signature like Signature::String or Signature::Coercible.
>
> I probably don't have time to push it forward in the recent days, if you are interested in it you can work on it.

I think it makes sense to work on this for DataFusion 46!

> We can implement first version for the functions you mentioned, I believe the change makes more sense

@jayzhan211 If I am understanding you correctly, Done! I added AnyNative and applied it to the functions mentioned.

CC @alamb @Omega359


jayzhan211 commented Feb 4, 2025

> > @shehabgamin #14440 I come out a flexible version of Signature::CoercibleV2 (temporary name), it can replace Signature::Coercible, Signature::String, Signature::Numeric, Signature::Exact, Signature::Uniform and Signature::Comparable.
> > The most difference is that the implicit coercion is now defined by the user. This way any change for datafusion doesn't impact downstream projects unlike the existing signature like Signature::String or Signature::Coercible.
> > I probably don't have time to push it forward in the recent days, if you are interested in it you can work on it.
>
> I think it makes sense to work on this for DataFusion 46!
>
> > We can implement first version for the functions you mentioned, I believe the change makes more sense
>
> @jayzhan211 If I am understanding you correctly, Done! I added AnyNative and applied it to the functions mentioned.
>
> CC @alamb @Omega359

I don't quite understand why we are adding more TypeSignatureClass here 😕

As we have discussed, we should avoid using old Coercible signature and also the TypeSignatureClass that is used in Coercible

@shehabgamin
Contributor Author

> > I may add a new TypeSignatureClass in this pr to expose default_cast_for so that it's accessible by downstream users
>
> Not sure do we need this 🤔 since we might not need this in the future, but if this helps downstream project then it is fine to me to add it for now and remove it in the future

> We can implement first version for the functions you mentioned, I believe the change makes more sense

> I don't quite understand why we are adding more TypeSignatureClass here 😕

@jayzhan211 What did I miss here?


jayzhan211 commented Feb 4, 2025

> @jayzhan211 What did I miss here?

As we have discussed, we should avoid using the old Coercible signature and also the TypeSignatureClass that is used in Coercible, because any change might impact downstream projects. Adding a new TypeSignatureClass like you did will not, but adding more things that will soon be removed is not a good idea.

> We can implement first version for the functions you mentioned, I believe the change makes more sense

I think we can work directly on the CoercibleV2 I mentioned for these functions:

ascii
bit_length
contains
ends_with
starts_with
octet_length

@shehabgamin
Contributor Author

> As we have discussed, we should avoid using the old Coercible signature and also the TypeSignatureClass that is used in Coercible, because any change might impact downstream projects. Adding a new TypeSignatureClass like you did will not, but adding more things that will soon be removed is not a good idea.

@jayzhan211 I thought you said it was okay to add a new signature if it helps downstream projects. See here:
#14268 (comment)

I reverted Native back to the original logic before this PR and then I added AnyNative.

> I think we can work directly on the CoercibleV2 I mentioned for these functions

IMO CoercibleV2 would be out of scope for this PR. It looks like @alamb would like to get this PR sorted out if possible for 45 (see here: #14008 (comment))

If I am understanding you correctly, you are okay with adding a new signature but not applying them to the UDFs in this PR? Should I apply User-defined coercion as you were mentioning earlier?


jayzhan211 commented Feb 4, 2025

> if it helps downstream projects

This is the point: I don't quite understand why adding a signature for the default_cast_for logic helps. I thought you wanted to solve the issues for these DataFusion functions, so I don't know what other functions you want to fix that are only possible if we have AnyNative.

Even if there is a case where we really need AnyNative, it seems strange to me to add it to TypeSignatureClass; it should be a new TypeSignature instead.

> IMO CoercibleV2 would be out of scope for this PR

I think this is the solution to the issue we have. We need a Coercible-like signature, but without fixed logic exposed to the user, since that makes any change to it a breaking change. CoercibleV2 does not have that problem.

@shehabgamin First of all, what are the issues we are solving? Are these functions in DataFusion?

ascii
bit_length
contains
ends_with
starts_with
octet_length

Are there other issues? Why does AnyNative help?


jayzhan211 commented Feb 4, 2025

Btw, if your proposed signature isn't used by any functions in DataFusion, you should use TypeSignature::UserDefined; it is meant for any custom logic downstream.


shehabgamin commented Feb 4, 2025

> This is the point: I don't quite understand why adding a signature for the default_cast_for logic helps. I thought you wanted to solve the issues for these DataFusion functions, so I don't know what other functions you want to fix that are only possible if we have AnyNative.

> @shehabgamin First of all, what are the issues we are solving? Are these functions in DataFusion?
>
> ascii
> bit_length
> contains
> ends_with
> starts_with
> octet_length
>
> Are there other issues? Why does AnyNative help?

@jayzhan211 The regression is that all these functions before DataFusion 43 would coerce:

  1. Binary -> String
  2. Int -> String

But now they no longer do.
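As a rough illustration of the pre-43 behavior described above, here is a self-contained sketch of a user-defined coercion step for a string UDF. `ArgType` and `coerce_args_to_string` are hypothetical names for this example, not DataFusion types or APIs, and rejecting every other type is an assumption made to keep the sketch small.

```rust
/// Simplified stand-in for an argument's data type (illustrative only).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum ArgType {
    Int32,
    Int64,
    Binary,
    Utf8,
    Utf8View,
    Float64,
}

/// Maps each argument type to the type the function body would receive.
/// String types pass through unchanged; Binary and integer types become Utf8.
fn coerce_args_to_string(arg_types: &[ArgType]) -> Result<Vec<ArgType>, String> {
    arg_types
        .iter()
        .map(|t| match t {
            ArgType::Utf8 | ArgType::Utf8View => Ok(t.clone()),
            ArgType::Binary | ArgType::Int32 | ArgType::Int64 => Ok(ArgType::Utf8),
            // Assumption: anything else is rejected in this sketch.
            other => Err(format!("cannot coerce {other:?} to a string type")),
        })
        .collect()
}
```

With this shape, a call such as `func(Binary, Int64)` resolves to `func(Utf8, Utf8)`, while an all-string call such as `func(Utf8, Utf8View)` is left alone.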

I am trying to find some middle ground here. I am happy to implement any signature for the functions in this PR, but currently default_cast_for is inaccessible to downstream users, and adding AnyNative makes it accessible. Sail implements hundreds of custom UDFs, so it would be nice to have access to default_cast_for. We're not the only downstream project that has a need for flexible coercion like this (see here: #14230 (comment)).

> Even if there is a case where we really need AnyNative, it seems strange to me to add it to TypeSignatureClass; it should be a new TypeSignature instead.

I'm not sure that AnyNative should be a new TypeSignature because TypeSignature::Coercible most accurately describes the behavior.

> Btw, if your proposed signature isn't used by any functions in DataFusion, you should use TypeSignature::UserDefined; it is meant for any custom logic downstream.

The point is that default_cast_for is not accessible downstream. Should we just make default_cast_for a public function?


jayzhan211 commented Feb 4, 2025

Making default_cast_for or any casting logic publicly accessible introduces the risk of additional regressions with every change. I believe it was one of our worst decisions, and I hope we can avoid repeating it in the future. 🙏🏻

> Sail implements hundreds of custom UDFs

I guess this is the real issue: how do we find a solution that makes those UDFs easier to handle?

I don't have the best solution in mind right now, but here are the options and the reason against each:

  1. The reason to not change TypeSignature::Coercible is to avoid another regression for others that use the existing logic.

  2. The reason to not introduce TypeSignature::CoercibleV2 is because it requires tons of change for 100+ UDFs

  3. The reason to not make default_cast_for public is what I mentioned above.

@shehabgamin Do you think we can write some utility functions to make migrating to TypeSignature::CoercibleV2 easier?


shehabgamin commented Feb 4, 2025

@jayzhan211 To keep it simple I'll just remove AnyNative and use coerce_types so we don't block this PR any longer. We can have a larger discussion and align on goals afterwards!

CC @alamb @Omega359

@shehabgamin
Contributor Author

> @jayzhan211 To keep it simple I'll just remove AnyNative and use coerce_types so we don't block this PR any longer. We can have a larger discussion and align on goals afterwards!
>
> CC @alamb @Omega359

Done, this should be good to merge now. AnyNative was removed, Native was reverted back to original logic, and for the 6 UDFs in the PR Signature::user_defined was used.

jayzhan211 previously approved these changes Feb 5, 2025

findepi commented Feb 5, 2025

> I fully recognize this creates behavior that diverges from PostgreSQL/DuckDB semantics for the various UDFs in this PR. However, there’s a critical distinction: System contracts vs. SQL dialect behavior.
>
> [...]
>
> The current approach keeps the door open for both strict and permissive use cases through explicit signature choices. With these changes, users can still achieve PostgreSQL/DuckDB behavior by defining UDFs with existing signatures while still leveraging broader coercion where appropriate.

The function coercions are not enough for building a tailored system. Relational operators also do coercions (the set operators: union, intersect, except).

In any case, for certain system designs (those that take on the responsibility of implementing their particular SQL dialect behavior before handing control over to DF core) it's desirable to opt out of coercion logic entirely. @shehabgamin @linhr Given the Sail design, you might be interested in #12723.

let arg_type = &arg_types[0];
let current_native_type: NativeType = arg_type.into();
let target_native_type = NativeType::String;
if current_native_type.is_integer()

Contributor

@jayzhan211 jayzhan211 Feb 5, 2025

Should we support integers? That is not consistent with Postgres/DuckDB.

Contributor Author

Up to you and @alamb

Contributor

Let's remove it

Contributor Author

@jayzhan211 For sure, let's make sure @alamb is okay with this too before I go ahead and make the change.

let arg_type = &arg_types[0];
let current_native_type: NativeType = arg_type.into();
let target_native_type = NativeType::String;
if current_native_type.is_integer()

Contributor

Same with this.

// Numeric
// Integer
Numeric(LogicalTypeRef),
Integer(LogicalTypeRef),

Contributor

@shehabgamin I found that we might not need LogicalTypeRef.
This is designed to accept all integer types, so if the given type is an integer, we keep it as it is.

If we want a specific integer type, then we should use Native instead. Does this make sense to you?

Numeric is the same
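
The distinction suggested here can be sketched in a few lines. The `Ty` and `Class` enums and the `resolve` function below are hypothetical, simplified stand-ins rather than the actual `TypeSignatureClass` definition, and the integer-to-integer cast allowed for `Native` is an assumption that mirrors the rule described in the PR description.

```rust
/// Simplified stand-in for a data type (illustrative only).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ty {
    Int8,
    Int32,
    Int64,
    Float64,
    Utf8,
}

impl Ty {
    fn is_integer(self) -> bool {
        matches!(self, Ty::Int8 | Ty::Int32 | Ty::Int64)
    }
}

/// Two ways for a signature to describe an argument.
#[derive(Debug, Clone, Copy)]
enum Class {
    /// Accept any integer type and keep it unchanged (no payload needed).
    Integer,
    /// Require one specific target type.
    Native(Ty),
}

/// Returns the type the argument ends up with, or an error if it is rejected.
fn resolve(class: Class, current: Ty) -> Result<Ty, String> {
    match class {
        Class::Integer if current.is_integer() => Ok(current),
        Class::Integer => Err(format!("{current:?} is not an integer type")),
        // Assumption: only exact matches and integer-to-integer casts are allowed.
        Class::Native(target)
            if current == target || (current.is_integer() && target.is_integer()) =>
        {
            Ok(target)
        }
        Class::Native(target) => Err(format!("cannot coerce {current:?} to {target:?}")),
    }
}
```

In this sketch `resolve(Class::Integer, Ty::Int8)` keeps `Int8`, while `resolve(Class::Native(Ty::Int64), Ty::Int8)` yields `Int64`, which is why `Integer` carries no type payload.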


}
TypeSignatureClass::Integer(native_type) => {
let target_type = native_type.native();

Contributor

Suggested change
- let target_type = native_type.native();
+ Ok(current_type.to_owned())

return target_type.default_cast_for(current_type);
}
TypeSignatureClass::Numeric(native_type) => {
let target_type = native_type.native();

Contributor

Suggested change
- let target_type = native_type.native();
+ Ok(current_type.to_owned())

@jayzhan211 jayzhan211 dismissed their stale review February 6, 2025 13:19

Incorrect usage

TypeSignatureClass::Native(l) => get_data_types(l.native()),
TypeSignatureClass::Native(l)
| TypeSignatureClass::Numeric(l)
| TypeSignatureClass::Integer(l) => get_data_types(l.native()),

Contributor

get_data_types is used only in get_possible_types

@jayzhan211
Contributor

Closed by #14440

@jayzhan211 jayzhan211 closed this Feb 19, 2025
Labels

common (Related to common crate), documentation (Improvements or additions to documentation), functions (Changes to functions implementation), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), sqllogictest (SQL Logic Tests (.slt))

Development

Successfully merging this pull request may close these issues.

DataFusion Regression (Starting in v43): Type Coercion for UDF Arguments (X --> String) for Specified UDFs

4 participants