Skip to content

Conversation

@findepi
Copy link
Member

@findepi findepi commented Jul 15, 2025

Default implementation of ScalarUDFImpl::equals, AggregateUDFImpl::equals and WindowUDFImpl::equals is correct for stateless functions and those which only state is the Signature, which is most of the functions.

This implements equals and hash_value for functions that have state other than Signature object.

This fixes correctness issues which could occur when such stateful functions are used together in one query.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate spark labels Jul 15, 2025
Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a suggestion and a question.

Copy link
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks very reasonable, but there is a lot of boiler plate code. I'm guessing there wasn't an easier way to add some derived traits for PartialEq and Hash for these structs instead. I'll try to set aside some time to go through each of these, but it's going to take a bit.

@findepi
Copy link
Member Author

findepi commented Jul 15, 2025

The approach looks very reasonable, but there is a lot of boiler plate code. I'm guessing there wasn't an easier way to add some derived traits for PartialEq and Hash for these structs instead.

It would work for some, but not all.
If we want ptr eq for an Arc, we need hand-written eq and hash anyway.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi and @kosiew and @timsaucer -- I think this PR makes sense to me and I like how explicit it is.

It took a while to get through 😅
Screenshot 2025-07-15 at 4 53 35 PM

I think it could be reduced in size somewhat by adding aliases to the default implementations.

My biggest concern is that in the future when we add new functions / modify existing ones we will forget to add a corresponding equals and hash implementations

While I also don't like the amount of almost-identical copy/paste code in this PR I don't really have any better suggestion and thus I think this PR is an improvement over main

To try and mitigage the issues, I suggest:

  1. Add a comment to all the implementations added in this PR like // Implement equals because there are fields other than name and signature and // implement hash because there are fields other than name and signature
  2. Maybe update the comments in
    /// Return true if this scalar UDF is equal to the other.
    ///
    /// Allows customizing the equality of scalar UDFs.
    /// Must be consistent with [`Self::hash_value`] and follow the same rules as [`Eq`]:
    ///
    /// - reflexive: `a.equals(a)`;
    /// - symmetric: `a.equals(b)` implies `b.equals(a)`;
    /// - transitive: `a.equals(b)` and `b.equals(c)` implies `a.equals(c)`.
    ///
    /// By default, compares [`Self::name`] and [`Self::signature`].
    and
    /// Returns a hash value for this scalar UDF.
    ///
    /// Allows customizing the hash code of scalar UDFs. Similarly to [`Hash`] and [`Eq`],
    /// if [`Self::equals`] returns true for two UDFs, their `hash_value`s must be the same.
    ///
    /// By default, hashes [`Self::name`] and [`Self::signature`].
    to say when the functions should be overridden (for example "If your implementation has fields other than signature and name, such as aliases, you likely should implement equals and hash as well`)

@findepi
Copy link
Member Author

findepi commented Jul 16, 2025

I think it could be reduced in size somewhat by adding aliases to the default implementations.

i'd rather see stuff like aliases and documentation not be part of ScalardUDFImpl at all.
The logical plan doesn't have to bear documentation of its constituents.

But yeah, i can add aliases there.

My biggest concern is that in the future when we add new functions / modify existing ones we will forget to add a corresponding equals and hash implementations

This is absolutely NOT improved by this PR. The problem remains unsolved.

To try and mitigage the issues, I suggest:

  1. Add a comment to all the implementations added in this PR like // Implement equals because there are fields other than name and signature and // implement hash because there are fields other than name and signature

I don't think it helps.
Whenever you see equals/hash functions you know what they are for.
The crux of the problem is -- you need to be aware of their existence in the ScalardUDFImpl to even implement them!
I'd suggest removing default implementations if we cannot make them correct. Not in this PR.

@github-actions github-actions bot removed the spark label Jul 16, 2025
@findepi
Copy link
Member Author

findepi commented Jul 17, 2025

@alamb @timsaucer @kosiew would you like to take a look at new code pushed here since the time you last reviewed?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the commits since I last reviewed and they look good to me -- thank you @findepi

@findepi findepi merged commit afd8235 into apache:main Jul 18, 2025
27 checks passed
@findepi findepi deleted the findepi/equals branch July 18, 2025 09:27
findepi added a commit to sdf-labs/datafusion that referenced this pull request Jul 24, 2025
* Implement equals for stateful functions

Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.

* downgrade for MSRV

* Improve doc

* Update default UDF:: equals to compare aliases too

* Update default UDF:: equals to compare type too (‼️)

* remove now-obsoleted UDF equals/hash customizations

remove these which compare signature, aliases, as the default handles
these now

* remove equals impl which compares name, signature -- default does that

* cleanup imports

(cherry picked from commit afd8235)
findepi added a commit to sdf-labs/datafusion that referenced this pull request Jul 24, 2025
* Implement equals for stateful functions

Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.

* downgrade for MSRV

* Improve doc

* Update default UDF:: equals to compare aliases too

* Update default UDF:: equals to compare type too (‼️)

* remove now-obsoleted UDF equals/hash customizations

remove these which compare signature, aliases, as the default handles
these now

* remove equals impl which compares name, signature -- default does that

* cleanup imports

(cherry picked from commit afd8235)
findepi added a commit to sdf-labs/datafusion that referenced this pull request Aug 4, 2025
* Implement equals for stateful functions

Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.

* downgrade for MSRV

* Improve doc

* Update default UDF:: equals to compare aliases too

* Update default UDF:: equals to compare type too (‼️)

* remove now-obsoleted UDF equals/hash customizations

remove these which compare signature, aliases, as the default handles
these now

* remove equals impl which compares name, signature -- default does that

* cleanup imports

(cherry picked from commit afd8235)
findepi added a commit to sdf-labs/datafusion that referenced this pull request Aug 13, 2025
* Implement equals for stateful functions

Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.

* downgrade for MSRV

* Improve doc

* Update default UDF:: equals to compare aliases too

* Update default UDF:: equals to compare type too (‼️)

* remove now-obsoleted UDF equals/hash customizations

remove these which compare signature, aliases, as the default handles
these now

* remove equals impl which compares name, signature -- default does that

* cleanup imports

(cherry picked from commit afd8235)
findepi added a commit to sdf-labs/datafusion that referenced this pull request Aug 13, 2025
* Implement equals for stateful functions

Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.

* downgrade for MSRV

* Improve doc

* Update default UDF:: equals to compare aliases too

* Update default UDF:: equals to compare type too (‼️)

* remove now-obsoleted UDF equals/hash customizations

remove these which compare signature, aliases, as the default handles
these now

* remove equals impl which compares name, signature -- default does that

* cleanup imports

(cherry picked from commit afd8235)
findepi added a commit to sdf-labs/datafusion that referenced this pull request Sep 1, 2025
* Implement equals for stateful functions

Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.

* downgrade for MSRV

* Improve doc

* Update default UDF:: equals to compare aliases too

* Update default UDF:: equals to compare type too (‼️)

* remove now-obsoleted UDF equals/hash customizations

remove these which compare signature, aliases, as the default handles
these now

* remove equals impl which compares name, signature -- default does that

* cleanup imports

(cherry picked from commit afd8235)
findepi added a commit to sdf-labs/datafusion that referenced this pull request Sep 5, 2025
* Implement equals for stateful functions

Default implementation of `ScalarUDFImpl::equals`,
`AggregateUDFImpl::equals` and `WindowUDFImpl::equals` is correct for
stateless functions and those which only state is the `Signature`, which
is most of the functions.

This implements `equals` and `hash_value` for functions that have state
other than `Signature` object.

This fixes correctness issues which could occur when such stateful
functions are used together in one query.

* downgrade for MSRV

* Improve doc

* Update default UDF:: equals to compare aliases too

* Update default UDF:: equals to compare type too (‼️)

* remove now-obsoleted UDF equals/hash customizations

remove these which compare signature, aliases, as the default handles
these now

* remove equals impl which compares name, signature -- default does that

* cleanup imports

(cherry picked from commit afd8235)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate ffi Changes to the ffi crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate sql SQL Planner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants