-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous #38676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I did not manage to test this in My approach is This creates plan while this plan would be required to expose the bug (both |
|
Note: |
6599f96 to
08f6ea3
Compare
|
@wangyum @cloud-fan appreciate your suggestion on how to test this bug in |
|
@EnricoMi @cloud-fan Could we fix the |
Interesting, that sounds like a better solution. I'll look into it. |
|
Problem is that There is now a second run of rule It is now safe to apply rule This could potentially be done for all operators specifically handled in Deduplicating attributes that are already referenced will break the plan as those references break. https://github.com/G-Research/spark/actions/runs/3498935957/jobs/5862050045 |
0948536 to
c7eaaa2
Compare
| condition: Option[Expression]) extends UnaryNode { | ||
|
|
||
| require(Seq(Inner, LeftOuter, Cross).contains(joinType), | ||
| require(Seq(Inner, LeftOuter, Cross).contains(joinType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just needed by sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/DeduplicateRelationsSuite.scala:
val originalQuery = left.lateralJoin(right, UsingJoin(Inner, Seq("a")))
|
Can one of the admins verify this patch? |
|
@wangyum @cloud-fan I am not sure if this is the right approach to fix Problem is that |
|
@wangyum @cloud-fan what do you think about my approach? Do you have a suggestion for a better strategy? |
|
@wangyum @cloud-fan do you consider this issue a correctness bug? |
|
I tried looking into this a bit
As @EnricoMi said Similar to the |
|
@shardulm94 you are right, I have created #39131, to keep it separate from this approach, which tries to fix this issue through Thanks for the pointer! |
|
Closed in favour of #39131. |
What changes were proposed in this pull request?
Rule
PushDownLeftSemiAntiJoinshould not push an anti-join below anAggregatewhen the translated (cf.aliasMap) join conditions become ambiguous w.r.t. to both join sides.Why are the changes needed?
This example fails with
distinct(), and succeeds withoutdistinct(), but both queries are identical:With
distinct(), rulePushDownLeftSemiAntiJoincreates a join condition(id#774 + 1) = id#774, which can never be true. This effectively removes the anti-join.Before this PR:
The anti-join is fully removed from the plan.
This is caused by
PushDownLeftSemiAntiJoinadding join condition(id#774 + 1) = id#774:After this PR:
Join condition
id#776 = id#774is still translated into(id#774 + 1) = id#774but recognized as ambiguous to both sides of the prospect join and hence not pushed down. The rule is then not applied any more.The final plan contains the anti-join:
Does this PR introduce any user-facing change?
It fixes correctness.
How was this patch tested?
Unit test.