@EnricoMi (Contributor) commented Nov 16, 2022

What changes were proposed in this pull request?

Rule PushDownLeftSemiAntiJoin should not push an anti-join below an Aggregate when the translated join conditions (cf. aliasMap) become ambiguous with respect to both join sides.

Why are the changes needed?

This example fails with distinct() and succeeds without it, even though both variants are semantically equivalent here (the input has no duplicates):

val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, "id", "left_anti").collect()
assert(result.length == 1)

With distinct(), rule PushDownLeftSemiAntiJoin creates a join condition (id#774 + 1) = id#774, which can never be true. This effectively removes the anti-join.
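
A toy model in plain Scala (illustrative only, not Catalyst code) of what happens: the Aggregate's aliasMap defines id#776 as (id#774 + 1), and rewriting the join condition through that map produces the contradiction.

// Hypothetical string-level model of the aliasMap substitution.
val aliasMap = Map("id#776" -> "(id#774 + 1)")
val pushed = aliasMap.foldLeft("id#776 = id#774") {
  case (cond, (attr, expr)) => cond.replace(attr, expr)
}
// Compares an attribute with itself plus one: false for every row.
assert(pushed == "(id#774 + 1) = id#774")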

Before this PR:
The anti-join is fully removed from the plan.

*(2) HashAggregate(keys=[id#4], functions=[], output=[id#6])
+- AQEShuffleRead coalesced
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#4, 200), ENSURE_REQUIREMENTS, [plan_id=19]
         +- *(1) HashAggregate(keys=[id#4], functions=[], output=[id#4])
            +- *(1) LocalTableScan [id#4]

This is caused by PushDownLeftSemiAntiJoin adding join condition (id#774 + 1) = id#774:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#776 = id#774)                  'Aggregate [id#774], [(id#774 + 1) AS id#776]
!:- Aggregate [id#774], [(id#774 + 1) AS id#776]   +- 'Join LeftAnti, ((id#774 + 1) = id#774)
!:  +- LocalRelation [id#774]                         :- LocalRelation [id#774]
!+- Aggregate [id#774], [id#774]                      +- Aggregate [id#774], [id#774]
!   +- LocalRelation [id#774]                            +- LocalRelation [id#774]

After this PR:
The join condition id#776 = id#774 is still translated into (id#774 + 1) = id#774, but it is now recognized as ambiguous with respect to both sides of the prospective join and hence is not pushed down. The rule then no longer applies.
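
A minimal sketch of the idea (isAmbiguous is a hypothetical name, not the exact code in this PR): a translated condition counts as ambiguous when some of its references occur in the output of both prospective join sides, so the predicate could bind to either side after push-down.

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Ambiguous: the condition references attributes that both sides produce.
def isAmbiguous(cond: Expression, left: LogicalPlan, right: LogicalPlan): Boolean =
  cond.references.intersect(left.outputSet).intersect(right.outputSet).nonEmpty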

The final plan contains the anti-join:

*(4) BroadcastHashJoin [id#53], [id#51], LeftAnti, BuildRight, false
:- *(4) HashAggregate(keys=[id#51], functions=[], output=[id#53])
:  +- AQEShuffleRead coalesced
:     +- ShuffleQueryStage 0
:        +- Exchange hashpartitioning(id#51, 200), ENSURE_REQUIREMENTS, [plan_id=382]
:           +- *(1) HashAggregate(keys=[id#51], functions=[], output=[id#51])
:              +- *(1) LocalTableScan [id#51]
+- BroadcastQueryStage 3
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=424]
      +- *(3) HashAggregate(keys=[id#51], functions=[], output=[id#51])
         +- AQEShuffleRead coalesced
            +- ShuffleQueryStage 2
               +- ReusedExchange [id#51], Exchange hashpartitioning(id#51, 200), ENSURE_REQUIREMENTS, [plan_id=382]

Does this PR introduce any user-facing change?

Yes, it fixes a correctness issue: the anti-join is no longer silently removed from the plan.

How was this patch tested?

Unit test.

The github-actions bot added the SQL label Nov 16, 2022
@EnricoMi (Contributor, Author) commented Nov 16, 2022

I did not manage to test this in LeftSemiAntiJoinPushDownSuite, which would be preferable.

My approach is:

  test("Aggregate: LeftAnti join no pushdown on ambiguity") {
    val relation = testRelation
      .groupBy($"b")($"b", sum($"c").as("sum"))
    val relationPlusOne = relation.select(($"b" + 1).as("b"))

    val originalQuery = relationPlusOne
      .join(relation, joinType = LeftAnti, usingColumns = Seq("b"))

    val optimized = Optimize.execute(originalQuery.analyze)
    comparePlans(optimized, originalQuery.analyze)
  }

This creates the plan:

Project [b#7]
+- Join LeftAnti, (b#7 = b#12)
   :- Project [(b#1 + 1) AS b#7]
   :  +- Aggregate [b#1], [b#1, sum(c#2) AS sum#6L]
   :     +- LocalRelation <empty>, [a#0, b#1, c#2]
   +- Aggregate [b#12], [b#12, sum(c#13) AS sum#6L]
      +- LocalRelation <empty>, [a#11, b#12, c#13]

while this plan would be required to expose the bug (both Aggregate plans must have identical references):

Project [b#7]
+- Join LeftAnti, (b#7 = b#1)
   :- Project [(b#1 + 1) AS b#7]
   :  +- Aggregate [b#1], [b#1, sum(c#2) AS sum#6L]
   :     +- LocalRelation <empty>, [a#0, b#1, c#2]
   +- Aggregate [b#1], [b#1, sum(c#2) AS sum#6L]
      +- LocalRelation <empty>, [a#0, b#1, c#2]
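
One way to force identical expression IDs could be to construct the Join node manually from a single analyzed child, bypassing the analyzer's deduplication. This is an untested sketch, reusing the suite's DSL and Optimize, with Join, JoinHint, LeftAnti, Project and Alias imported from Catalyst:

// Untested sketch: reuse one analyzed Aggregate on both join sides so the
// expression IDs collide, then optimize without re-analyzing the Join.
val agg = testRelation.groupBy($"b")($"b", sum($"c").as("sum")).analyze
val left = Project(Seq(Alias(agg.output.head + 1, "b")()), agg)
val join = Join(left, agg, LeftAnti,
  Some(left.output.head === agg.output.head), JoinHint.NONE)
// After the fix, the rule must leave this join untouched.
comparePlans(Optimize.execute(join), join)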

@EnricoMi (Contributor, Author) commented:

Note: the Window, Union and UnaryNode cases in PushDownLeftSemiAntiJoin might be affected as well and should also be covered by tests in LeftSemiAntiJoinPushDownSuite.

@EnricoMi changed the title from "[SPARK-41162][SQL] Do not push down join predicate that are ambiguous to both sides" to "[SPARK-41162][SQL] Do not push down join predicates that are ambiguous" Nov 16, 2022
@EnricoMi changed the title from "[SPARK-41162][SQL] Do not push down join predicates that are ambiguous" to "[SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous" Nov 16, 2022
@EnricoMi force-pushed the branch-leftanti-rule-fix branch from 6599f96 to 08f6ea3 on November 17, 2022 07:15
@EnricoMi (Contributor, Author) commented:

@wangyum @cloud-fan I would appreciate your suggestions on how to test this bug in LeftSemiAntiJoinPushDownSuite (see #38676 (comment)).

@wangyum (Member) commented Nov 18, 2022

@EnricoMi @cloud-fan Could we fix DeduplicateRelations? It did not generate different expression IDs for all conflicting attributes:

=== Applying Rule org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
 Join LeftSemi                         Join LeftSemi
 :- Project [(id#4 + 1) AS id#6]       :- Project [(id#4 + 1) AS id#6]
 :  +- Deduplicate [id#4]              :  +- Deduplicate [id#4]
 :     +- Project [value#1 AS id#4]    :     +- Project [value#1 AS id#4]
 :        +- LocalRelation [value#1]   :        +- LocalRelation [value#1]
 +- Deduplicate [id#4]                 +- Deduplicate [id#4]
!   +- Project [value#1 AS id#4]          +- Project [value#8 AS id#4]
!      +- LocalRelation [value#1]            +- LocalRelation [value#8]

@EnricoMi (Contributor, Author) commented:

Could we fix DeduplicateRelations?

Interesting, that sounds like a better solution. I'll look into it.

@EnricoMi (Contributor, Author) commented Nov 18, 2022

The problem is that DeduplicateRelations only considers duplicates between the left output and the right output, not duplicates between the left references and the right output. I have sketched a fix for Join and LateralJoin, including a proper test.
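
Roughly, the sketched fix looks like this (a simplified, hypothetical helper, not the code in this PR; it only inspects the immediate left node's references and re-keys the conflicting attributes throughout the right subtree):

import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}

def dedupRight(j: Join): LogicalPlan = {
  // Attributes referenced (not just output) on the left that the right child
  // also outputs: these are the conflicts DeduplicateRelations misses today.
  val conflicting = j.left.references.intersect(j.right.outputSet)
  if (conflicting.isEmpty) j
  else {
    // Give each conflicting attribute a fresh expression ID everywhere it
    // occurs in the right subtree (definitions and references alike).
    val rewrite = AttributeMap(conflicting.toSeq.map(a => a -> a.newInstance()))
    val newRight = j.right.transformUp {
      case plan => plan.transformExpressions {
        case a: Attribute => rewrite.getOrElse(a, a)
      }
    }
    // The enclosing join condition must afterwards be re-resolved against the
    // new right output, as the analyzer does for UsingJoin.
    j.copy(right = newRight)
  }
}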

There is now a second run of rule DeduplicateRelations:

=== Applying Rule org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
 'Join UsingJoin(Inner,List(a))                 'Join UsingJoin(Inner,List(a))
 :- Project [(a#1 + 1) AS a#2]                  :- Project [(a#1 + 1) AS a#2]
 :  +- Deduplicate                              :  +- Deduplicate
 :     +- Project [value#0 AS a#1]              :     +- Project [value#0 AS a#1]
 :        +- LocalRelation <empty>, [value#0]   :        +- LocalRelation <empty>, [value#0]
 +- Deduplicate                                 +- Deduplicate
!   +- Project [value#3 AS a#1]                    +- Project [value#3 AS a#4]
       +- LocalRelation <empty>, [value#3]            +- LocalRelation <empty>, [value#3]

It is now safe to apply rule PushDownLeftSemiAntiJoin.

This could potentially be done for all operators specifically handled in DeduplicateRelations.apply(), i.e. AsOfJoin, Intersect, Except, Union and MergeIntoTable.

Deduplicating attributes that are already referenced elsewhere breaks the plan, because those references become dangling:

After applying rule org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.

https://github.com/G-Research/spark/actions/runs/3498935957/jobs/5862050045

@EnricoMi force-pushed the branch-leftanti-rule-fix branch from 0948536 to c7eaaa2 on November 18, 2022 17:26
Review comment on the following diff hunk:

  condition: Option[Expression]) extends UnaryNode {

-  require(Seq(Inner, LeftOuter, Cross).contains(joinType),
+  require(Seq(Inner, LeftOuter, Cross).contains(joinType match {

@EnricoMi (Contributor, Author) commented:

This change is just needed by sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/DeduplicateRelationsSuite.scala:

val originalQuery = left.lateralJoin(right, UsingJoin(Inner, Seq("a")))

@AmplabJenkins commented:

Can one of the admins verify this patch?

@EnricoMi (Contributor, Author) commented:

@wangyum @cloud-fan I am not sure if this is the right approach to fix DeduplicateRelations. Please advise.

The problem is that DeduplicateRelations only considers duplicates between the left output and the right output, but this situation is caused by duplicates between the left references and the right output.

@EnricoMi (Contributor, Author) commented Dec 2, 2022

@wangyum @cloud-fan what do you think about my approach? Do you have a suggestion for a better strategy?

@EnricoMi (Contributor, Author) commented:

@wangyum @cloud-fan do you consider this issue a correctness bug?

@shardulm94 (Contributor) commented:

I tried looking into this a bit.

@EnricoMi @cloud-fan Could we fix DeduplicateRelations? It did not generate different expression IDs for all conflicting attributes:

As @EnricoMi said, DeduplicateRelations only considers the output attributes of the left and right sides, which do not conflict here. Also, the Project case in PushDownLeftSemiAntiJoin calls this method, which seems to check for the self-join case based on conflicting expression IDs. This makes me believe the duplicate expression IDs are expected here, and hence DeduplicateRelations may not be at fault.

Similar to the Project case, should we add a check like canPushThroughCondition(Seq(agg.child), joinCond, rightOp) to ensure that it is safe to push the join below an Aggregate node too?
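
For context, that guard is roughly the following (paraphrased, not a verbatim copy of the rule's private helper): it rejects the push-down when the join condition references attributes that the right join side also outputs, because after substitution those references would form a self-comparison on identical expression IDs.

import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def canPushThroughCondition(
    plans: Seq[LogicalPlan],
    condition: Option[Expression],
    rightOp: LogicalPlan): Boolean = {
  // Attributes produced below the operator the join would be pushed through.
  val attributes = AttributeSet(plans.flatMap(_.output))
  // Safe only if no condition reference is both one of those attributes and
  // part of the right side's output.
  condition.forall(_.references.intersect(rightOp.outputSet).intersect(attributes).isEmpty)
}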

@EnricoMi (Contributor, Author) commented:

@shardulm94 you are right: canPushThroughCondition already guards Project and Union against this situation, so that should be the natural way to fix this for Aggregate as well. It is the smallest possible change that fixes this issue.

I have created #39131 to keep that fix separate from this PR, which tries to fix the issue through DeduplicateRelations. I will close this PR if the other one makes it.

Thanks for the pointer!

@EnricoMi (Contributor, Author) commented Jan 4, 2023

Closed in favour of #39131.

@EnricoMi closed this Jan 4, 2023