
Conversation

@wangyum (Member) commented Oct 25, 2019

What changes were proposed in this pull request?

This PR tries to improve EliminateOuterJoin performance by avoiding generating too many constraints. For example:

import org.apache.spark.sql.catalyst.plans.logical.Project
spark.sql("CREATE TABLE IF NOT EXISTS spark_29606(a int, b int, c int) USING parquet")
spark.sql("SELECT a as a1, b as b1, c as c1, abc as abc1 FROM (SELECT a, b, c, a + b + c as abc FROM spark_29606) t")
  .queryExecution.analyzed.asInstanceOf[Project].validConstraints.toSeq.sortBy(_.toString).foreach(println)

Before this PR:

(((a#5 + b#6) + c#7) <=> abc#0)
(((a#5 + b#6) + c#7) <=> abc1#4)
(((a#5 + b#6) + c1#3) <=> abc#0)
(((a#5 + b#6) + c1#3) <=> abc1#4)
(((a#5 + b1#2) + c#7) <=> abc#0)
(((a#5 + b1#2) + c#7) <=> abc1#4)
(((a#5 + b1#2) + c1#3) <=> abc#0)
(((a#5 + b1#2) + c1#3) <=> abc1#4)
(((a1#1 + b#6) + c#7) <=> abc#0)
(((a1#1 + b#6) + c#7) <=> abc1#4)
(((a1#1 + b#6) + c1#3) <=> abc#0)
(((a1#1 + b#6) + c1#3) <=> abc1#4)
(((a1#1 + b1#2) + c#7) <=> abc#0)
(((a1#1 + b1#2) + c#7) <=> abc1#4)
(((a1#1 + b1#2) + c1#3) <=> abc#0)
(((a1#1 + b1#2) + c1#3) <=> abc1#4)
(a#5 <=> a1#1)
(abc#0 <=> abc1#4)
(b#6 <=> b1#2)
(c#7 <=> c1#3)

After this PR:

(((a#5 + b#6) + c#7) <=> abc#0)
(a#5 <=> a1#1)
(abc#0 <=> abc1#4)
(b#6 <=> b1#2)
(c#7 <=> c1#3)
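
For illustration only, here is a standalone Scala sketch (plain collections, not the optimizer's code) of why the "Before this PR" list blows up: each of the four aliased attributes (a/a1, b/b1, c/c1, abc/abc1) can appear in either its original or aliased form, so the single base constraint expands into 2^4 = 16 variants.

// Illustration of the combinatorial expansion only; not Spark internals.
val aliases = Map("a" -> "a1", "b" -> "b1", "c" -> "c1", "abc" -> "abc1")
val slots = Seq("a", "b", "c", "abc")
// Each attribute can appear as itself or as its alias: 2^4 = 16 combinations.
val variants = slots.foldLeft(Seq(Seq.empty[String])) { (acc, attr) =>
  acc.flatMap(prefix => Seq(prefix :+ attr, prefix :+ aliases(attr)))
}
variants.map { case Seq(x, y, z, r) => s"((($x + $y) + $z) <=> $r)" }
  .sorted
  .foreach(println)  // 16 lines, mirroring the "Before this PR" list (modulo expression IDs)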

Why are the changes needed?

Improve EliminateOuterJoin performance.

Before this PR:

=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 9323
Total time: 15.995000924 seconds

Rule                                                                                               Effective Time / Total Time                     Effective Runs / Total Runs

org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin                                         0 / 13359017999                                 0 / 4
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations                                   990683801 / 991674120                           2 / 18
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables                                      0 / 443718064                                   0 / 18
org.apache.spark.sql.catalyst.analysis.DecimalPrecision                                            43087519 / 81709524                             2 / 18
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions                               63414650 / 63414650                             1 / 1
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences                                  42587256 / 62760566                             5 / 18
...

After this PR:

=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 9323
Total time: 3.03253427 seconds

Rule                                                                                               Effective Time / Total Time                     Effective Runs / Total Runs

org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations                                   1130336633 / 1131323257                         2 / 18
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables                                      0 / 448236638                                   0 / 18
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin                                         0 / 107133411                                   0 / 4
org.apache.spark.sql.catalyst.analysis.DecimalPrecision                                            43965067 / 84085638                             2 / 18
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences                                  44448673 / 66690250                             5 / 18
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions                               59631169 / 59631169                             1 / 1
...
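
For reference, a report in this format can be produced from a spark-shell session with RuleExecutor's metering helpers; this is only a sketch, since the PR does not show the exact harness used to collect the numbers above.

import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetMetrics()            // clear the accumulated per-rule timings
// ... run the workload to measure, e.g. spark.sql(query).collect() ...
println(RuleExecutor.dumpTimeSpent())  // prints the "=== Metrics of Analyzer/Optimizer Rules ===" report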

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@SparkQA commented Oct 25, 2019

Test build #112678 has finished for PR 26257 at commit 6337c77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Oct 28, 2019

still WIP?

@wangyum (Member, Author) commented Oct 29, 2019

Thank you @maropu. Actually, I am not very confident about this change. Does it make sense to you?

@maropu (Member) commented Oct 29, 2019

Ah, I see. OK, I haven't dug into this yet, so I'll check later.

@SparkQA commented Nov 15, 2019

Test build #113884 has finished for PR 26257 at commit 719d812.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum changed the title from "[WIP][SPARK-29606][SQL] Improve EliminateOuterJoin performance" to "[SPARK-29606][SQL] Improve EliminateOuterJoin performance" on Nov 16, 2019
@SparkQA commented Nov 16, 2019

Test build #113922 has finished for PR 26257 at commit ca4480e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Nov 17, 2019

cc @cloud-fan @viirya

@github-actions (bot) commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Feb 27, 2020
@wangyum removed the Stale label on Feb 27, 2020
@SparkQA commented Feb 29, 2020

Test build #119128 has finished for PR 26257 at commit 29fe7f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Mar 1, 2020

Metrics of Analyzer/Optimizer Rules for TPCDSQuerySuite.
Before this PR:

05:43:31.429 WARN org.apache.spark.sql.TPCDSQuerySuite: 
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 224786
Total time: 65.253508851 seconds
...

After this PR:

05:38:55.596 WARN org.apache.spark.sql.TPCDSQuerySuite: 
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 224786
Total time: 56.81012364 seconds

A reviewer (Member) left an inline comment on this hunk of the diff:

        a.toAttribute
    })
    allConstraints ++= allConstraints.map {
      case e @ EqualNullSafe(l, _: AttributeReference) if l.references.size > 1 => e

I feel it's a bit difficult to understand this pattern matching at a glance; could you leave a comment about what it means? Presumably you want to skip the pattern below for performance?
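
A minimal spark-shell sketch of what that case matches; the "skip re-substitution" intent described in the comments is the reviewer's reading above, not something confirmed in this thread.

import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, EqualNullSafe}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()
val abc = AttributeReference("abc", IntegerType)()

// (a + b) <=> abc: the left side references two attributes and the right side is a plain
// attribute, so it matches EqualNullSafe(l, _: AttributeReference) with l.references.size > 1
// and is mapped to itself, i.e. presumably left alone rather than having aliases
// substituted into it again.
val constraint = EqualNullSafe(Add(a, b), abc)
constraint.left.references.size  // 2, so the guard holds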

@SparkQA commented May 2, 2020

Test build #122201 has finished for PR 26257 at commit 6b88d28.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented May 2, 2020

retest this please

@SparkQA commented May 2, 2020

Test build #122212 has finished for PR 26257 at commit 6b88d28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions (bot) commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
