
Conversation

@wangyum (Member) commented Oct 25, 2019

What changes were proposed in this pull request?

This PR tries to improve EliminateOuterJoin performance by avoiding generating too many constraints. For example:

import org.apache.spark.sql.catalyst.plans.logical.Project
spark.sql("CREATE TABLE IF NOT EXISTS spark_29606(a int, b int, c int) USING parquet")
spark.sql("SELECT a as a1, b as b1, c as c1, abc as abc1 FROM (SELECT a, b, c, a + b + c as abc FROM spark_29606) t")
  .queryExecution.analyzed.asInstanceOf[Project].validConstraints.toSeq.sortBy(_.toString).foreach(println)

Before this PR:

(((a#5 + b#6) + c#7) <=> abc#0)
(((a#5 + b#6) + c#7) <=> abc1#4)
(((a#5 + b#6) + c1#3) <=> abc#0)
(((a#5 + b#6) + c1#3) <=> abc1#4)
(((a#5 + b1#2) + c#7) <=> abc#0)
(((a#5 + b1#2) + c#7) <=> abc1#4)
(((a#5 + b1#2) + c1#3) <=> abc#0)
(((a#5 + b1#2) + c1#3) <=> abc1#4)
(((a1#1 + b#6) + c#7) <=> abc#0)
(((a1#1 + b#6) + c#7) <=> abc1#4)
(((a1#1 + b#6) + c1#3) <=> abc#0)
(((a1#1 + b#6) + c1#3) <=> abc1#4)
(((a1#1 + b1#2) + c#7) <=> abc#0)
(((a1#1 + b1#2) + c#7) <=> abc1#4)
(((a1#1 + b1#2) + c1#3) <=> abc#0)
(((a1#1 + b1#2) + c1#3) <=> abc1#4)
(a#5 <=> a1#1)
(abc#0 <=> abc1#4)
(b#6 <=> b1#2)
(c#7 <=> c1#3)

After this PR:

(((a#5 + b#6) + c#7) <=> abc#0)
(a#5 <=> a1#1)
(abc#0 <=> abc1#4)
(b#6 <=> b1#2)
(c#7 <=> c1#3)
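
For illustration only, here is a standalone Scala sketch (plain collections, not the optimizer's code) of why the "Before this PR" list blows up: each of the four aliased attributes (a/a1, b/b1, c/c1, abc/abc1) can appear in either its original or aliased form, so the single base constraint expands into 2^4 = 16 variants.

// Illustration of the combinatorial expansion only; not Spark internals.
val aliases = Map("a" -> "a1", "b" -> "b1", "c" -> "c1", "abc" -> "abc1")
val slots = Seq("a", "b", "c", "abc")
// Each attribute can appear as itself or as its alias: 2^4 = 16 combinations.
val variants = slots.foldLeft(Seq(Seq.empty[String])) { (acc, attr) =>
  acc.flatMap(prefix => Seq(prefix :+ attr, prefix :+ aliases(attr)))
}
variants.map { case Seq(x, y, z, r) => s"((($x + $y) + $z) <=> $r)" }
  .sorted
  .foreach(println)  // 16 lines, mirroring the "Before this PR" list (modulo expression IDs)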

Why are the changes needed?

Improve EliminateOuterJoin performance.

Before this PR:

=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 9323
Total time: 15.995000924 seconds

Rule                                                                                               Effective Time / Total Time                     Effective Runs / Total Runs

org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin                                         0 / 13359017999                                 0 / 4
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations                                   990683801 / 991674120                           2 / 18
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables                                      0 / 443718064                                   0 / 18
org.apache.spark.sql.catalyst.analysis.DecimalPrecision                                            43087519 / 81709524                             2 / 18
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions                               63414650 / 63414650                             1 / 1
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences                                  42587256 / 62760566                             5 / 18
...

After this PR:

=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 9323
Total time: 3.03253427 seconds

Rule                                                                                               Effective Time / Total Time                     Effective Runs / Total Runs

org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations                                   1130336633 / 1131323257                         2 / 18
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables                                      0 / 448236638                                   0 / 18
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin                                         0 / 107133411                                   0 / 4
org.apache.spark.sql.catalyst.analysis.DecimalPrecision                                            43965067 / 84085638                             2 / 18
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences                                  44448673 / 66690250                             5 / 18
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions                               59631169 / 59631169                             1 / 1
...
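
For reference, a report in this format can be produced from a spark-shell session with RuleExecutor's metering helpers; this is only a sketch, since the PR does not show the exact harness used to collect the numbers above.

import org.apache.spark.sql.catalyst.rules.RuleExecutor

RuleExecutor.resetMetrics()            // clear the accumulated per-rule timings
// ... run the workload to measure, e.g. spark.sql(query).collect() ...
println(RuleExecutor.dumpTimeSpent())  // prints the "=== Metrics of Analyzer/Optimizer Rules ===" report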

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@SparkQA commented Oct 25, 2019

Test build #112678 has finished for PR 26257 at commit 6337c77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Oct 28, 2019

still WIP?

@wangyum (Member, Author) commented Oct 29, 2019

Thank you @maropu. Actually, I am not very confident about this change. Does it make sense to you?

@maropu (Member) commented Oct 29, 2019

Ah, I see. OK, I haven't dug into this yet, so I'll check later.

@SparkQA commented Nov 15, 2019

Test build #113884 has finished for PR 26257 at commit 719d812.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum changed the title from "[WIP][SPARK-29606][SQL] Improve EliminateOuterJoin performance" to "[SPARK-29606][SQL] Improve EliminateOuterJoin performance" on Nov 16, 2019
@SparkQA commented Nov 16, 2019

Test build #113922 has finished for PR 26257 at commit ca4480e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Nov 17, 2019

cc @cloud-fan @viirya

@github-actions (bot) commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Feb 27, 2020
@wangyum removed the Stale label on Feb 27, 2020
@SparkQA commented Feb 29, 2020

Test build #119128 has finished for PR 26257 at commit 29fe7f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Mar 1, 2020

Metrics of Analyzer/Optimizer Rules for TPCDSQuerySuite.
Before this PR:

05:43:31.429 WARN org.apache.spark.sql.TPCDSQuerySuite: 
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 224786
Total time: 65.253508851 seconds
...

After this PR:

05:38:55.596 WARN org.apache.spark.sql.TPCDSQuerySuite: 
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 224786
Total time: 56.81012364 seconds

A reviewer (Member) left an inline comment on this hunk of the diff:

        a.toAttribute
    })
    allConstraints ++= allConstraints.map {
      case e @ EqualNullSafe(l, _: AttributeReference) if l.references.size > 1 => e

I feel it's a bit difficult to understand this pattern matching at a glance; could you leave a comment about what it means? Presumably you want to skip the pattern below for performance?
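
A minimal spark-shell sketch of what that case matches; the "skip re-substitution" intent described in the comments is the reviewer's reading above, not something confirmed in this thread.

import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, EqualNullSafe}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()
val abc = AttributeReference("abc", IntegerType)()

// (a + b) <=> abc: the left side references two attributes and the right side is a plain
// attribute, so it matches EqualNullSafe(l, _: AttributeReference) with l.references.size > 1
// and is mapped to itself, i.e. presumably left alone rather than having aliases
// substituted into it again.
val constraint = EqualNullSafe(Add(a, b), abc)
constraint.left.references.size  // 2, so the guard holds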

@SparkQA commented May 2, 2020

Test build #122201 has finished for PR 26257 at commit 6b88d28.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented May 2, 2020

retest this please

@SparkQA commented May 2, 2020

Test build #122212 has finished for PR 26257 at commit 6b88d28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions (bot) commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
