[SPARK-33152][SQL] Improve the performance of constraint propagation for Project and Aggregate #30894

tanelk · 2020-12-22T20:12:41Z

What changes were proposed in this pull request?

To improve performance, this PR changes UnaryNode.getAllValidConstraints to eagerly discard constraints that cannot affect the end result.

Why are the changes needed?

There have been at least two attempts to speed up the constraint system: #30185 and #26257. Both seem to have stalled.

Optimizing Project and Aggregate nodes can take exponential memory and time in the number of aliases they define. The simplest example is a Project with incoming columns a1, a2, ..., an and a child constraint a1 + a2 + ... + an > 0. If it aliases its columns a1 as b1, a2 as b2, ..., an as bn, then UnaryNode.getAllValidConstraints returns 2 to the power of n constraints (plus some isNotNull constraints), because each column is replaced by its alias in half of the constraints. All but one of these constraints are filtered out later on; only the one where every column is replaced by its alias is kept. Eagerly filtering the others out improves performance and avoids a possible OOM.
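The blow-up described above can be reproduced without Spark. The following is a minimal sketch on a toy model, not the actual Catalyst code: a constraint is reduced to the list of column names it references, and the object and method names (AliasBlowUp, naiveVariants, validOnly, eagerVariant) are made up for illustration.

```scala
// Toy model, not Spark code: a constraint is just the list of column names it references.
object AliasBlowUp {
  type Constraint = Vector[String]

  // Naive propagation: every aliased column may independently stay as-is or be
  // replaced by its alias, so each aliased reference doubles the number of variants.
  def naiveVariants(constraint: Constraint, aliases: Map[String, String]): Set[Constraint] =
    constraint.foldLeft(Set(Vector.empty[String])) { (partials, column) =>
      val choices = column :: aliases.get(column).toList
      for (partial <- partials; choice <- choices) yield partial :+ choice
    }

  // Only variants that reference nothing but the node's output columns survive later filtering.
  def validOnly(variants: Set[Constraint], output: Set[String]): Set[Constraint] =
    variants.filter(_.forall(output.contains))

  // Eager alternative: substitute every alias at once and never build the other variants.
  def eagerVariant(constraint: Constraint, aliases: Map[String, String]): Constraint =
    constraint.map(column => aliases.getOrElse(column, column))

  def main(args: Array[String]): Unit = {
    val n = 10
    val columns = (1 to n).map(i => s"a$i").toVector
    val aliases = columns.map(c => c -> c.replace('a', 'b')).toMap
    val output = aliases.values.toSet

    val naive = naiveVariants(columns, aliases)
    println(s"naive variants: ${naive.size}")
    println(s"kept after filtering: ${validOnly(naive, output).size}")
    println(s"eager result: ${eagerVariant(columns, aliases).mkString(" + ")} > 0")
  }
}
```

With n = 10 the naive enumeration builds 2^10 = 1024 variants and the filter keeps exactly one of them, which is the same constraint the eager substitution produces directly.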

Does this PR introduce any user-facing change?

No

How was this patch tested?

New and existing UTs.
Manually verified the performance gain using an example provided in #30185:

  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches =
      Batch("InferAndPushDownFilters", FixedPoint(100),
        PushPredicateThroughJoin,
        PushPredicateThroughNonJoin,
        InferFiltersFromConstraints,
        CombineFilters,
        SimplifyBinaryComparison,
        BooleanSimplification,
        PruneFilters) :: Nil
  }

  test("benchmark") {
    val tr = LocalRelation(
      'a.int, 'b.int, 'c.int, 'd.int, 'e.int,
      'f.int, 'g.int, 'h.int, 'i.int, 'j.int,
      'k.int, 'l.int, 'm.int, 'n.int)

    val plan = tr.select('a, 'b, 'c, 'd, 'e, 'f, 'g, 'h, 'i, 'j, 'k, 'l, 'm, 'n,
      CaseWhen(Seq(('a.attr + 'b.attr + 'c.attr + 'd.attr + 'e.attr + 'f.attr + 'g.attr
        + 'h.attr + 'i.attr + 'j.attr + 'k.attr + 'l.attr + 'm.attr + 'n.attr > Literal(1),
        Literal(1)),
        ('a.attr + 'b.attr + 'c.attr + 'd.attr + 'e.attr + 'f.attr + 'g.attr + 'h.attr +
          'i.attr + 'j.attr + 'k.attr + 'l.attr + 'm.attr + 'n.attr > Literal(2), Literal(2))),
        Option(Literal(0))).as("JoinKey1")
    ).select('a.attr.as("a1"), 'b.attr.as("b1"), 'c.attr.as("c1"),
      'd.attr.as("d1"), 'e.attr.as("e1"), 'f.attr.as("f1"),
      'g.attr.as("g1"), 'h.attr.as("h1"), 'i.attr.as("i1"),
      'j.attr.as("j1"), 'k.attr.as("k1"), 'l.attr.as("l1"),
      'm.attr.as("m1"), 'n.attr.as("n1"), 'JoinKey1.attr.as("cf1"),
      'JoinKey1.attr).select('a1, 'b1, 'c1, 'd1, 'e1, 'f1, 'g1, 'h1, 'i1, 'j1, 'k1,
      'l1, 'm1, 'n1, 'cf1, 'JoinKey1).join(tr, condition = Option('a.attr <=> 'JoinKey1.attr))

    val t1 = System.currentTimeMillis()
    Optimize.execute(plan.analyze)
    val t2 = System.currentTimeMillis()

    val timeTaken = t2 - t1
    // scalastyle:off println
    println(s"Time taken to optimize = $timeTaken ms")
    // scalastyle:on println
  }

The optimization time for this was reduced from 25s to 0.2s.

SparkQA · 2020-12-22T23:18:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37840/

SparkQA · 2020-12-22T23:47:12Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37840/

SparkQA · 2020-12-23T00:01:07Z

Test build #133242 has finished for PR 30894 at commit 08dc723.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-23T01:00:01Z

Test build #133240 has finished for PR 30894 at commit 03c6e56.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-23T04:48:00Z

Test build #133262 has finished for PR 30894 at commit 4bd8c06.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-23T05:23:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37860/

SparkQA · 2020-12-23T05:52:47Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37860/

SparkQA · 2020-12-23T21:50:49Z

Test build #133318 has finished for PR 30894 at commit e554605.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ConstraintPropagationSuite extends SparkFunSuite with PlanTest with PrivateMethodTester

tanelk · 2020-12-24T09:56:01Z

cc @maropu , @HyukjinKwon

HyukjinKwon · 2021-01-03T04:12:14Z

cc @gengliangwang too

gengliangwang · 2021-01-04T13:29:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+    }
+
+    /**
+    We keep the child constraints and equality between original and aliased attributes,


Since allConstraints is initially assigned as child.constraints, why do we need to keep the child constraints?

Because we might have filtered out some of them.

Could you show an example for why we should keep child.constraints?

For example this test would fail:

  test("SPARK-33152: infer from child constraint") {
    val plan = LocalRelation('a.int, 'b.int)
      .where('a === 'b)
      .select('a, ('b + 1) as 'b2)
      .analyze

    verifyConstraints(plan.constraints, ExpressionSet(Seq(
      IsNotNull(resolveColumn(plan, "a")),
      resolveColumn(plan, "a") + 1 <=> resolveColumn(plan, "b2")
    )))
  }

To infer a + 1 <=> b2, it would need the child constraints.

gengliangwang · 2021-01-04T13:31:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+    so [[ConstraintHelper.inferAdditionalConstraints]] would have the full information available.
+     */
+    projectList.foreach {
+      case alias @ Alias(expr, _) =>


Maybe we just need to handle a @ Alias(l: Literal, _) here?

You might be right that only the literal aliases are currently used, but the previous code kept all aliases (lines 180 & 187), and anyone who later wants to improve inferAdditionalConstraints might need them.
Removing non-literal aliases would be a marginal performance improvement, so I would rather keep the existing behavior.

gengliangwang · 2021-01-04T13:31:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+
+    // For each expression collect its aliases
+    val aliasMap = projectList.collect{
+      case alias @ Alias(expr, _) if !expr.foldable => (expr.canonicalized, alias)


We need to filter the non-deterministic expressions here.

Yes, you are correct. I'll add this as a safety feature.

SparkQA · 2021-01-05T04:03:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38215/

SparkQA · 2021-01-05T04:34:57Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38215/

SparkQA · 2021-01-05T06:46:01Z

Test build #133626 has finished for PR 30894 at commit 8cf2da9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-05T07:51:05Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38238/

SparkQA · 2021-01-05T08:22:11Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38238/

SparkQA · 2021-01-05T10:45:37Z

Test build #133649 has finished for PR 30894 at commit 0c156f7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hammertank · 2021-02-08T08:25:56Z

I checked the 2 GitHub Actions errors:

1. CliSuite.SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path
   This is a timeout error and cannot be reproduced in my local environment. The test downloads a jar file from the Maven repository, so it may have been a temporary network issue.
2. SparkScriptTransformationSuite.SPARK-33934: Add SparkFile's root dir to env property PATH
   Caused by PR [SPARK-33934][SQL] Add SparkFile's root dir to env property PATH #30973 and fixed by PR [SPARK-33934][SQL][FOLLOW-UP] Use SubProcessor's exit code as assert condition to fix flaky test #31046.

Looking forward to seeing this PR merged.

SparkQA · 2021-03-31T15:11:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41354/

SparkQA · 2021-03-31T15:20:28Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41354/

SparkQA · 2021-03-31T18:50:17Z

Test build #136771 has finished for PR 30894 at commit f1332eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tanelk · 2021-05-24T15:39:15Z

@cloud-fan, there have been some reviews already, but perhaps you could also take a look.

github-actions · 2021-09-02T00:11:17Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

boneanxs · 2022-09-08T08:56:25Z

Hi @tanelk @HyukjinKwon @maropu, are there any updates on this PR? We have also hit this issue in many Spark jobs.

Refactor getAllValidConstraints

03c6e56

github-actions bot added the SQL label Dec 22, 2020

tanelk added 2 commits December 22, 2020 22:40

2.13 compatability

119e057

Merge branch 'master' into SPARK-33152_constraint_propagation

08dc723

Improve test

4bd8c06

Assert validConstraints size

e554605

gengliangwang reviewed Jan 4, 2021

tanelk added 3 commits January 4, 2021 16:31

Address comments

22546e2

Add UT

ccabb69

Add UT

8cf2da9

maropu reviewed Jan 5, 2021

Address comments

0420514

Address comments

0c156f7

Merge branch 'master' into SPARK-33152_constraint_propagation

f1332eb

wankunde mentioned this pull request May 12, 2021

[SPARK-35379][SQL]Improve InferFiltersFromConstraints rule performance when parsing spark sql #32514

Closed

github-actions bot added the Stale label Sep 2, 2021

github-actions bot closed this Sep 3, 2021
