[SPARK-33486][SQL] Collapse Partial and Final physical aggregation nodes together whenever possible #30426

prakharjain09 · 2020-11-19T09:06:27Z

What changes were proposed in this pull request?

This PR tries to reduce the number of physical aggregation nodes by collapsing the PARTIAL and the FINAL aggregation nodes together when there is no Exchange between them.

Example - consider the following query:

SELECT sum(t2.col1), max(t2.col2), t1.col1, t1.col2
FROM t1, t2
WHERE t1.col1 = t2.col1
GROUP BY t1.col1, t1.col2

Current plan:

  == Physical Plan ==
  *(5) HashAggregate(keys=[col1#7, col2#8], functions=[sum(cast(col1#18 as bigint)), max(col2#19)], output=[sum(col1)#140L, max(col2)#141, col1#7, col2#8])
  +- *(5) HashAggregate(keys=[col1#7, col2#8], functions=[partial_sum(cast(col1#18 as bigint)), partial_max(col2#19)], output=[col1#7, col2#8, sum#148L, max#149])
     +- *(5) SortMergeJoin [col1#7], [col1#18], Inner
        :- *(2) Sort [col1#7 ASC NULLS FIRST], false, 0
        :  +- Exchange hashpartitioning(col1#7, 5), true, [id=#644]
        :     +- *(1) Project [value#2 AS col1#7, (value#2 % 10) AS col2#8]
        :        +- *(1) SerializeFromObject [input[0, int, false] AS value#2]
        :           +- Scan[obj#1]
        +- *(4) Sort [col1#18 ASC NULLS FIRST], false, 0
           +- Exchange hashpartitioning(col1#18, 5), true, [id=#653]
              +- *(3) Project [value#13 AS col1#18, (value#13 % 10) AS col2#19]
                 +- *(3) SerializeFromObject [input[0, int, false] AS value#13]
                    +- Scan[obj#12]

The above plan can be optimized to the following:

  == Physical Plan ==
  *(5) HashAggregate(keys=[col1#7, col2#8], functions=[sum(cast(col1#18 as bigint)), max(col2#19)], output=[sum(col1)#157L, max(col2)#158, col1#7, col2#8])
  +- *(5) SortMergeJoin [col1#7], [col1#18], Inner
     :- *(2) Sort [col1#7 ASC NULLS FIRST], false, 0
     :  +- Exchange hashpartitioning(col1#7, 5), true, [id=#727]
     :     +- *(1) Project [value#2 AS col1#7, (value#2 % 10) AS col2#8]
     :        +- *(1) SerializeFromObject [input[0, int, false] AS value#2]
     :           +- Scan[obj#1]
     +- *(4) Sort [col1#18 ASC NULLS FIRST], false, 0
        +- Exchange hashpartitioning(col1#18, 5), true, [id=#736]
           +- *(3) Project [value#13 AS col1#18, (value#13 % 10) AS col2#19]
              +- *(3) SerializeFromObject [input[0, int, false] AS value#13]
                 +- Scan[obj#12]
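
The rewrite from the current plan to the optimized plan can be sketched on a toy plan tree. This is only a minimal illustration in Python, not the rule added by this PR: the `Node` class, the `mode` strings, and `collapse_aggregates` are hypothetical names, the merged node is assumed to run in a "complete" mode, and a real rule would also need to verify that the grouping keys and aggregate functions of the two nodes match.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy physical plan node (hypothetical, not a Spark class)."""
    op: str                       # e.g. "HashAggregate", "Exchange", "SortMergeJoin"
    mode: Optional[str] = None    # "partial", "final" or "complete" for aggregates
    children: List["Node"] = field(default_factory=list)


def collapse_aggregates(node: Node) -> Node:
    """Merge a final aggregate with a partial aggregate that is its direct child.

    If an Exchange sits between the two, the final node's child is the
    Exchange, so the pattern does not match and the plan is left alone.
    """
    children = [collapse_aggregates(c) for c in node.children]
    if (
        node.op == "HashAggregate"
        and node.mode == "final"
        and len(children) == 1
        and children[0].op == "HashAggregate"
        and children[0].mode == "partial"
    ):
        # Assumption: one aggregate in "complete" mode reads the partial node's input.
        return Node("HashAggregate", "complete", children[0].children)
    return Node(node.op, node.mode, children)


def ops(node: Node) -> List[str]:
    """Pre-order list of operator names, with the mode for aggregates."""
    label = f"{node.op}({node.mode})" if node.mode else node.op
    return [label] + [name for c in node.children for name in ops(c)]


# Shape of the example plan: both aggregates sit directly above the join.
join = Node("SortMergeJoin", children=[Node("Exchange"), Node("Exchange")])
no_shuffle = Node("HashAggregate", "final", [Node("HashAggregate", "partial", [join])])

# Common shape with a shuffle between the two aggregation phases.
with_shuffle = Node(
    "HashAggregate", "final",
    [Node("Exchange", children=[Node("HashAggregate", "partial", [Node("Scan")])])],
)

print(ops(collapse_aggregates(no_shuffle)))
print(ops(collapse_aggregates(with_shuffle)))
```

The first tree loses one aggregate node, while the second is returned unchanged because the Exchange separates the two phases.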

Why are the changes needed?

This change removes the redundant aggregation node, which should help improve performance.
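
Why the extra node is redundant can be checked with a small experiment: when all rows of a group already sit in the same partition, a partial aggregation followed by a final merge gives the same answer as a single pass. The sketch below is plain Python with hypothetical helper names (`partial_agg`, `final_agg`, `complete_agg`) and made-up rows, using the `sum` and `max` aggregates from the example query; it is not Spark code.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, int]              # (group key, value to sum, value to max)
Buffer = Dict[str, Tuple[int, int]]     # key -> (running sum, running max)


def partial_agg(rows: Iterable[Row]) -> Buffer:
    """Phase 1: build per-key aggregation buffers from raw rows."""
    buf: Buffer = {}
    for key, s, m in rows:
        if key in buf:
            cur_sum, cur_max = buf[key]
            buf[key] = (cur_sum + s, max(cur_max, m))
        else:
            buf[key] = (s, m)
    return buf


def final_agg(buffers: Iterable[Buffer]) -> Buffer:
    """Phase 2: merge buffers produced by one or more partial aggregations."""
    out: Buffer = {}
    for buf in buffers:
        for key, (s, m) in buf.items():
            if key in out:
                cur_sum, cur_max = out[key]
                out[key] = (cur_sum + s, max(cur_max, m))
            else:
                out[key] = (s, m)
    return out


def complete_agg(rows: Iterable[Row]) -> Buffer:
    """Single pass over raw rows. With one input partition this is the same
    computation as phase 1, which is why phase 2 adds nothing in that case."""
    return partial_agg(rows)


# One partition that already holds every row of each group
# (as after the join in the example, with no Exchange above it).
partition: List[Row] = [("a", 1, 10), ("b", 2, 5), ("a", 3, 7), ("b", 4, 9)]

two_phase = final_agg([partial_agg(partition)])
one_phase = complete_agg(partition)
print(two_phase == one_phase, one_phase)
```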

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

maropu · 2020-11-19T11:45:56Z

ok to test

SparkQA · 2020-11-19T11:54:15Z

Test build #131346 has finished for PR 30426 at commit 5965fb9.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

prakharjain09 · 2020-11-19T12:35:20Z

cc - @maropu @cloud-fan @dongjoon-hyun

maropu · 2020-11-19T12:35:35Z

I remember SPARK-12978 (#15945 and #10896); is this related to it? cc: @cloud-fan Btw, have you checked if this optimization could make some queries (e.g., TPCDS) faster? (I just want to know the actual performance numbers.)

SparkQA · 2020-11-19T12:42:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35950/

SparkQA · 2020-11-19T13:08:09Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35950/

SparkQA · 2020-11-19T13:18:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35951/

SparkQA · 2020-11-19T13:41:55Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35951/

SparkQA · 2020-11-19T14:32:40Z

Test build #131347 has finished for PR 30426 at commit 2c68fe3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

prakharjain09 · 2020-11-19T17:44:29Z

@maropu Thanks for pointing out the old PRs and JIRA. Yes, SPARK-12978 seems related to SPARK-33486.

> Btw, have you checked if this optimization could make some queries (e.g., TPCDS) faster?

I did an impact analysis on TPCDS at scale 100 and didn't find a noticeable improvement. In most TPCDS queries, the first HashAggregate (HA) reduces the row count significantly, so the second HA doesn't take much time after that.

But we have seen some good improvements in some customer queries - Specifically when HA-1 doesn't reduce rows significantly.
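
The point about HA-1 can be illustrated by counting how many rows a partial aggregation emits: one row per distinct grouping key in each partition. The sketch below uses made-up data and hypothetical helpers (`partial_output_rows`, `reduction_ratio`); the numbers are illustrative only and are not measurements from this PR.

```python
from typing import Hashable, List, Sequence


def partial_output_rows(partitions: Sequence[Sequence[Hashable]]) -> int:
    """Rows emitted by a partial aggregation: one per distinct key per partition."""
    return sum(len(set(keys)) for keys in partitions)


def reduction_ratio(partitions: Sequence[Sequence[Hashable]]) -> float:
    """Fraction of input rows removed by the partial aggregation."""
    total = sum(len(keys) for keys in partitions)
    return 1.0 - partial_output_rows(partitions) / total


# Low-cardinality grouping key: 10 distinct values over 1000 rows per partition.
low_card: List[List[int]] = [[i % 10 for i in range(1000)] for _ in range(4)]

# Near-primary-key grouping: every row in a partition has its own key.
near_unique: List[List[int]] = [
    list(range(p * 1000, (p + 1) * 1000)) for p in range(4)
]

print(f"low cardinality: {reduction_ratio(low_card):.2%} of rows removed")
print(f"near unique:     {reduction_ratio(near_unique):.2%} of rows removed")
```

In the second case the partial node hashes every row yet passes all of them on, so the final node repeats the same work on the same number of rows.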

SparkQA · 2020-11-19T17:45:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35964/

SparkQA · 2020-11-19T18:16:11Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35964/

SparkQA · 2020-11-19T20:39:08Z

Test build #131360 has finished for PR 30426 at commit e9a25d9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-20T13:44:04Z

Test build #131425 has finished for PR 30426 at commit dfad4fc.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-20T14:24:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36031/

SparkQA · 2020-11-20T14:52:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36033/

SparkQA · 2020-11-20T14:52:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36031/

SparkQA · 2020-11-20T15:17:20Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36033/

SparkQA · 2020-11-20T19:26:24Z

Test build #131427 has finished for PR 30426 at commit e7f326a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class GetShufflePushMergerLocations(numMergersNeeded: Int, hostsToFilter: Set[String])
case class RemoveShufflePushMergerLocation(host: String) extends ToBlockManagerMaster
abstract class LikeAllBase extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant
case class LikeAll(child: Expression, patterns: Seq[UTF8String]) extends LikeAllBase
case class NotLikeAll(child: Expression, patterns: Seq[UTF8String]) extends LikeAllBase
case class ParseUrl(children: Seq[Expression], failOnError: Boolean = SQLConf.get.ansiEnabled)

prakharjain09 · 2020-11-23T07:31:52Z

@maropu @cloud-fan Gentle reminder - Please review the changes and provide your feedback.

maropu · 2020-11-24T01:11:07Z

> But we have seen some good improvements in some customer queries - Specifically when HA-1 doesn't reduce rows significantly.

Yea, I checked TPCDS performance with this change again myself, but I couldn't find any improvement. So, could you give us a concrete example of how much it will improve performance? This change can make the rules more complicated, so I think we need to consider the tradeoff between complexity and performance improvements.

prakharjain09 · 2020-11-27T06:27:06Z

> So, could you give us a concrete example of how much it will improve performance?

@maropu We have seen customer queries where the aggregation happens on keys that are close to primary keys, i.e. nearly unique. In those scenarios the partial aggregation barely reduces the row count, so it makes sense to remove the redundant aggregation operator, which otherwise only adds to the execution time.

abmodi · 2020-12-04T12:50:16Z

We have also seen this use case with customers who aggregate on keys that are close to primary keys.

github-actions · 2021-03-15T00:48:40Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

prakharjain09 added 3 commits November 19, 2020 12:17

Added CollapseAggregates rule in physical planning

d60fb9f

Merge remote-tracking branch 'oss/master' into merge-aggregates

5203571

plan files updated

5965fb9

github-actions bot added the SQL label Nov 19, 2020

fix stylecheck

2c68fe3

Merge remote-tracking branch 'oss/master' into SPARK-33486-collapse-aggregates

23f120d

prakharjain09 added 2 commits November 19, 2020 20:35

updated golden output files

3736d19

test fixes

e9a25d9

test fixes

a56846d

prakharjain09 force-pushed the SPARK-33486-collapse-aggregates branch from dfad4fc to a56846d Compare November 20, 2020 14:07

Merge remote-tracking branch 'oss/master' into SPARK-33486-collapse-aggregates

e7f326a

github-actions bot added the Stale label Mar 15, 2021

github-actions bot closed this Mar 16, 2021


4 participants