-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-33486][SQL] Collapse Partial and Final physical aggregation nodes together whenever possible #30426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33486][SQL] Collapse Partial and Final physical aggregation nodes together whenever possible #30426
Conversation
|
ok to test |
|
Test build #131346 has finished for PR 30426 at commit
|
|
I remember SPARK-12978 (#15945 and #10896) and is this related to it? cc: @cloud-fan Btw, have you checked if this optimization could make some queries (e.g., TPCDS) faster? (I just want to know actual performance numbers) |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #131347 has finished for PR 30426 at commit
|
|
@maropu Thanks for pointing out to old PR and jirs - Yes SPARK-12978 seems related to SPARK-33486.
I did impact analysis on TPCDS 100 scale and didn't find noticeable improvement - In TPCDS at most of the places, the 1st HashAggregate (HA) reduces rows significantly and the 2nd HA doesn't take a lot of time after that. But we have seen some good improvements in some customer queries - Specifically when HA-1 doesn't reduce rows significantly. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #131360 has finished for PR 30426 at commit
|
|
Test build #131425 has finished for PR 30426 at commit
|
dfad4fc to
a56846d
Compare
|
Kubernetes integration test starting |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test status failure |
|
Test build #131427 has finished for PR 30426 at commit
|
|
@maropu @cloud-fan Gentle reminder - Please review the changes and provide your feedback. |
Yea, I've checked TPCDS performances w/this change again by myself, but I couldn't find any improvement. So, could you give us a concrete example of how much it will improve performance? This change can make rules complicated, so I think we need to consider the tradeoff between complexity and performance improvements. |
@maropu We have seen customer queries where Aggregation happens on close to primary keys. In those scenarios, it makes complete sense to remove redundant Aggregation operator as it will unnecessarily increase the execution time. |
|
We have also seen the use case with customers when they do aggregation on close to primary keys. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This PR tries to reduce the number of physical aggregation nodes by collapsing the PARTIAL and the FINAL aggregation nodes together when there is no Exchange between them.
Example - consider the following query:
Current plan:
The above plan can be optimized to following:
Why are the changes needed?
This change removed the unrequired Aggregation node and so will help in improving performance.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added UTs.