
Conversation

@maropu (Member) commented Jan 14, 2020

What changes were proposed in this pull request?

This PR intends to add filter information to the explain output of aggregates (a follow-up of #26656).

Without this PR:

scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").explain(true)
== Parsed Logical Plan ==
'Aggregate ['k], ['k, unresolvedalias('SUM('v, ('v > 3)), None)]
+- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==
k: int, sum(v): bigint
Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) AS sum(v)#3L]
+- SubqueryAlias `default`.`t`
   +- Relation[k#0,v#1] parquet

== Optimized Logical Plan ==
Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) AS sum(v)#3L]
+- Relation[k#0,v#1] parquet

== Physical Plan ==
HashAggregate(keys=[k#0], functions=[sum(cast(v#1 as bigint))], output=[k#0, sum(v)#3L])
+- Exchange hashpartitioning(k#0, 200), true, [id=#20]
   +- HashAggregate(keys=[k#0], functions=[partial_sum(cast(v#1 as bigint))], output=[k#0, sum#7L])
      +- *(1) ColumnarToRow
         +- FileScan parquet default.t[k#0,v#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int,v:int>


scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").show()
+---+------+                                                                    
|  k|sum(v)|
+---+------+
+---+------+

With this PR:

scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").explain(true)
== Parsed Logical Plan ==
'Aggregate ['k], ['k, unresolvedalias('SUM('v, ('v > 3)), None)]
+- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==
k: int, sum(v) FILTER (v > 3): bigint
Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) filter (v#1 > 3) AS sum(v) FILTER (v > 3)#5L]
+- SubqueryAlias `default`.`t`
   +- Relation[k#0,v#1] parquet

== Optimized Logical Plan ==
Aggregate [k#0], [k#0, sum(cast(v#1 as bigint)) filter (v#1 > 3) AS sum(v) FILTER (v > 3)#5L]
+- Relation[k#0,v#1] parquet

== Physical Plan ==
HashAggregate(keys=[k#0], functions=[sum(cast(v#1 as bigint))], output=[k#0, sum(v) FILTER (v > 3)#5L])
+- Exchange hashpartitioning(k#0, 200), true, [id=#20]
   +- HashAggregate(keys=[k#0], functions=[partial_sum(cast(v#1 as bigint)) filter (v#1 > 3)], output=[k#0, sum#9L])
      +- *(1) ColumnarToRow
         +- FileScan parquet default.t[k#0,v#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int,v:int>


scala> sql("select k, SUM(v) filter (where v > 3) from t group by k").show()
+---+---------------------+                                                     
|  k|sum(v) FILTER (v > 3)|
+---+---------------------+
+---+---------------------+

Why are the changes needed?

For better usability.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually.
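
For reference, the core of the change is a small tweak to how AggregateExpression renders itself. Below is a minimal sketch of the final shape, pieced together from the review hunks later in this thread; the prefix handling is assumed surrounding code rather than part of this diff, and the rendered syntax follows the filter (where ...) form settled on during review:

override def toString: String = {
  val prefix = mode match {
    case Partial => "partial_"
    case PartialMerge => "merge_"
    case Final | Complete => ""
  }
  val aggFuncStr = prefix + aggregateFunction.toAggString(isDistinct)
  filter match {
    // e.g. "sum(cast(v#1 as bigint)) filter (where (v#1 > 3))"
    case Some(predicate) => s"$aggFuncStr filter (where $predicate)"
    case _ => aggFuncStr
  }
}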

@maropu (Member Author) commented Jan 14, 2020

How about this? @beliefer @cloud-fan

- prefix + aggregateFunction.toAggString(isDistinct)
+ val aggFuncStr = prefix + aggregateFunction.toAggString(isDistinct)
+ mode match {
+   case Partial | Complete if filter.isDefined =>

Contributor:

Although we made the filter evaluated in the first Aggregate, I think this should just check whether the filter is defined.

Member Author:

Doing so shows the filter in the second aggregate of physical plans, like this:

== Physical Plan ==
HashAggregate(keys=[k#0], functions=[sum(cast(v#1 as bigint)) filter ((v#1 > 3))], output=[k#0, sum(v) FILTER ((v > 3))#44L])
+- Exchange hashpartitioning(k#0, 200), true, [id=#154]
   +- HashAggregate(keys=[k#0], functions=[partial_sum(cast(v#1 as bigint)) filter ((v#1 > 3))], output=[k#0, sum#48L])
      +- *(1) ColumnarToRow
         +- FileScan parquet default.t[k#0,v#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int,v:int>

Contributor:

I mean the case where the mode is PartialMerge.
If we only need to show the filter in physical plans after the rewrite, this is OK.

@SparkQA commented Jan 14, 2020
Test build #116684 has finished for PR 27198 at commit e54839b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 14, 2020

Test build #116689 has finished for PR 27198 at commit cc658c9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Jan 14, 2020

retest this please

override def sql: String = {
  val aggFuncStr = aggregateFunction.sql(isDistinct)
  mode match {
    case Partial | Complete if filter.isDefined =>
Contributor:

another idea: if the filter is not used under some modes, can we drop it when we do things like aggExprs.map(_.copy(mode = ABC))? Then here we can blindly print the filter if it's there.
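
A minimal sketch of that idea (the copy site and the PartialMerge value below are illustrative, not lines from this PR): drop the filter wherever aggregate expressions are copied into a mode that never evaluates it, so printing can be unconditional.

// Hypothetical copy site in the planner: strip the filter when switching
// to a mode (e.g. PartialMerge) that never evaluates it.
aggExprs.map(_.copy(mode = PartialMerge, filter = None))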

@dongjoon-hyun (Member) commented Jan 15, 2020

Yes, it might be cleaner, but the AS-IS PR also looks safer because it is a read-only update.

Contributor:

Yes, it's safe, but the part that confuses me is why only the partial and complete modes respect the filter. I still need to look at AggUtils, where we set the mode.

Member Author:

Yea, ok. I'll check the approach, too.

@SparkQA commented Jan 14, 2020

Test build #116693 has finished for PR 27198 at commit cc658c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 15, 2020

Test build #116756 has finished for PR 27198 at commit 753e9e1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Jan 15, 2020

retest this please

- prefix + aggregateFunction.toAggString(isDistinct)
+ val aggFuncStr = prefix + aggregateFunction.toAggString(isDistinct)
+ filter match {
+   case Some(predicate) => s"$aggFuncStr filter $predicate"

Contributor:

shall we follow the SQL syntax? FILTER (WHERE $predicate)

Member Author:

ok

override def sql: String = {
  val aggFuncStr = aggregateFunction.sql(isDistinct)
  filter match {
    case Some(predicate) => s"$aggFuncStr FILTER ${predicate.sql}"

Contributor:

ditto

// Aggregate filters are applicable only in the partial/complete modes;
// this method removes them otherwise.
case Partial | Complete => ae
case _ => ae.copy(filter = None)

Contributor:

We can also simplify AggregateExpression.references now
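
Presumably something like the sketch below, assuming the pre-PR references special-cased the filter by mode (the bodies here are assumptions, not code from this thread):

override def references: AttributeSet = {
  val aggAttributes = mode match {
    case Partial | Complete => aggregateFunction.references
    case PartialMerge | Final => AttributeSet(aggregateFunction.aggBufferAttributes)
  }
  // Filters are now stripped in the modes where they do not apply, so their
  // references can be added unconditionally (empty when filter is None).
  aggAttributes ++ filter.map(_.references).getOrElse(AttributeSet.empty)
}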

@SparkQA commented Jan 15, 2020

Test build #116767 has finished for PR 27198 at commit 753e9e1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Jan 15, 2020

Test build #116783 has finished for PR 27198 at commit 9d4ba8b.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.

- prefix + aggregateFunction.toAggString(isDistinct)
+ val aggFuncStr = prefix + aggregateFunction.toAggString(isDistinct)
+ filter match {
+   case Some(predicate) => s"$aggFuncStr filter (where $predicate)"

Member:

If you don't mind, filter (where -> FILTER (WHERE?

Currently, the generated result mixes filter (where NOT and FILTER (WHERE (NOT, as below. Maybe consistent output would be better?

sum(salary#x) filter (where NOT exists#x [dept_id#x]) AS sum(salary) FILTER (WHERE (NOT exists(dept_id)))#x]

Member Author:

yea, ok.

@SparkQA commented Jan 15, 2020

Test build #116796 has finished for PR 27198 at commit 9d4ba8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Jan 16, 2020

@dongjoon-hyun

@dongjoon-hyun (Member) left a comment

+1, LGTM.

@SparkQA commented Jan 16, 2020

Test build #116799 has finished for PR 27198 at commit 275ca45.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented Jan 16, 2020

retest this please

@SparkQA commented Jan 16, 2020

Test build #116800 has finished for PR 27198 at commit f43c981.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu closed this in a3a42b3 on Jan 16, 2020.

@maropu (Member Author) commented Jan 16, 2020

Merged to master. Thanks, @cloud-fan & @dongjoon-hyun !

@SparkQA commented Jan 16, 2020

Test build #116801 has finished for PR 27198 at commit f43c981.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
