
Conversation

huaxingao
Contributor

What changes were proposed in this pull request?

Remove fullOutput from RowDataSourceScanExec

Why are the changes needed?

RowDataSourceScanExec requires the full output instead of the scan output after column pruning. However, in the v2 code path we no longer have the full output, so we just pass the pruned output. RowDataSourceScanExec.fullOutput is therefore meaningless and should be removed.
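A toy sketch of the structural change (field names follow the PR, but this is a simplified model, not Spark's real node, which carries more fields such as `rdd` and `filters`): before, the node stored the relation's full output plus the indices of the required columns and derived its output from them; after, the pruned output is passed in directly.

```scala
// Simplified model of the change (illustrative, not Spark's actual API).
case class Attr(name: String, id: Int)

// Before: output derived by indexing the full output with the required indices.
case class ScanBefore(fullOutput: Seq[Attr], requiredColumnsIndex: Seq[Int]) {
  def output: Seq[Attr] = requiredColumnsIndex.map(fullOutput)
}

// After: the pruned output is the only thing the node keeps.
case class ScanAfter(output: Seq[Attr])

val full   = Seq(Attr("a", 0), Attr("b", 1), Attr("c", 2))
val before = ScanBefore(full, Seq(0, 2))
val after  = ScanAfter(Seq(Attr("a", 0), Attr("c", 2)))
assert(before.output == after.output) // the derived output never needed more than this
```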

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Aug 12, 2020

Test build #127371 has finished for PR 29415 at commit 9558823.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

cc @cloud-fan @MaxGekk @viirya


/** Physical plan node for scanning data from a relation. */
case class RowDataSourceScanExec(
    fullOutput: Seq[Attribute],
Contributor

Can you find out the PR that added it? I can't quite remember why we have it.

Contributor Author

It was introduced in #18600 for plan equality comparison.
To check my change, I manually printed out the two canonicalized plans for df1 and df2 in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/RowDataSourceStrategySuite.scala#L68.
Before my change:

*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#25]
   +- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
      +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [none#0,none#1] PushedFilters: [], ReadSchema: struct<none:int,none:int>

*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#52]
   +- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
      +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [none#0,none#2] PushedFilters: [], ReadSchema: struct<none:int,none:int>

After my change:

*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#25]
   +- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
      +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [A#0,B#1] PushedFilters: [], ReadSchema: struct<A:int,B:int>

*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#52]
   +- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
      +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [A#0,C#2] PushedFilters: [], ReadSchema: struct<A:int,C:int>

Member

@viirya viirya left a comment

fullOutput seems to have no actual usage except for plan comparison. If we can make sure we don't break that, it looks OK to remove fullOutput.

// Don't care about `rdd` and `tableIdentifier` when canonicalizing.
override def doCanonicalize(): SparkPlan =
  copy(
    fullOutput.map(QueryPlan.normalizeExpressions(_, fullOutput)),
Contributor

Don't we need to normalize output now?

Contributor

FileSourceScanExec does it as well.

Contributor

We may need to add requiredSchema to RowDataSourceScanExec.

Contributor Author

Sorry, I didn't know that we need to use the normalized exprId in the canonicalized plan. If we do, then we probably can't remove fullOutput from RowDataSourceScanExec, because using the normalized pruned output would cause problems. For example, in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/RowDataSourceStrategySuite.scala#L68, normalizing the pruned output gives none#0,none#1 for both df1 and df2, so both of them end up with exactly the same plan:

*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#110]
   +- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
      +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [none#0,none#1] PushedFilters: [], ReadSchema: struct<none:int,none:int>
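The collision can be sketched with a toy normalization function (illustrative only; Spark's real QueryPlan.normalizeExpressions works on Expression trees, not strings): each attribute is replaced by `none#i`, where `i` is its index in a reference attribute list.

```scala
// Toy exprId normalization (illustrative, not Spark's actual API): replace
// each attribute by "none#i", where i is its position in the reference list.
def normalize(attrs: Seq[String], reference: Seq[String]): Seq[String] =
  attrs.map(a => s"none#${reference.indexOf(a)}")

// Normalizing the pruned output against the FULL output preserves which
// column of the relation each attribute came from:
val v1 = normalize(Seq("A", "B"), Seq("A", "B", "C")) // Seq("none#0", "none#1")
val v2 = normalize(Seq("A", "C"), Seq("A", "B", "C")) // Seq("none#0", "none#2")
assert(v1 != v2) // the two scans stay distinguishable

// Normalizing against the pruned output itself collapses both scans:
val p1 = normalize(Seq("A", "B"), Seq("A", "B")) // Seq("none#0", "none#1")
val p2 = normalize(Seq("A", "C"), Seq("A", "C")) // Seq("none#0", "none#1")
assert(p1 == p2) // collision: different scans look identical
```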

Then df1.union(df2) takes the ReusedExchange code path, since the two plans compare equal:

== Physical Plan ==
Union
:- *(2) HashAggregate(keys=[a#0], functions=[min(b#1)], output=[a#0, min(b)#12])
:  +- Exchange hashpartitioning(a#0, 5), true, [id=#34]
:     +- *(1) HashAggregate(keys=[a#0], functions=[partial_min(b#1)], output=[a#0, min#28])
:        +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [A#0,B#1] PushedFilters: [], ReadSchema: struct<A:int,B:int>
+- *(4) HashAggregate(keys=[a#0], functions=[min(c#2)], output=[a#0, min(c)#24])
   +- ReusedExchange [a#0, min#30], Exchange hashpartitioning(a#0, 5), true, [id=#34]

The union result will be:

+---+------+
|  a|min(b)|
+---+------+
|  1|     2|
|  1|     2|
+---+------+

instead of:

+---+------+
|  a|min(b)|
+---+------+
|  1|     2|
|  1|     3|
+---+------+

Contributor

Yeah, that's why I propose adding requiredSchema, like FileSourceScanExec does. But I'm not sure how hard it is.
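The idea can be sketched with a toy canonical form (a simplified model, not the actual FileSourceScanExec code): exprIds get normalized away, but the pruned schema is kept as a field, so canonicalized scans that prune different columns still compare unequal.

```scala
// Toy canonical form (illustrative, not Spark's actual API): exprIds are
// normalized, but the required schema keeps the real column names, so scans
// that prune different columns remain distinguishable.
case class CanonicalScan(normalizedOutput: Seq[String], requiredSchema: Seq[String])

def canonicalize(prunedColumns: Seq[String]): CanonicalScan =
  CanonicalScan(
    prunedColumns.indices.map(i => s"none#$i"), // normalized against the pruned output
    prunedColumns)                              // schema keeps the column identities

val scan1 = canonicalize(Seq("A", "B"))
val scan2 = canonicalize(Seq("A", "C"))
// Identical normalized outputs, but different schemas, so no false plan reuse:
assert(scan1.normalizedOutput == scan2.normalizedOutput)
assert(scan1 != scan2)
```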

Contributor Author

@cloud-fan I added requiredSchema; could you please take a look to see if that's what you want?

@SparkQA

SparkQA commented Aug 13, 2020

Test build #127416 has finished for PR 29415 at commit bd58665.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Thanks, merging to master!

@cloud-fan cloud-fan closed this in 14003d4 Aug 14, 2020
@huaxingao
Contributor Author

Thanks! @cloud-fan @viirya

@huaxingao huaxingao deleted the rm_full_output branch August 14, 2020 15:07