[SPARK-40259][SQL] Support Parquet DSv2 in subquery plan merge #37711

peter-toth · 2022-08-29T16:46:54Z

What changes were proposed in this pull request?

After #32298 we were able to merge scalar subquery plans, but DSv2 sources couldn't benefit from that improvement.
DSv2 sources were not supported by default because SparkOptimizer.earlyScanPushDownRules builds different Scans in the logical plans before MergeScalarSubqueries is executed. Those Scans can have different pushed-down filters and aggregates and different column pruning, which prevents merging the plans.
I would not alter the order of the optimization rules, as MergeScalarSubqueries works better when the logical plans are better optimized (a plan is closer to its final logical form, e.g. InjectRuntimeFilter has already been executed). Instead, I propose a new interface that a Scan can implement to indicate whether it can be merged with another Scan, and to do the merge if it makes sense given the Scans' actual parameters.

This PR:

adds a new interface, SupportsMerge, that Scans can implement to indicate whether two Scans can be merged, and
adds an implementation of SupportsMerge to ParquetScan as the first DSv2 source. The merge only happens if the pushed-down data and partition filters and the pushed-down aggregates match.
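
To illustrate the merge rule described above, here is a minimal, self-contained sketch. It is not the code in this PR: `ToyScan` and its fields are hypothetical, and the `SupportsMerge` trait below is a simplified stand-in (the signature quoted in the review comments also takes a `table: SupportsRead` argument). It only shows the idea that two scans merge when their pushed-down filters and aggregates match, with the merged scan reading the union of the pruned columns.

```scala
import java.util.Optional

object SupportsMergeSketch {
  // Hypothetical, simplified stand-in for the proposed interface.
  trait SupportsMerge {
    def mergeWith(other: SupportsMerge): Optional[SupportsMerge]
  }

  // Toy scan: pruned columns, pushed-down filters, optional pushed-down aggregate.
  final case class ToyScan(
      columns: Set[String],
      filters: Set[String],
      pushedAggregate: Option[String]) extends SupportsMerge {

    override def mergeWith(other: SupportsMerge): Optional[SupportsMerge] = other match {
      // Merge only when the pushed-down filters and aggregates are identical.
      case o: ToyScan if filters == o.filters && pushedAggregate == o.pushedAggregate =>
        Optional.of[SupportsMerge](copy(columns = columns ++ o.columns))
      // Any other scan type, or any mismatch, means "cannot merge".
      case _ =>
        Optional.empty[SupportsMerge]()
    }
  }
}
```

With this toy model, two scans over the same filter but different columns merge into one scan that reads both columns, while scans with different filters are left alone.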

Why are the changes needed?

Scalar subquery merge can bring considerable performance improvement (see the original #32298 for the benchmarks) so DSv2 sources should also benefit from that feature.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

peter-toth · 2022-08-29T17:01:48Z

cc @cloud-fan, @sigmod, @singhpk234

github-actions · 2022-12-15T00:20:30Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

singhpk234 · 2022-09-05T11:30:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala

+          pushedDownAggEqual(o) &&
+          normalizedPartitionFilters == o.normalizedPartitionFilters &&
+          normalizedDataFilters == o.normalizedDataFilters) {
+        val builder = table.newScanBuilder(options).asInstanceOf[ParquetScanBuilder]


[question] Should we add an assertion that table.newScanBuilder returns an instance of ParquetScanBuilder?
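
One way to read this suggestion, as a sketch only: replace the bare `asInstanceOf` cast with a pattern match that fails with a descriptive message. The types below (`ScanBuilder`, `ParquetLikeScanBuilder`, `OtherScanBuilder`) are hypothetical stand-ins, not the Spark classes, and whether an exception or falling back to "no merge" is the right reaction is left open.

```scala
object BuilderCheckSketch {
  // Hypothetical stand-ins for the builder hierarchy.
  trait ScanBuilder
  final class ParquetLikeScanBuilder extends ScanBuilder
  final class OtherScanBuilder extends ScanBuilder

  // Narrow the builder type, failing with a clear message instead of a ClassCastException.
  def asParquetBuilder(builder: ScanBuilder): ParquetLikeScanBuilder = builder match {
    case b: ParquetLikeScanBuilder => b
    case other =>
      throw new IllegalStateException(
        s"Expected a ParquetLikeScanBuilder but got ${other.getClass.getName}")
  }
}
```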

singhpk234 · 2022-12-27T18:16:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala

+  private def pushedDownAggEqual(p: ParquetScan) = {
+    if (pushedAggregate.nonEmpty && p.pushedAggregate.nonEmpty) {
+      AggregatePushDownUtils.equivalentAggregations(pushedAggregate.get, p.pushedAggregate.get)
+    } else {
+      pushedAggregate.isEmpty && p.pushedAggregate.isEmpty
+    }
+  }


Should we move this to FileScan itself? OrcScan also has some duplicate code.
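
As a sketch of what a shared helper could look like: the comparison above is "both empty, or both defined and equivalent", which can be written once for any optional value. The name `bothEmptyOrEquivalent` is hypothetical, and where such a helper would live (FileScan or a utility object) is an open question.

```scala
object OptionEquivalenceSketch {
  // True when both sides are empty, or both are defined and `equivalent` holds.
  def bothEmptyOrEquivalent[A](left: Option[A], right: Option[A])(
      equivalent: (A, A) => Boolean): Boolean =
    (left, right) match {
      case (Some(l), Some(r)) => equivalent(l, r)
      case (None, None) => true
      case _ => false // exactly one side has a pushed-down aggregate
    }
}
```

In the snippet under review, `equivalent` would be the existing `AggregatePushDownUtils.equivalentAggregations` call.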

singhpk234 · 2022-12-27T18:18:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala

  }
+
+  override def mergeWith(other: SupportsMerge, table: SupportsRead): Optional[SupportsMerge] = {
+    if (other.isInstanceOf[ParquetScan]) {


This can be replaced with a case match.

singhpk234 · 2022-12-28T20:41:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala

+          normalizedPartitionFilters == o.normalizedPartitionFilters &&
+          normalizedDataFilters == o.normalizedDataFilters) {


[question] Should we just take the disjunction of the differing filters from the scans and run a boolean simplification on top of it, to handle scans with different partition and data filters?

Are we expecting some heuristic here, e.g. for deciding when combining the filters is useful?
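
A rough sketch of the idea in the question above, under stated assumptions: the predicate model (`Pred`, `Leaf`, `Or`, `AlwaysTrue`) is a toy, not Catalyst's expression tree, and the simplification covers only two trivial cases. The PR as written requires the filters to be equal. With a disjunction, the merged scan would return rows matching either filter, so each subquery would still have to apply its own filter on top of the merged scan.

```scala
object FilterDisjunctionSketch {
  // Toy predicate model (hypothetical).
  sealed trait Pred
  case object AlwaysTrue extends Pred
  final case class Leaf(sql: String) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // Combine the filters of two scans, with a minimal boolean simplification.
  def or(left: Pred, right: Pred): Pred = (left, right) match {
    case (AlwaysTrue, _) | (_, AlwaysTrue) => AlwaysTrue // TRUE OR x => TRUE
    case (l, r) if l == r => l                           // x OR x => x
    case (l, r) => Or(l, r)
  }
}
```

Whether the wider scan is cheaper than two narrower scans depends on how selective the filters are, which is where a heuristic would come in.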

singhpk234 · 2022-12-28T20:47:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala

+            if (mappedNewKeyGroupedPartitioning.map(_.map(_.canonicalized)) ==
+              cachedKeyGroupedPartitioning.map(_.map(_.canonicalized))) {
+              val mappedNewOrdering = newOrdering.map(_.map(mapAttributes(_, outputMap)))
+              if (mappedNewOrdering.map(_.map(_.canonicalized)) ==


[minor] Can we simplify the if/else structure here? Something like:

if (isKeyGroupPartitioningSame && isOrderingSame) {
  // merge scans and update cachedRelation
} else {
  None
}
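
A self-contained sketch of that flattened shape, assuming a toy model where expressions are strings and "canonicalized" means trimmed and lower-cased. All names here are hypothetical. The ordering check is a `lazy val` so it is only evaluated when the partitioning check passes, which keeps the evaluation order of the nested version.

```scala
object FlattenedChecksSketch {
  // Toy canonicalization (assumption): trim and lower-case each expression string.
  private def canonical(exprs: Option[Seq[String]]): Option[Seq[String]] =
    exprs.map(_.map(_.trim.toLowerCase))

  // Returns Some(merged) only when both partitioning and ordering match.
  def mergeIfCompatible[A](
      newPartitioning: Option[Seq[String]],
      cachedPartitioning: Option[Seq[String]],
      newOrdering: Option[Seq[String]],
      cachedOrdering: Option[Seq[String]])(merged: => A): Option[A] = {
    val isKeyGroupedPartitioningSame =
      canonical(newPartitioning) == canonical(cachedPartitioning)
    // Lazy: only compared if the partitioning check succeeded.
    lazy val isOrderingSame = canonical(newOrdering) == canonical(cachedOrdering)
    if (isKeyGroupedPartitioningSame && isOrderingSame) Some(merged) else None
  }
}
```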

peter-toth · 2022-12-29T11:18:31Z

Thanks for the comments @singhpk234!
Unfortunately this PR got closed due to lack of reviews and can't be reopened. I'm happy to open a new one and take your suggestions into account, but first it would be great if a Spark committer could confirm that the proposed SupportsMerge scan interface makes sense and someone were willing to give some feedback on the change. Any feedback is much appreciated, really.

@cloud-fan or @gengliangwang, would you be interested in this PR?

[SPARK-40259][SQL] Support Parquet DSv2 in subquery plan merge

bc3aecb

peter-toth mentioned this pull request Aug 29, 2022

[SPARK-40193][SQL] Merge subquery plans with different filters #37630

Closed

github-actions bot added the SQL label Aug 29, 2022

fix test

0eefece

github-actions bot added the Stale label Dec 15, 2022

github-actions bot closed this Dec 16, 2022

singhpk234 reviewed Dec 28, 2022
