[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer #30558

aokolnychyi · 2020-11-30T20:50:15Z

What changes were proposed in this pull request?

This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.

Why are the changes needed?

Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all of these rules will be specific to reads. One option is to rename the current batch to something more generic, but that would require changes in quite a few places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.
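To make the ordering constraint concrete, here is a minimal toy sketch. It does not use Spark's Catalyst classes: `Plan`, `Rule`, `Batch`, and every rule name below are hypothetical stand-ins, and a "plan" is just the list of rule names applied so far. It only illustrates that rules placed in the new batch see the plan after operator optimization and before anything that depends on stats.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's Catalyst API.
object BatchOrderSketch {
  type Plan = List[String]   // a "plan" here is just the trace of applied rule names
  type Rule = Plan => Plan

  final case class Batch(name: String, rules: Seq[Rule])

  // Builds a rule that records its own name in the trace.
  def rule(name: String): Rule = plan => plan :+ name

  // The data source rewrite batch sits between operator optimization
  // and the batches that depend on stats. It is empty by default.
  def batches(dataSourceRewriteRules: Seq[Rule]): Seq[Batch] = Seq(
    Batch("Operator Optimization", Seq(rule("simplify-expressions"), rule("push-down-filters"))),
    Batch("Data Source Rewrite Rules", dataSourceRewriteRules),
    Batch("Stats-Dependent Rules", Seq(rule("estimate-stats")))
  )

  // Runs every batch once, in order, applying its rules left to right.
  def optimize(plan: Plan, dataSourceRewriteRules: Seq[Rule] = Nil): Plan =
    batches(dataSourceRewriteRules).foldLeft(plan) { (current, batch) =>
      batch.rules.foldLeft(current)((p, r) => r(p))
    }
}
```

With a single hypothetical rewrite rule, `BatchOrderSketch.optimize(Nil, Seq(BatchOrderSketch.rule("build-write")))` yields a trace in which `build-write` appears after `push-down-filters` and before `estimate-stats`.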

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The change is trivial and SPARK-23889 will depend on it.

aokolnychyi · 2020-11-30T20:52:52Z

cc @dbtsai @dongjoon-hyun @rdblue @sunchao @viirya @cloud-fan @HeartSaVioR

aokolnychyi · 2020-11-30T21:03:19Z

sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala

+   *
+   * Note that this may NOT depend on the `optimizer` function.
+   */
+  protected def customV2SourceRewriteRules: Seq[Rule[LogicalPlan]] = Nil


I am going to add a way to inject a custom rule through extensions here in a follow-up PR.

dongjoon-hyun · 2020-12-01T08:28:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

    operatorOptimizationBatch) :+
+    // This batch rewrites plans and should be run after the operator
+    // optimization batch and before any batches that depend on stats.
+    Batch("Rewrite Rules", Once, rewriteRules: _*) :+


Basically, every optimizer rule is rewriting the plans, isn't it?

Hmm, same question. This naming looks too general.

Well, you are right: after removing v2Source from the name, it is a bit too abstract. Any suggestions, @dongjoon-hyun @viirya? I am not sure that making it specific to v2 sources is a good idea either.

I decided to call it dataSourceRewriteRules, as this seems generic enough yet still describes what it is.

Looks better. Thanks for the change.

cloud-fan · 2020-12-01T13:09:00Z

> As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables.

Can you give one example? Why can't the rewrite rules be put in the main optimization batch?

aokolnychyi · 2020-12-01T13:20:34Z

> Can you give one example? Why can't the rewrite rules be put in the main optimization batch?

PR #29066, which I linked, is one example. We want to construct writes after all expressions have been properly optimized.

aokolnychyi · 2020-12-01T13:22:44Z

I am going to provide a hook to this place through session extensions. That's why a separate batch seems like a good idea.

dongjoon-hyun

+1, LGTM again.
I saw that @rdblue's previous comment was addressed too.
Merged to master for Apache Spark 3.1.0.

viirya · 2020-12-01T17:34:35Z

lgtm too.

rdblue · 2020-12-01T17:34:52Z

+1, thanks for updating the batch order.

Thanks for reviewing and merging, @dongjoon-hyun!

gatorsmile · 2020-12-02T04:16:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

    operatorOptimizationBatch) :+
+    // This batch rewrites data source plans and should be run after the operator
+    // optimization batch and before any batches that depend on stats.
+    Batch("Data Source Rewrite Rules", Once, dataSourceRewriteRules: _*) :+


Basically, what you want to do is to add an extension point/batch between heuristics-based optimizer and cost-based optimizer.

The batch name and comments do not look good to me. We need a better name here.

cc @maryannxue @hvanhovell

Could you propose a name then, @gatorsmile?

I am open to alternatives.

We should probably combine this batch with the one below, earlyScanPushDownRules, and give it a more general name similar to extendedOperatorOptimizationRules.

I'd vote for preCBORules.

According to the current discussion, could you make a follow-up PR with preCBORules, @aokolnychyi ? If we have a PR, it would be easier to make a final decision.

I don't think that preCBORules is a good name. This batch is for rewrites that need to happen after basic optimization simplifies expressions and then pushes filters and projections. It also needs to happen before early pushdown, which in turn needs to come before CBO. All of that is before CBO, and that name doesn't capture what this is to be used for.

A more descriptive name is "planRewriteRules" because this is for rewriting plans after initial optimization, but before other optimizer rules that need to run after that rewrite, like early pushdown, CBO, etc.

The name "postOperatorOptimizationRules" is okay, but not very descriptive.

Let's finish this discussion in a separate PR. I'll create one now.

I created PR #30808.

gatorsmile · 2020-12-02T04:19:34Z

sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala

+   *
+   * Note that this may NOT depend on the `optimizer` function.
+   */
+  protected def customDataSourceRewriteRules: Seq[Rule[LogicalPlan]] = Nil
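As a rough illustration of how a hook with this shape can be used, the toy sketch below has a subclass override the hook to contribute a rule. It does not use Spark's classes: `BaseBuilder`, `CustomBuilder`, `Plan`, and `Rule` are hypothetical stand-ins that only mirror the `protected def ... = Nil` pattern from the diff. The note about not depending on `optimizer` presumably exists because the optimizer is built from these rules, so the sketch reads the hook only from inside `optimize`.

```scala
// Toy model only: hypothetical stand-ins, not Spark's BaseSessionStateBuilder.
object BuilderHookSketch {
  type Plan = String
  type Rule = Plan => Plan

  class BaseBuilder {
    // Same shape as the hook in the diff: no custom rules by default.
    protected def customDataSourceRewriteRules: Seq[Rule] = Nil

    // The "optimizer" reads the hook; the hook must not call back into it.
    def optimize(plan: Plan): Plan =
      customDataSourceRewriteRules.foldLeft(plan)((p, r) => r(p))
  }

  class CustomBuilder extends BaseBuilder {
    // A subclass contributes its own rewrite rule by overriding the hook.
    override protected def customDataSourceRewriteRules: Seq[Rule] =
      Seq((plan: Plan) => s"Rewritten($plan)")
  }
}
```

Here `new BuilderHookSketch.CustomBuilder().optimize("Scan")` returns `"Rewritten(Scan)"`, while the base builder leaves the plan unchanged.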


This name does not explain the goal. IMO, this is misleading. Let us make the API name more general. It is not related to the data sources.

### What changes were proposed in this pull request?

This PR tries to rename `dataSourceRewriteRules` into something more generic.

### Why are the changes needed?

These changes are needed to address the post-review discussion [here](#30558 (comment)).

### Does this PR introduce _any_ user-facing change?

Yes but the changes haven't been released yet.

### How was this patch tested?

Existing tests.

Closes #30808 from aokolnychyi/spark-33784.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

[SPARK-33612][SQL] Add v2SourceRewriteRules batch to Optimizer (b340917)

github-actions bot added the SQL label on Nov 30, 2020

rdblue reviewed on Nov 30, 2020

dongjoon-hyun approved these changes on Nov 30, 2020

Rework approach (ec31aec)

aokolnychyi changed the title from "[SPARK-33612][SQL] Add v2SourceRewriteRules batch to Optimizer" to "[SPARK-33612][SQL] Add rewriteRules batch to Optimizer" on Dec 1, 2020

dongjoon-hyun reviewed on Dec 1, 2020

aokolnychyi mentioned this pull request on Dec 1, 2020: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes #29066 (Closed)

Change name (11f07cd)

aokolnychyi changed the title from "[SPARK-33612][SQL] Add rewriteRules batch to Optimizer" to "[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer" on Dec 1, 2020

cloud-fan approved these changes on Dec 1, 2020

dongjoon-hyun approved these changes on Dec 1, 2020

dongjoon-hyun closed this in c24f2b2 on Dec 1, 2020

gatorsmile reviewed on Dec 2, 2020

aokolnychyi mentioned this pull request on Dec 2, 2020: [SPARK-33621][SQL] Add a way to inject data source rewrite rules #30577 (Closed)

aokolnychyi mentioned this pull request on Dec 16, 2020: [SPARK-33784][SQL] Rename dataSourceRewriteRules batch #30808 (Closed)

aokolnychyi mentioned this pull request on Dec 24, 2020: [SPARK-33621][SPARK-33784][SQL][3.1] Add a way to inject data source rewrite rules #30917 (Closed)
