[SPARK-33784][SQL] Rename dataSourceRewriteRules batch #30808

aokolnychyi · 2020-12-16T16:55:02Z

What changes were proposed in this pull request?

This PR tries to rename dataSourceRewriteRules into something more generic.

Why are the changes needed?

These changes are needed to address the post-review discussion here.

Does this PR introduce any user-facing change?

Yes, but the changes haven't been released yet.

How was this patch tested?

Existing tests.

aokolnychyi · 2020-12-16T17:00:15Z

sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

   */
-  def injectDataSourceRewriteRule(builder: RuleBuilder): Unit = {
-    dataSourceRewriteRules += builder
+  def injectPostOperatorOptimizationRule(builder: RuleBuilder): Unit = {


This isn't final.

I'll explain my current thinking here. We are going to use this batch to rewrite plans directly after operator optimization. I did not go for planRewriteRules due to the comments made by @dongjoon-hyun and @viirya (please let me know if you don't feel that way anymore). I also did not go for preCBORules, as there are quite a few things that happen before CBO, as discussed in the original thread.

To be honest, I do not like this name. Our catalyst is designed as a query compiler. There does not exist a concept in a query compiler called "Post Operator Optimization". If you talk about this extension point in a Database class, no one understands what it means from the name.

My point is still the same. Here, we are adding the extension point for advanced users to add the rules between the rule-based optimizer (RBO) rules and the cost-based optimizer (CBO) rules.

The batch/function/rule names in Optimizer.scala are not critical to me. They are private. We can change them whenever we need. However, this API is an external developer API. We should be very careful when naming it.

To me, either preCBORules or postRBORules are much better than postOperatorOptimizationRule.

BTW, eventually, we should move all the statistics-based rules from the optimizer to the planner.

It's unfortunate that we don't have a clear separation between RBO and CBO. There are RBO rules before and after the only CBO rule CostBasedJoinReorder.

I think the general idea of this batch is to allow people to inject special optimizer rules that can't be run together with the main operator optimizer batch. It's indeed a Spark-specific thing: the main operator optimizer batch is run many times until it reaches a fixed point, whereas the new batch added here is run only once.
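To make the fixed-point versus run-once distinction concrete, here is a minimal toy sketch. It is not Spark's `RuleExecutor`: a "plan" is just an `Int`, a rule is a plain function, and every name in it (`BatchStrategyDemo`, `halveIfEven`, the batch labels) is hypothetical and chosen only for illustration.

```scala
// Toy model only: a "plan" is an Int and a rule is a function Int => Int.
// None of these names come from Spark; they are assumptions for illustration.
object BatchStrategyDemo {
  type Plan = Int
  type Rule = Plan => Plan

  sealed trait Strategy
  case object Once extends Strategy
  final case class FixedPoint(maxIterations: Int) extends Strategy

  final case class Batch(name: String, strategy: Strategy, rules: Seq[Rule])

  /** Applies every rule of the batch once, in order. */
  private def runPass(plan: Plan, rules: Seq[Rule]): Plan =
    rules.foldLeft(plan)((current, rule) => rule(current))

  /** Runs a batch and returns the resulting plan plus the number of passes made. */
  def execute(plan: Plan, batch: Batch): (Plan, Int) = batch.strategy match {
    case Once =>
      (runPass(plan, batch.rules), 1)
    case FixedPoint(maxIterations) =>
      var current = plan
      var passes = 0
      var changed = true
      // Keep re-running the whole batch until a pass leaves the plan unchanged.
      while (changed && passes < maxIterations) {
        val next = runPass(current, batch.rules)
        passes += 1
        changed = next != current
        current = next
      }
      (current, passes)
  }

  // Hypothetical rule: halve the plan while it is a non-zero even number.
  val halveIfEven: Rule = p => if (p != 0 && p % 2 == 0) p / 2 else p

  def main(args: Array[String]): Unit = {
    val fixedPoint = Batch("fixed-point batch", FixedPoint(100), Seq(halveIfEven))
    val runOnce = Batch("run-once batch", Once, Seq(halveIfEven))
    println(execute(40, fixedPoint)) // (5,4): three rewrites plus one pass that detects no change
    println(execute(40, runOnce))    // (20,1): a single pass, even though the rule could fire again
  }
}
```

The point of the sketch is that a rule placed in a run-once batch cannot rely on being re-applied until the plan stabilizes, which is why such rules are kept out of the main fixed-point batch.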

It's really hard to do the naming here. To match the actual purpose and to be general, how about Phase 2 Optimizer Rules or Run Once Optimizer Rules?

Neither Phase 2 Optimizer Rules nor Run Once Optimizer Rules explains the location of the batch.

How many batches do we want to add? I am afraid we will add phase 2, 3, 4, .... This is endless. It looks very random. We should not expect users to read and understand the source code of our optimizer with every new release. It is also fragile.

I do share some of the points raised by @cloud-fan. There is no clear separation between CBO and RBO, and we run quite a few rule-based optimizations after the existing cost-based optimization rule. Also, we need to run this batch before the early scan pushdown rules to capture possible rewrites, and those rules, in turn, run before CBO. That's why preCBORules does not seem to ideally convey what this rule does.

Our existing hook in the extensions API says "The injected rules will be executed during the operator optimization batch." That was one of my motivations for the current name, as we have resolution and post-hoc resolution rules.

That said, I don’t mean postOperatorOptimizationRules is a perfect name either. It just seems to better reflect the place in the Spark optimizer. I’ll be happy to discuss and iterate further.

I'd also be interested to hear from @hvanhovell and other folks who commented.

aokolnychyi · 2020-12-16T17:01:40Z

cc @rdblue @HyukjinKwon @dongjoon-hyun @cloud-fan @viirya @dbtsai @sunchao @gatorsmile @maryannxue

dongjoon-hyun · 2020-12-16T17:03:34Z

Thank you for making a follow-up, @aokolnychyi . I believe we can reach an agreement here.

rdblue · 2020-12-16T17:40:20Z

I'm fine with post operator optimization.

viirya

Sounds better to me, as the name indicates where the batch should be positioned.

SparkQA · 2020-12-16T18:06:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37499/

SparkQA · 2020-12-16T18:39:45Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37499/

SparkQA · 2020-12-16T21:24:40Z

Test build #132897 has finished for PR 30808 at commit 8de3958.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

This looks good to me. @gatorsmile @maryannxue @cloud-fan do you have any thoughts?

gatorsmile · 2020-12-17T04:00:55Z

Let us mark as WIP before we get an agreement on the API.

dongjoon-hyun

Hi, All.

All of your comments are important to us and @aokolnychyi has been trying to embrace your opinions and to lead the discussion on the naming according to @gatorsmile 's comment. (#30558 (comment))

"The batch name and comments look not good to me. We need a better name here."

Since this issue (SPARK-33784) is also considered a blocker by @HyukjinKwon, let's try to summarize once more. The following is a brief summary. Please feel free to add to it.

| Name | Description |
| --- | --- |
| customRewriteRules | The initial naming from the author |
| customDataSourceRewriteRules | The initial commit with 5 `+1` |
| postOperatorOptimizationRules | This PR |
| preCBORules | 3 `+1` and 1 `-1` |
| planRewriteRules, preCostBasedOptimizationRules | The other proposed alternatives |

AFAIK, according to the current comments, preCBORules was the most actively supported by three people as the better alternative and vetoed by one.

@maryannxue proposed #30558 (comment)
@cloud-fan wrote "I'd vote for preCBORules."
@gatorsmile wrote "To me, either preCBORules or postRBORules are much better than postOperatorOptimizationRule."
@rdblue wrote "I don't think that preCBORules is a good name."

However, the committed name, customDataSourceRewriteRules, also had +5 approval.

In terms of veto status, although some parts of the communication are unclear, the following two vetoes stand out.

@gatorsmile vetoed dataSourceRewriteRules after +5 approval
@rdblue vetoed preCBORules after +3 positive comments.

Do you think you can switch your decision one way or another so that we can reach an agreement for Apache Spark 3.1.0, @gatorsmile and @rdblue? Given the design of AS-IS Spark, I don't think we can find a perfect name in the Spark 3.2 timeframe, either.

rdblue · 2020-12-19T22:21:16Z

I think that preCBORules is a bad name, but I am fine with using it to move forward.

dongjoon-hyun · 2020-12-19T23:10:25Z

Thank you for making us move forward, @rdblue .

@aokolnychyi , could you update this PR once more to get further comments?

dongjoon-hyun · 2020-12-19T23:13:05Z

Also, @gatorsmile . Please let us know what you think about the AS-IS direction.

HyukjinKwon · 2020-12-21T11:05:30Z

I am okay with preCBORules too. Looks like we're okay with this name.

aokolnychyi · 2020-12-21T11:12:59Z

I am going to update this PR to use preCBORules if there is enough consensus. I did have similar concerns to what @cloud-fan mentioned in this comment, but I don't have an ideal alternative.

Let me change the name and unblock this PR.

aokolnychyi · 2020-12-21T11:28:12Z

sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

-   * The injected rules will be executed after the operator optimization batch and before rules
-   * that depend on stats.
+   * Inject an optimizer `Rule` builder that rewrites logical plans into the [[SparkSession]].
+   * The injected rules will be executed once after the operator optimization batch and


I added that the rule will be executed once to the docs.
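For readers who want to see how such a hook might be used, below is a minimal sketch. It assumes the hook is exposed as `SparkSessionExtensions.injectPreCBORule`, the name under which it appears in released Spark 3.1 and later, and that Spark SQL is on the classpath. The rule, the counter, and the extension class (`NoopPreCBORule`, `InvocationCounter`, `DemoExtensions`) are hypothetical names made up for this example, and the rule is deliberately a no-op.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical helper that records how often the rule below is invoked.
object InvocationCounter {
  val count = new AtomicInteger(0)
}

// Hypothetical rule: returns the plan unchanged and only counts invocations.
case class NoopPreCBORule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    InvocationCounter.count.incrementAndGet()
    plan
  }
}

// Hypothetical extension class; it could also be wired up via spark.sql.extensions.
class DemoExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // The builder receives the session and returns the rule to inject.
    extensions.injectPreCBORule(session => NoopPreCBORule(session))
  }
}

object PreCBOInjectionDemo {
  def buildSession(): SparkSession =
    SparkSession.builder()
      .master("local[1]")
      .appName("pre-cbo-injection-demo")
      .withExtensions(new DemoExtensions)
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    // Forcing the optimized plan runs the optimizer, including the injected rule.
    spark.range(10).filter("id > 5").queryExecution.optimizedPlan
    println(s"Rule invocations so far: ${InvocationCounter.count.get()}")
    spark.stop()
  }
}
```

A real rule would pattern-match on the plan and return a rewritten tree. Because the batch is documented as running once, such a rule should produce its final result in a single application rather than depend on repeated passes.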

HyukjinKwon · 2020-12-21T11:28:38Z

It seems we have reached an agreement. I am removing WIP.

aokolnychyi · 2020-12-21T11:29:19Z

I've updated this PR.

Thanks everyone who participated in the discussion and @dongjoon-hyun for creating a summary to move this forward.

dongjoon-hyun

Thank you for the update. I hope we can have this functionality in Apache Spark 3.1.0.

SparkQA · 2020-12-21T12:20:51Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37748/

SparkQA · 2020-12-21T12:51:32Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37748/

SparkQA · 2020-12-21T15:54:18Z

Test build #133149 has finished for PR 30808 at commit dcd0b7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-12-22T08:29:20Z

thanks, merging to master!

cloud-fan · 2020-12-22T08:30:26Z

It conflicts with 3.1, @aokolnychyi can you open a backport PR? thanks!

aokolnychyi · 2020-12-22T09:17:18Z

Yep, will do today, @cloud-fan.

dongjoon-hyun · 2020-12-22T17:51:52Z

Thank you all for the decision!

dongjoon-hyun · 2020-12-23T14:12:43Z

Gentle ping, @aokolnychyi .

aokolnychyi · 2020-12-24T11:49:47Z

@dongjoon-hyun @cloud-fan, shall I cherry-pick this change in full, or shall I cherry-pick only the part that is present in 3.1?

Like I mentioned earlier, the second part of this change (the hook to inject custom rules) was created before 3.1 was cut but was merged a couple of days after. I think it only makes sense to have the new batch in 3.1 if we also add the hook.

This PR tries to rename `dataSourceRewriteRules` into something more generic.

These changes are needed to address the post-review discussion [here](apache#30558 (comment)).

Yes, but the changes haven't been released yet.

Existing tests.

Closes apache#30808 from aokolnychyi/spark-33784.

Authored-by: Anton Okolnychyi
Signed-off-by: Wenchen Fan

dongjoon-hyun · 2020-12-24T20:43:01Z

Sorry for the late response. The new backporting PR (SPARK-33621 and SPARK-33784) looks good to me, and it landed in branch-3.1 for Apache Spark 3.1.0 a few minutes ago.

[SPARK-33784][SQL] Rename dataSourceRewriteRules batch

8de3958

aokolnychyi commented Dec 16, 2020

aokolnychyi mentioned this pull request Dec 16, 2020

[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer #30558

Closed

viirya reviewed Dec 16, 2020

github-actions bot added the SQL label Dec 16, 2020

HyukjinKwon approved these changes Dec 17, 2020

gatorsmile changed the title ~~[SPARK-33784][SQL] Rename dataSourceRewriteRules batch~~ [WIP] [SPARK-33784][SQL] Rename dataSourceRewriteRules batch Dec 17, 2020

dongjoon-hyun reviewed Dec 19, 2020

Switch to preCBORules

dcd0b7c

aokolnychyi commented Dec 21, 2020

HyukjinKwon changed the title ~~[WIP] [SPARK-33784][SQL] Rename dataSourceRewriteRules batch~~ [SPARK-33784][SQL] Rename dataSourceRewriteRules batch Dec 21, 2020

dongjoon-hyun approved these changes Dec 21, 2020

gatorsmile approved these changes Dec 22, 2020

cloud-fan approved these changes Dec 22, 2020

cloud-fan closed this in 7bbcbb8 Dec 22, 2020

aokolnychyi mentioned this pull request Dec 24, 2020

[SPARK-33621][SPARK-33784][SQL][3.1] Add a way to inject data source rewrite rules #30917

Closed

cloud-fan mentioned this pull request Dec 7, 2021

[SPARK-37518][SQL] Inject an early scan pushdown rule #34779

Closed
