Conversation

@aokolnychyi
Contributor

What changes were proposed in this pull request?

This PR tries to rename dataSourceRewriteRules into something more generic.

Why are the changes needed?

These changes are needed to address the post-review discussion here (#30558 (comment)).

Does this PR introduce any user-facing change?

Yes, but the changes haven't been released yet.

How was this patch tested?

Existing tests.

Diff excerpt under review (the injection hook being renamed):

     */
  -  def injectDataSourceRewriteRule(builder: RuleBuilder): Unit = {
  -    dataSourceRewriteRules += builder
  +  def injectPostOperatorOptimizationRule(builder: RuleBuilder): Unit = {
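For context, a minimal sketch of the plumbing behind an injection hook like the one in the diff above, assuming the RuleBuilder alias (SparkSession => Rule[LogicalPlan]) used by SparkSessionExtensions; the class name, buffer, and helper method here are illustrative, not the actual Spark source:

```scala
import scala.collection.mutable

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative stand-in for the extension-point plumbing under discussion.
class ExtensionsSketch {
  // A builder receives the active session and returns an optimizer rule.
  type RuleBuilder = SparkSession => Rule[LogicalPlan]

  private val postOperatorOptimizationRules = mutable.Buffer.empty[RuleBuilder]

  // Registers a builder; the optimizer later materializes these rules with the
  // active session and runs them once, after the operator optimization batch.
  def injectPostOperatorOptimizationRule(builder: RuleBuilder): Unit = {
    postOperatorOptimizationRules += builder
  }

  // Materializes the registered builders for a given session.
  def buildPostOperatorOptimizationRules(session: SparkSession): Seq[Rule[LogicalPlan]] = {
    postOperatorOptimizationRules.map(_.apply(session)).toSeq
  }
}
```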
Contributor Author

This isn't final.

I'll explain my current thinking here. We are going to use this batch to rewrite plans directly after operator optimization. I did not go for planRewriteRules due to the comments made by @dongjoon-hyun and @viirya (please let me know if you don't feel that way anymore). I also did not go for preCBORules as there are quite a few things that will happen before CBO, as discussed in the original thread.

Member

To be honest, I do not like this name. Our Catalyst is designed as a query compiler, and there is no concept in a query compiler called "Post Operator Optimization". If you mentioned this extension point in a database class, no one would understand what it means from the name.

My point is still the same. Here, we are adding the extension point for advanced users to add the rules between the rule-based optimizer (RBO) rules and the cost-based optimizer (CBO) rules.

Member

The batch/function/rule names in Optimizer.scala are not critical to me. They are private, and we can change them whenever we need. However, this API is an external developer API, so we should be very careful when naming it.

To me, either preCBORules or postRBORules is much better than postOperatorOptimizationRule.

Member

BTW, eventually we should move all the statistics-based rules from the optimizer to the planner.

Contributor

It's unfortunate that we don't have a clear separation between RBO and CBO. There are RBO rules before and after the only CBO rule, CostBasedJoinReorder.

I think the general idea of this batch is to allow people to inject special optimizer rules that can't run together with the main operator optimization batch. It's indeed a Spark-specific thing: the main operator optimization batch is run repeatedly until it reaches a fixed point, whereas the new batch added here is run only once.

It's really hard to do the naming here. To match the actual purpose and to be general, how about Phase 2 Optimizer Rules or Run Once Optimizer Rules?
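To make the run-once-versus-fixed-point distinction concrete, here is a minimal sketch of how a Catalyst RuleExecutor declares batch strategies; the rule, the batch names, and the iteration limit are placeholders rather than the actual Optimizer batches:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// Placeholder rule: a real rule would pattern-match and rewrite the plan.
object NoopRewrite extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object SketchOptimizer extends RuleExecutor[LogicalPlan] {
  override protected def batches: Seq[Batch] = Seq(
    // The main operator optimization batch is re-applied until the plan stops
    // changing, bounded by a maximum number of iterations.
    Batch("Operator Optimization (sketch)", FixedPoint(100), NoopRewrite),
    // The batch discussed in this PR is applied exactly once, after the batch above.
    Batch("Run-once rewrites (sketch)", Once, NoopRewrite)
  )
}
```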

Member

Neither Phase 2 Optimizer Rules nor Run Once Optimizer Rules explains the location of the batch.

How many batches do we want to add? I am afraid we will end up adding phase 2, 3, 4, and so on. This is endless and looks very random. We should not expect users to understand or read the source code of our optimizer with every new release. It is also fragile.

Contributor Author

aokolnychyi commented Dec 17, 2020

I do share some of the points raised by @cloud-fan. There is no clear separation between CBO and RBO, and we run quite a few rule-based optimizations after the existing cost-based optimization rule. Also, we need to run this batch before the early scan pushdown rules to capture possible rewrites, and those, in turn, run before CBO. That's why preCBORules does not seem to ideally convey what this batch does.

Our existing hook in the extensions API says "The injected rules will be executed during the operator optimization batch." That was one of my motivations for the current name, as we also have resolution and post-hoc resolution rules.

That said, I don't mean postOperatorOptimizationRules is a perfect name either. It just seems to better reflect the place in the Spark optimizer. I'll be happy to discuss and iterate further.

Contributor Author

I'd also be interested to hear from @hvanhovell and other folks who commented.

@dongjoon-hyun
Member

Thank you for making a follow-up, @aokolnychyi . I believe we can reach an agreement here.

@rdblue
Contributor

rdblue commented Dec 16, 2020

I'm fine with post operator optimization.

Member

@viirya left a comment

Sounds better to me, as the name indicates where the batch should be positioned.

@SparkQA

SparkQA commented Dec 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37499/

github-actions bot added the SQL label Dec 16, 2020
@SparkQA

SparkQA commented Dec 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37499/

@SparkQA

SparkQA commented Dec 16, 2020

Test build #132897 has finished for PR 30808 at commit 8de3958.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon left a comment

This looks good to me. @gatorsmile @maryannxue @cloud-fan do you have any thoughts?

@gatorsmile
Member

Let us mark this as WIP until we reach an agreement on the API.

@gatorsmile changed the title from [SPARK-33784][SQL] Rename dataSourceRewriteRules batch to [WIP] [SPARK-33784][SQL] Rename dataSourceRewriteRules batch Dec 17, 2020
Member

@dongjoon-hyun left a comment

Hi, All.

All of your comments are important to us, and @aokolnychyi has been trying to embrace your opinions and lead the discussion on the naming according to @gatorsmile's comment (#30558 (comment)):

"The batch name and comments look not good to me. We need a better name here."

Since this issue (SPARK-33784) is also considered a blocker by @HyukjinKwon, let's try to summarize once more. The following is a brief summary; please feel free to add.

Name                                                Description
customRewriteRules                                  The initial naming from the author
customDataSourceRewriteRules                        The initial commit with 5 +1
postOperatorOptimizationRules                       This PR
preCBORules                                         3 +1 and 1 -1
planRewriteRules, preCostBasedOptimizationRules     The other proposed alternatives

AFAIK, according to the current comments, preCBORules was the most actively supported by three people as the better alternative and vetoed by one.

However, the committed name, customDataSourceRewriteRules, also had +5 approval.

In terms of veto status, although parts of the communication are unclear, the following two stand out.

  • @gatorsmile vetoed dataSourceRewriteRules after +5 approval
  • @rdblue vetoed preCBORules after +3 positive comments.

Do you think you can switch your decision one way or another to reach an agreement for Apache Spark 3.1.0, @gatorsmile and @rdblue? Given the design of AS-IS Spark, I don't think we can find a perfect name in the Spark 3.2 timeframe, either.

@rdblue
Contributor

rdblue commented Dec 19, 2020

I think that preCBORules is a bad name, but I am fine with using it to move forward.

@dongjoon-hyun
Member

Thank you for making us move forward, @rdblue .

@aokolnychyi , could you update this PR once more to get further comments?

@dongjoon-hyun
Member

Also, @gatorsmile, please let us know what you think about the AS-IS direction.

@HyukjinKwon
Member

I am okay with preCBORules too. Looks like we're okay with this name.

@aokolnychyi
Contributor Author

I am going to update this PR to use preCBORules if there is enough consensus. I did have concerns similar to what @cloud-fan mentioned in this comment, but I don't have an ideal alternative.

Let me change the name and unblock this PR.

Scaladoc diff excerpt under review (the docs for the injected rules being updated):

     * Inject an optimizer `Rule` builder that rewrites logical plans into the [[SparkSession]].
  -  * The injected rules will be executed after the operator optimization batch and before rules
  -  * that depend on stats.
  +  * The injected rules will be executed once after the operator optimization batch and
Contributor Author

I added to the docs that the rules will be executed once.
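For reference, a minimal sketch of how a developer would plug into the extension point once it is named injectPreCBORule (the name agreed above); MyPreCBORule, MyExtensions, and com.example are hypothetical names introduced only for illustration:

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that would rewrite plans once, after operator optimization
// and before any stats-dependent (CBO) rules; here it is a no-op placeholder.
case class MyPreCBORule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Hypothetical extensions class registering the rule through the renamed hook.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectPreCBORule(session => MyPreCBORule(session))
  }
}
```

Such a class would typically be activated by setting spark.sql.extensions to its fully qualified name (e.g. the hypothetical com.example.MyExtensions) when building the SparkSession.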

@HyukjinKwon HyukjinKwon changed the title [WIP] [SPARK-33784][SQL] Rename dataSourceRewriteRules batch [SPARK-33784][SQL] Rename dataSourceRewriteRules batch Dec 21, 2020
@HyukjinKwon
Member

Seems we have reached an agreement. I am removing WIP.

@aokolnychyi
Contributor Author

I've updated this PR.

Thanks to everyone who participated in the discussion, and to @dongjoon-hyun for creating a summary to move this forward.

Member

@dongjoon-hyun left a comment

Thank you for the update. I hope we can have this functionality in Apache Spark 3.1.0.

@SparkQA

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37748/

@SparkQA

SparkQA commented Dec 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37748/

@SparkQA

SparkQA commented Dec 21, 2020

Test build #133149 has finished for PR 30808 at commit dcd0b7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Dec 22, 2020

thanks, merging to master!

@cloud-fan closed this in 7bbcbb8 Dec 22, 2020
@cloud-fan
Contributor

It conflicts with 3.1. @aokolnychyi, can you open a backport PR? Thanks!

@aokolnychyi
Contributor Author

Yep, will do today, @cloud-fan.

@dongjoon-hyun
Member

Thank you all for the decision!

@dongjoon-hyun
Member

Gentle ping, @aokolnychyi .

@aokolnychyi
Contributor Author

aokolnychyi commented Dec 24, 2020

@dongjoon-hyun @cloud-fan, shall I cherry-pick this change in full, or only the part that is present in 3.1?

As I mentioned earlier, the second part of this change (the hook to inject custom rules) was created before 3.1 was cut but was merged a couple of days after. I think it only makes sense to have the new batch in 3.1 if we also add the hook.

aokolnychyi added a commit to aokolnychyi/spark that referenced this pull request Dec 24, 2020
This PR tries to rename `dataSourceRewriteRules` into something more generic.

These changes are needed to address the post-review discussion [here](apache#30558 (comment)).

Yes but the changes haven't been released yet.

Existing tests.

Closes apache#30808 from aokolnychyi/spark-33784.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun
Member

Sorry for the late response. The new backporting PR (SPARK-33621 and SPARK-33784) looks good to me, and it landed on branch-3.1 for Apache Spark 3.1.0 a few minutes ago.
