-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-33784][SQL] Rename dataSourceRewriteRules batch #30808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| */ | ||
| def injectDataSourceRewriteRule(builder: RuleBuilder): Unit = { | ||
| dataSourceRewriteRules += builder | ||
| def injectPostOperatorOptimizationRule(builder: RuleBuilder): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't final.
I'll explain my current thinking here. We are going to use this batch to rewrite plans directly after operator optimization. I did not go for planRewriteRules due to the comment made by @dongjoon-hyun and @viirya (please let me know if you don't feel that way anymore). I also did not go for preCBORules as there quite a few things that will happen before CBO like discussed in the original thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I do not like this name. Our catalyst is designed as a query compiler. There does not exist a concept in a query compiler called "Post Operator Optimization". If you talk about this extension point in a Database class, no one understands what it means from the name.
My point is still the same. Here, we are adding the extension point for advanced users to add the rules between the rule-based optimizer (RBO) rules and the cost-based optimizer (CBO) rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batch/function/rule names in Optimizer.scala is not critical to me. They are private. We can change them whenever we need. However, this API is an external developer API. We should be very careful to name it.
To me, either preCBORules or postRBORules are much better than postOperatorOptimizationRule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, eventually, we should move all the statistics based rules from the optimizer to the planner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that we don't have a clear separation between RBO and CBO. There are RBO rules before and after the only CBO rule CostBasedJoinReorder.
I think the general idea of this batch is to allow people to inject special optimizer rules that can't be run together with the main operator optimizer batch. It's indeed a Spark specific thing, as the main operator optimizer batch will be run many times until reaching the fixed point, the new batch added here will be run only once.
It's really hard to do the naming here. To match the actual purpose and to be general, how about Phase 2 Optimizer Rules or Run Once Optimizer Rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phase 2 Optimizer Rules or Run Once Optimizer Rules does not explain the location of the batch.
How many batches do we want to add? I am afraid we will add phase 2, 3, 4, .... This is endless. It looks very random. We should not expect the users need to understand/read the source code of our optimizer every new release. It is also fragile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do share some of the points brought by @cloud-fan. There is no clear separation between CBO and RBO and we run quite some rule-based optimizations after the existing cost-based optimization rule. Also, we need to run this batch before early scan pushdown rules to capture possible rewrites, which, in turn, run before CBO. That’s why preCBORules does not seem to ideally convey what this rule does.
Our existing hook in the extensions API says The injected rules will be executed during the operator optimization batch. That was one of my motivations for the current name as we have resolution and post-hoc resolution rules.
That said, I don’t mean postOperatorOptimizationRules is a perfect name either. It just seems to better reflect the place in the Spark optimizer. I’ll be happy to discuss and iterate further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be also interested to hear from @hvanhovell and other folks who commented.
|
Thank you for making a follow-up, @aokolnychyi . I believe we can reach an agreement here. |
|
I'm fine with post operator optimization. |
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds better to me as the name indicates the position of the batch should be.
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #132897 has finished for PR 30808 at commit
|
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. @gatorsmile @maryannxue @cloud-fan do you have any though?
|
Let us mark as WIP before we get an agreement on the API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, All.
All of your comments are important to us and @aokolnychyi has been trying to embrace your opinions and to lead the discussion on the naming according to @gatorsmile 's comment. (#30558 (comment))
The batch name and comments look not good to me. We need a better name here.
Since this issue (SPARK-33784) is also considered as an blocker by @HyukjinKwon , let's try to summarize once more. The following is a brief summary. Please feel free to add.
| Name | Description |
|---|---|
| customRewriteRules | The initial naming from the author |
| customDataSourceRewriteRules | The initial commit with 5 +1 |
| postOperatorOptimizationRules | This PR |
| preCBORules | 3 +1 and 1 -1 |
| planRewriteRules, preCostBasedOptimizationRules | The other proposed alternatives |
AFAIK, according to the current comments, preCBORules was the most actively supported by three people as the better alternative and vetoed by one.
- @maryannxue proposed #30558 (comment)
- @cloud-fan wrote
I'd vote for preCBORules. - @gatorsmile wrote
To me, either preCBORules or postRBORules are much better than postOperatorOptimizationRule - @rdblue wrote
I don't think that preCBORules is a good name.
However, the committed name, customDataSourceRewriteRules, also had +5 approval.
In terms of veto status, although there are unclear parts between the communication, the following two looks outstanding.
- @gatorsmile vetoed
dataSourceRewriteRulesafter +5 approval - @rdblue vetoed
preCBORulesafter +3 positive comments.
Do you think you can switch your decision one way or another to make an agreement for Apache Spark 3.1.0, @gatorsmile and @rdblue ? Given the design of AS-IS Spark, I don't think we can find a perfect name in Spark 3.2 timeframe, either.
|
I think that |
|
Thank you for making us move forward, @rdblue . @aokolnychyi , could you update this PR once more to get further comments? |
|
Also, @gatorsmile . Please let us know what you think about the AS-IS direction. |
|
I am okay with |
|
I am going to update this PR to use Let me change the name and unblock this PR. |
| * The injected rules will be executed after the operator optimization batch and before rules | ||
| * that depend on stats. | ||
| * Inject an optimizer `Rule` builder that rewrites logical plans into the [[SparkSession]]. | ||
| * The injected rules will be executed once after the operator optimization batch and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added that the rule will be executed once to the docs.
|
Seems we reached to the agreement. I am removing |
|
I've updated this PR. Thanks everyone who participated in the discussion and @dongjoon-hyun for creating a summary to move this forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for update. I hope we can have this functionality in Apache Spark 3.1.0.
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #133149 has finished for PR 30808 at commit
|
|
thanks, merging to master! |
|
It conflicts with 3.1, @aokolnychyi can you open a backport PR? thanks! |
|
Yep, will do today, @cloud-fan. |
|
Thank you all for the decision! |
|
Gentle ping, @aokolnychyi . |
|
@dongjoon-hyun @cloud-fan, shall I cherry-pick this change in full or shall I cherry-pick only part that is present in 3.1? Like I mentioned earlier, the second part of this change (the hook to inject custom rules) was created before 3.1 was cut but was merged a couple of days after. I think it only makes sense to have the new batch in 3.1 if we also add the hook. |
This PR tries to rename `dataSourceRewriteRules` into something more generic. These changes are needed to address the post-review discussion [here](apache#30558 (comment)). Yes but the changes haven't been released yet. Existing tests. Closes apache#30808 from aokolnychyi/spark-33784. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
|
Sorry for late response. The new backporting PR (SPARK-33621 and SPARK-33784) looks good to me and it landed at branch-3.1 for Apache Spark 3.1.0 a few minutes ago. |
What changes were proposed in this pull request?
This PR tries to rename
dataSourceRewriteRulesinto something more generic.Why are the changes needed?
These changes are needed to address the post-review discussion here.
Does this PR introduce any user-facing change?
Yes but the changes haven't been released yet.
How was this patch tested?
Existing tests.