-
Couldn't load subscription status.
- Fork 28.9k
[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer #30558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
| * | ||
| * Note that this may NOT depend on the `optimizer` function. | ||
| */ | ||
| protected def customV2SourceRewriteRules: Seq[Rule[LogicalPlan]] = Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to add a way to inject a custom rule through extensions here in a follow-up PR.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
| operatorOptimizationBatch) :+ | ||
| // This batch rewrites plans and should be run after the operator | ||
| // optimization batch and before any batches that depend on stats. | ||
| Batch("Rewrite Rules", Once, rewriteRules: _*) :+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, every optimizer rule is rewriting the plans, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, same question. This naming looks too general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you are right after removing v2Source from the name is it a bit too abstract. Any suggestions, @dongjoon-hyun @viirya ? I am not sure making it specific to v2 sources is a good idea too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to call it dataSourceRewriteRules as this seems to be generic enough but yet describes what it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks better. Thanks for the change.
Can you give one example? Why can't the rewrite rules be put in the main optimization batch? |
PR #29066 I linked is one example. We want to construct writes after all expressions have been properly optimized. |
|
I am going to provide a hook to this place through session extensions. That's why a separate batch seems like a good idea. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM again.
I saw that @rdblue 's previous comment was addressed too
Merged to master for Apache Spark 3.1.0.
|
lgtm too. |
|
+1, thanks for updating the batch order. Thanks for reviewing and merging, @dongjoon-hyun! |
| operatorOptimizationBatch) :+ | ||
| // This batch rewrites data source plans and should be run after the operator | ||
| // optimization batch and before any batches that depend on stats. | ||
| Batch("Data Source Rewrite Rules", Once, dataSourceRewriteRules: _*) :+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, what you want to do is to add an extension point/batch between heuristics-based optimizer and cost-based optimizer.
The batch name and comments look not good to me. We need a better name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you propose a name then, @gatorsmile ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably combine this batch with the one below: earlyScanPushDownRules, and give it a more general name similar to extendedOperatorOptimizationRules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd vote for preCBORules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the current discussion, could you make a follow-up PR with preCBORules, @aokolnychyi ? If we have a PR, it would be easier to make a final decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that preCBORules is a good name. This batch is for rewrites that need to happen after basic optimization simplifies expressions and then pushes filters and projections. It also needs to happen before early pushdown, which in turn needs to come before CBO. All of that is before CBO, and that name doesn't capture what this is to be used for.
A more descriptive name is "planRewriteRules" because this is for rewriting plans after initial optimization, but before other optimizer rules that need to run after that rewrite, like early pushdown, CBO, etc.
The name "postOperatorOptimizationRules" is okay, but not very descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's finish this discussion in a separate PR. I'll create one now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created PR #30808
| * | ||
| * Note that this may NOT depend on the `optimizer` function. | ||
| */ | ||
| protected def customDataSourceRewriteRules: Seq[Rule[LogicalPlan]] = Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name does not explain the goal. IMO, this is misleading. Let us make the API name more general. It is not related to the data sources.
### What changes were proposed in this pull request? This PR tries to rename `dataSourceRewriteRules` into something more generic. ### Why are the changes needed? These changes are needed to address the post-review discussion [here](#30558 (comment)). ### Does this PR introduce _any_ user-facing change? Yes but the changes haven't been released yet. ### How was this patch tested? Existing tests. Closes #30808 from aokolnychyi/spark-33784. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
This PR tries to rename `dataSourceRewriteRules` into something more generic. These changes are needed to address the post-review discussion [here](apache#30558 (comment)). Yes but the changes haven't been released yet. Existing tests. Closes apache#30808 from aokolnychyi/spark-33784. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.
Why are the changes needed?
Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The change is trivial and SPARK-23889 will depend on it.