Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Nov 30, 2020

What changes were proposed in this pull request?

This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.

Why are the changes needed?

Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The change is trivial and SPARK-23889 will depend on it.

@aokolnychyi
Copy link
Contributor Author

aokolnychyi commented Nov 30, 2020

@github-actions github-actions bot added the SQL label Nov 30, 2020
*
* Note that this may NOT depend on the `optimizer` function.
*/
protected def customV2SourceRewriteRules: Seq[Rule[LogicalPlan]] = Nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to add a way to inject a custom rule through extensions here in a follow-up PR.

@aokolnychyi aokolnychyi changed the title [SPARK-33612][SQL] Add v2SourceRewriteRules batch to Optimizer [SPARK-33612][SQL] Add rewriteRules batch to Optimizer Dec 1, 2020
operatorOptimizationBatch) :+
// This batch rewrites plans and should be run after the operator
// optimization batch and before any batches that depend on stats.
Batch("Rewrite Rules", Once, rewriteRules: _*) :+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, every optimizer rule is rewriting the plans, isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, same question. This naming looks too general.

Copy link
Contributor Author

@aokolnychyi aokolnychyi Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you are right after removing v2Source from the name is it a bit too abstract. Any suggestions, @dongjoon-hyun @viirya ? I am not sure making it specific to v2 sources is a good idea too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to call it dataSourceRewriteRules as this seems to be generic enough but yet describes what it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better. Thanks for the change.

@aokolnychyi aokolnychyi changed the title [SPARK-33612][SQL] Add rewriteRules batch to Optimizer [SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer Dec 1, 2020
@cloud-fan
Copy link
Contributor

As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables.

Can you give one example? Why can't the rewrite rules be put in the main optimization batch?

@aokolnychyi
Copy link
Contributor Author

Can you give one example? Why can't the rewrite rules be put in the main optimization batch?

PR #29066 I linked is one example. We want to construct writes after all expressions have been properly optimized.

@aokolnychyi
Copy link
Contributor Author

I am going to provide a hook to this place through session extensions. That's why a separate batch seems like a good idea.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM again.
I saw that @rdblue 's previous comment was addressed too
Merged to master for Apache Spark 3.1.0.

@viirya
Copy link
Member

viirya commented Dec 1, 2020

lgtm too.

@rdblue
Copy link
Contributor

rdblue commented Dec 1, 2020

+1, thanks for updating the batch order.

Thanks for reviewing and merging, @dongjoon-hyun!

operatorOptimizationBatch) :+
// This batch rewrites data source plans and should be run after the operator
// optimization batch and before any batches that depend on stats.
Batch("Data Source Rewrite Rules", Once, dataSourceRewriteRules: _*) :+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, what you want to do is to add an extension point/batch between heuristics-based optimizer and cost-based optimizer.

The batch name and comments look not good to me. We need a better name here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you propose a name then, @gatorsmile ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to alternatives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably combine this batch with the one below: earlyScanPushDownRules, and give it a more general name similar to extendedOperatorOptimizationRules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for preCBORules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the current discussion, could you make a follow-up PR with preCBORules, @aokolnychyi ? If we have a PR, it would be easier to make a final decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that preCBORules is a good name. This batch is for rewrites that need to happen after basic optimization simplifies expressions and then pushes filters and projections. It also needs to happen before early pushdown, which in turn needs to come before CBO. All of that is before CBO, and that name doesn't capture what this is to be used for.

A more descriptive name is "planRewriteRules" because this is for rewriting plans after initial optimization, but before other optimizer rules that need to run after that rewrite, like early pushdown, CBO, etc.

The name "postOperatorOptimizationRules" is okay, but not very descriptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's finish this discussion in a separate PR. I'll create one now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created PR #30808

*
* Note that this may NOT depend on the `optimizer` function.
*/
protected def customDataSourceRewriteRules: Seq[Rule[LogicalPlan]] = Nil
Copy link
Member

@gatorsmile gatorsmile Dec 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name does not explain the goal. IMO, this is misleading. Let us make the API name more general. It is not related to the data sources.

cloud-fan pushed a commit that referenced this pull request Dec 22, 2020
### What changes were proposed in this pull request?

This PR tries to rename `dataSourceRewriteRules` into something more generic.

### Why are the changes needed?

These changes are needed to address the post-review discussion [here](#30558 (comment)).

### Does this PR introduce _any_ user-facing change?

Yes but the changes haven't been released yet.

### How was this patch tested?

Existing tests.

Closes #30808 from aokolnychyi/spark-33784.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
aokolnychyi added a commit to aokolnychyi/spark that referenced this pull request Dec 24, 2020
This PR tries to rename `dataSourceRewriteRules` into something more generic.

These changes are needed to address the post-review discussion [here](apache#30558 (comment)).

Yes but the changes haven't been released yet.

Existing tests.

Closes apache#30808 from aokolnychyi/spark-33784.

Authored-by: Anton Okolnychyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants