[SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion #21764
Conversation
Test build #92992 has finished for PR 21764 at commit
Do you have concrete use cases in your business? Basically, I think the optimizer is a black box for most users, and they can't easily tell whether it still works correctly when some rules are excluded. Are there other database-like systems that implement this kind of interface? Even for debugging, isn't it enough to define an individual test optimizer for each debugging use? E.g., spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala, line 33 in e1de341
Let me give an example. The ticket https://issues.apache.org/jira/browse/SPARK-24624 shows a common issue in which our optimizer does not work well. It is a bug, but our users are unable to easily bypass it. The most straightforward way is to disable the rule.
ReplaceDeduplicateWithAggregate) :: Nil
}

protected def optimizationBatches: Seq[Batch] = {
In optimizationBatches, some rules can't be excluded. Without them, the affected queries can't be executed. For example,
Batch("Replace Operators", fixedPoint,
ReplaceIntersectWithSemiJoin,
ReplaceExceptWithFilter,
ReplaceExceptWithAntiJoin,
ReplaceDistinctWithAggregate)

Can we just introduce a black list?
So, can I do a black list of batches?
yes. We need to exclude Batch("Eliminate Distinct"), Batch("Finish Analysis"), Batch("Replace Operators"), Batch("Pullup Correlated Expressions"), and Batch("RewriteSubquery")
}
!exclude
}
if (batch.rules == filteredRules) {
Maybe the if, else if and else can be removed? Just return the filtered batch?
My understanding is that it is written that way to allow for logging
It is to:
- avoid unnecessary object creation if all rules have been preserved.
- avoid empty batches if all rules in the batch have been removed.
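For illustration, a minimal sketch of that per-batch filtering (not the PR's exact code; the batches, excludedRules, and Batch names follow the diff context above, and the log messages are only indicative):

```scala
// Filter excluded rules out of every batch, keeping the two points above:
// reuse the original Batch when nothing was removed, and drop a batch entirely
// when all of its rules were removed.
batches.flatMap { batch =>
  val filteredRules = batch.rules.filter { rule =>
    val exclude = excludedRules.contains(rule.ruleName)
    if (exclude) {
      logInfo(s"Optimization rule '${rule.ruleName}' is excluded from the optimizer.")
    }
    !exclude
  }
  if (batch.rules == filteredRules) {
    Some(batch)  // nothing removed: no new object created
  } else if (filteredRules.nonEmpty) {
    Some(Batch(batch.name, batch.strategy, filteredRules: _*))  // rebuild with surviving rules
  } else {
    logInfo(s"Optimization batch '${batch.name}' is excluded because all its rules were excluded.")
    None  // avoid running an empty batch
  }
}
```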
@gatorsmile aha, ok. We need to make this option not BTW, the interfaces to add/delete optimizer rules (addition via
@maropu This is for advanced end users or Spark developers. An external conf looks fine, but I have to admit this might be rarely used. BTW, after having this conf, we can deprecate a few internal configurations that are used for disabling specific optimizer rules in Spark 3.0.

ok, thx for the kind explanation.
| Batch("Eliminate Distinct", Once, EliminateDistinct) :: | ||
| // Technically some of the rules in Finish Analysis are not optimizer rules and belong more | ||
| // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime). | ||
| // However, because we also use the analyzer to canonicalized queries (for view definition), |
"to canonicalized" -> "to canonicalize" ?
val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
  .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
    "specified by their rule names and separated by comma. It is not guaranteed that all the " +
    "rules in this configuration will eventually be excluded, as some rules are necessary " +
I don't understand the optimizer at a low level (I'd be one of those users for whom it is a black box). Do you think it would be feasible to enumerate the rules that cannot be excluded? Maybe even log a WARNING when validating the config parameters if they contain required rules.
Nice suggestion! @gatorsmile's other suggestion was to introduce a blacklist, in which case enumerating the rules that cannot be excluded becomes possible. I can add a warning as well.
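For example, something along these lines (a sketch only; nonExcludableRules is a hypothetical name for the list of required rule names, not code from this PR):

```scala
// Warn about any required rule the user asked to exclude; such rules stay in the optimizer.
excludedRules.filter(nonExcludableRules.contains).foreach { ruleName =>
  logWarning(s"Optimization rule '$ruleName' was not excluded from the optimizer " +
    "because this rule is required for correctness.")
}
```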
import org.apache.spark.sql.internal.SQLConf.OPTIMIZER_EXCLUDED_RULES

class OptimizerRuleExclusionSuite extends PlanTest
Any test case for when a required rule is passed as a "to be excluded" rule?
Added :)
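Roughly along these lines (a sketch, not the PR's actual test; getActiveRuleNames() is a hypothetical helper standing in for however the suite inspects the optimizer's batches):

```scala
// Excluding a rule that is required for correctness should have no effect.
test("Try to exclude a non-excludable rule") {
  withSQLConf(OPTIMIZER_EXCLUDED_RULES.key -> ReplaceIntersectWithSemiJoin.ruleName) {
    // The rule must still be present in the optimizer's batches.
    assert(getActiveRuleNames().contains(ReplaceIntersectWithSemiJoin.ruleName))
  }
}
```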
  }
}

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
If we allow this here, this will affect Spark's caching/uncaching plans and tables inconsistently. For the purpose of this PR, StaticSQLConf.scala would be a perfect place for this.
Are you talking about the SQL cache? I don't think the optimizer has anything to do with the SQL cache though, since the logical plans used to match cache entries are "analyzed" plans, not "optimized" plans.
Since an optimizer should not change query semantics (results), it should work well for the case @dongjoon-hyun described. If this is mainly used for debugging, I think it would be nice to be able to use this conf at runtime.
+1 on the debugging purpose. Still, CacheManager matches the analyzed plan, not the optimized plan.
Test build #93217 has finished for PR 21764 at commit

retest this please

btw, I feel the title is a little obscure and how about
| "Finish Analysis" :: | ||
| "Replace Operators" :: | ||
| "Pullup Correlated Expressions" :: | ||
| "RewriteSubquery" :: Nil |
We use batch names rather than rule names in this black list?
I'll change it to a rule black list.
override def batches: Seq[Batch] = {
  val excludedRules =
    SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(!_.isEmpty))
nit: !_.isEmpty -> _.nonEmpty
Also, you need to handle case-sensitivity.
There is an auto-generated field ruleName in Rule, so we do exact name matching (case sensitive).
You can use Utils.stringToSeq?
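For reference, Utils.stringToSeq in org.apache.spark.util.Utils already performs the comma split, trim, and non-empty filter, so (if adopted) the parsing above could shrink to roughly:

```scala
import org.apache.spark.util.Utils

// Same parsing of the comma-separated rule names, reusing the existing helper.
val excludedRules = SQLConf.get.optimizerExcludedRules.toSeq.flatMap(Utils.stringToSeq)
```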
Test build #93220 has finished for PR 21764 at commit

Test build #93400 has finished for PR 21764 at commit
RewritePredicateSubquery.ruleName ::
ColumnPruning.ruleName ::
CollapseProject.ruleName ::
RemoveRedundantProject.ruleName :: Nil
remove the last three?
override def batches: Seq[Batch] = {
  val excludedRulesConf =
    SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(_.nonEmpty))
Any reason not to use Utils.stringToSeq?
#21764 (comment)
+1
No reason. It's just that I didn't know about it. Thank you for pointing this out!
Also, can you update the title? You need to at least add

LGTM pending Jenkins

Test build #93419 has finished for PR 21764 at commit

LGTM, too
|
retest this please
Test build #93435 has finished for PR 21764 at commit
Thanks! Merged to master.
## What changes were proposed in this pull request?

Since Spark has provided fairly clear interfaces for adding user-defined optimization rules, it would be nice to have an easy-to-use interface for excluding an optimization rule from the Spark query optimizer as well. This would make customizing the Spark optimizer easier and could sometimes help with debugging issues too.

- Add a new config spark.sql.optimizer.excludedRules, with the value being a list of rule names separated by comma.
- Modify the current batches method to remove the excluded rules from the default batches. Log the rules that have been excluded.
- Split the existing default batches into "post-analysis batches" and "optimization batches" so that only rules in the "optimization batches" can be excluded.

## How was this patch tested?

Add a new test suite: OptimizerRuleExclusionSuite

Author: maryannxue <[email protected]>

Closes apache#21764 from maryannxue/rule-exclusion.
Hello, it is not very clear to me how to exclude some rules. I have been digging into the test file OptimizerRuleExclusionSuite.scala and am not able to exclude some rules. As the JIRA report for issue SPARK-24802 says:

I tried this: but it is not working since there is no such method. Thank you for any help you can provide.
Hi, @toderesa97. You can use it like this:
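The snippet from the original reply is not preserved here; based on the config added in this PR, usage would look roughly like the sketch below (PushDownPredicate is only an example rule name; any excludable rule's fully qualified ruleName works):

```scala
// Exclude an optimizer rule at runtime through the new SQL config.
spark.sql("SET spark.sql.optimizer.excludedRules=" +
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")

// Or, equivalently, via the session conf API:
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
```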
What changes were proposed in this pull request?
Since Spark has provided fairly clear interfaces for adding user-defined optimization rules, it would be nice to have an easy-to-use interface for excluding an optimization rule from the Spark query optimizer as well.
This would make customizing the Spark optimizer easier and could sometimes help with debugging issues too.
How was this patch tested?
Add a new test suite: OptimizerRuleExclusionSuite