Conversation

@maryannxue
Contributor

What changes were proposed in this pull request?

Since Spark has provided fairly clear interfaces for adding user-defined optimization rules, it would be nice to have an easy-to-use interface for excluding an optimization rule from the Spark query optimizer as well.

This would make customizing the Spark optimizer easier and could sometimes help with debugging issues too.

  • Add a new config spark.sql.optimizer.excludedRules, whose value is a comma-separated list of rule names (see the example after this list).
  • Modify the current batches method to remove the excluded rules from the default batches. Log the rules that have been excluded.
  • Split the existing default batches into "post-analysis batches" and "optimization batches" so that only rules in the "optimization batches" can be excluded.
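
As a rough illustration of the intended usage (a minimal sketch; the rule name below is arbitrary, but the config expects fully qualified rule names like it):

    // Exclude an optimizer rule for the current session via the new config.
    spark.conf.set(
      "spark.sql.optimizer.excludedRules",
      "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")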

How was this patch tested?

Add a new test suite: OptimizerRuleExclusionSuite

@SparkQA

SparkQA commented Jul 14, 2018

Test build #92992 has finished for PR 21764 at commit eaec2f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 14, 2018

Do you have concrete use cases in your business? Basically, I think the optimizer is a black box for most users, and they can't easily tell whether it still works correctly when some rules are excluded. Are there other database-like systems that implement this kind of interface?

Even for debugging, wouldn't it be enough to define an individual test optimizer for each debugging use? e.g.,

@gatorsmile
Member

gatorsmile commented Jul 14, 2018

Let me give an example. The ticket https://issues.apache.org/jira/browse/SPARK-24624 shows a common case in which our optimizer does not work well. It is a bug, but our users are unable to easily bypass it. The most straightforward workaround is to disable the rule.
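
For illustration, with this change such a user could switch the offending rule off at runtime (a sketch; the rule name here is illustrative, not the one from that ticket):

    // Bypass a misbehaving optimizer rule for the current session.
    spark.sql("SET spark.sql.optimizer.excludedRules=" +
      "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")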

ReplaceDeduplicateWithAggregate) :: Nil
}

protected def optimizationBatches: Seq[Batch] = {
Member

In optimizationBatches, some rules can't be excluded. Without them, the affected queries can't be executed. For example,

     Batch("Replace Operators", fixedPoint,
      ReplaceIntersectWithSemiJoin,
      ReplaceExceptWithFilter,
      ReplaceExceptWithAntiJoin,
      ReplaceDistinctWithAggregate)
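
To make the effect concrete: if ReplaceIntersectWithSemiJoin were excluded, an INTERSECT query would reach physical planning with a logical Intersect node that no strategy handles (illustrative snippet, assuming tables t1 and t2 exist):

    // Would fail at physical planning if the replacement rule were excluded.
    spark.sql("SELECT id FROM t1 INTERSECT SELECT id FROM t2").collect()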

Can we just introduce a blacklist?

Contributor Author

So can I do a blacklist of batches?

Member

yes. We need to exclude Batch("Eliminate Distinct"), Batch("Finish Analysis"), Batch("Replace Operators"), Batch("Pullup Correlated Expressions"), and Batch("RewriteSubquery")

}
!exclude
}
if (batch.rules == filteredRules) {
Member

Maybe the if, else if and else can be removed? Just return the filtered batch?

My understanding is that it is written that way to allow for logging

Contributor Author

It is to (see the sketch after this list):

  1. avoid unnecessary object creation if all rules have been preserved.
  2. avoid empty batches if all rules in the batch have been removed.
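
A minimal sketch of the shape being discussed (names approximate the PR's code rather than quote it):

    // For each default batch, keep only the rules that are not excluded.
    batches.flatMap { batch =>
      val filteredRules = batch.rules.filter { rule =>
        val exclude = excludedRules.contains(rule.ruleName)
        if (exclude) logInfo(s"Optimization rule '${rule.ruleName}' is excluded.")
        !exclude
      }
      if (batch.rules == filteredRules) {
        Some(batch)        // nothing excluded: reuse the existing Batch object
      } else if (filteredRules.nonEmpty) {
        Some(Batch(batch.name, batch.strategy, filteredRules: _*))
      } else {
        logInfo(s"Optimization batch '${batch.name}' is excluded entirely.")
        None               // every rule excluded: drop the now-empty batch
      }
    }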

@maropu
Member

maropu commented Jul 15, 2018

@gatorsmile aha, ok. Do we need to make this option external rather than internal?

BTW, the interfaces to add and remove optimizer rules are different (addition via ExperimentalMethods, removal via SQLConf); is this design OK?

@gatorsmile
Member

@maropu This is for advanced end users or Spark developers. An external conf looks fine, but I have to admit this might be rarely used. BTW, after having this conf, we can deprecate a few internal configurations that are used for disabling specific optimizer rules in Spark 3.0.

@maropu
Member

maropu commented Jul 15, 2018

ok, thx for the kind explanation.

Batch("Eliminate Distinct", Once, EliminateDistinct) ::
// Technically some of the rules in Finish Analysis are not optimizer rules and belong more
// in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
// However, because we also use the analyzer to canonicalized queries (for view definition),

"to canonicalized" -> "to canonicalize" ?

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
.doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
"specified by their rule names and separated by comma. It is not guaranteed that all the " +
"rules in this configuration will eventually be excluded, as some rules are necessary " +

I don't understand the optimizer at a low level (I'd be one of those users for whom it is a black box). Do you think it would be feasible to enumerate the rules that cannot be excluded? Maybe even log a WARNING when validating the config parameters if they contain required rules.

Contributor Author

Nice suggestion! @gatorsmile's other suggestion was to introduce a blacklist, which would make it possible to enumerate the rules that cannot be excluded. I can add a warning as well.

import org.apache.spark.sql.internal.SQLConf.OPTIMIZER_EXCLUDED_RULES


class OptimizerRuleExclusionSuite extends PlanTest {

Any test case for when a required rule is passed as a "to be excluded" rule?

Contributor Author

Added :)
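
For flavor, such a test might look roughly like this (a hypothetical sketch, not the PR's actual code; SimpleTestOptimizer and withSQLConf are assumed to be available in catalyst's test utilities):

    test("Try to exclude a non-excludable rule") {
      // Excluding a required rule should be a silent no-op.
      withSQLConf(
          OPTIMIZER_EXCLUDED_RULES.key -> ReplaceIntersectWithSemiJoin.ruleName) {
        val ruleNames = SimpleTestOptimizer.batches.flatMap(_.rules.map(_.ruleName))
        assert(ruleNames.contains(ReplaceIntersectWithSemiJoin.ruleName))
      }
    }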

}
}

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
Member

If we allow setting this here, it will affect Spark's caching/uncaching of plans and tables inconsistently. For the purpose of this PR, StaticSQLConf.scala would be a perfect place for this.

Contributor Author

Are you talking about the SQL cache? I don't think the optimizer has anything to do with the SQL cache, though, since the logical plans used to match cache entries are "analyzed" plans, not "optimized" plans.

Member

Since the optimizer should not change query semantics (results), it should work well for the case @dongjoon-hyun described. If this is mainly used for debugging, I think it would be nice to be able to set this conf at runtime.

Contributor Author

+1 on the debugging purpose. Still, CacheManager matches the analyzed plan, not the optimized plan.

@SparkQA

SparkQA commented Jul 18, 2018

Test build #93217 has finished for PR 21764 at commit 84f1a6b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 18, 2018

retest this please

@maropu
Member

maropu commented Jul 18, 2018

btw, I feel the title is a little obscure; how about [SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion?

"Finish Analysis" ::
"Replace Operators" ::
"Pullup Correlated Expressions" ::
"RewriteSubquery" :: Nil
Member

We use batch names rather than rule names in this blacklist?

Contributor Author

I'll change it to a rule blacklist.


override def batches: Seq[Batch] = {
val excludedRules =
SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(!_.isEmpty))
Member

nit: !_.isEmpty -> _.nonEmpty

Member

Also, you need to handle case-sensitivity.

Contributor Author

There is an auto-generated field ruleName in Rule, so we do exact name matching (case sensitive).
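
For context, ruleName in catalyst's Rule is derived from the class name, roughly like this (paraphrased from memory, so treat it as a sketch):

    // Name for this rule, automatically inferred from the class name
    // (the trailing '$' of Scala object class names is dropped).
    val ruleName: String = {
      val className = getClass.getName
      if (className.endsWith("$")) className.dropRight(1) else className
    }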

Member

You can use Utils.stringToSeq?
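
For reference, Utils.stringToSeq does essentially the same split-trim-filter as the line above (paraphrased from memory; see org.apache.spark.util.Utils):

    def stringToSeq(str: String): Seq[String] =
      str.split(",").map(_.trim()).filter(_.nonEmpty)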

@SparkQA

SparkQA commented Jul 18, 2018

Test build #93220 has finished for PR 21764 at commit 84f1a6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 22, 2018

Test build #93400 has finished for PR 21764 at commit b154979.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

RewritePredicateSubquery.ruleName ::
ColumnPruning.ruleName ::
CollapseProject.ruleName ::
RemoveRedundantProject.ruleName :: Nil
Member

remove the last three?


override def batches: Seq[Batch] = {
val excludedRulesConf =
SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(_.nonEmpty))
Member

Any reason not to use Utils.stringToSeq?
#21764 (comment)

Member

+1

Contributor Author

No reason. It's just that I didn't know about it. Thank you for pointing this out!

@maropu
Member

maropu commented Jul 23, 2018

Also, can you update the title? You need to at least add [SQL] in the title: #21764 (comment)

@maryannxue maryannxue changed the title [SPARK-24802] Optimization Rule Exclusion SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion Jul 23, 2018
@maryannxue maryannxue changed the title SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion 【SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion Jul 23, 2018
@maryannxue maryannxue changed the title 【SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion [SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion Jul 23, 2018
@gatorsmile
Member

LGTM pending Jenkins

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93419 has finished for PR 21764 at commit 87afe4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 23, 2018

LGTM, too

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93428 has finished for PR 21764 at commit 39b6ce9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93435 has finished for PR 21764 at commit 39b6ce9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 434319e Jul 23, 2018
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jul 23, 2018
@toderesa97

Hello,

It is not very clear to me how to exclude some rules. I have been digging into the test file OptimizerRuleExclusionSuite.scala and have not been able to exclude any rules. As the JIRA report for issue SPARK-24802 says:

Add a new config spark.sql.optimizer.excludedRules, with the value being a list of rule names separated by comma

I tried this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.optimizer._

object Main {

  def getExcludeRules: Seq[String] = {
    Seq(
      PushPredicateThroughJoin.ruleName
    )
  }

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .config("spark.sql.optimizer.excludedRules", getExcludeRules)
      .master("local[*]")
      .getOrCreate
    val sc = spark.sparkContext

    // whatever

  }
}

but it is not working, since there is no config method that takes a Seq[String] as its second parameter.

Thank you for any help you can provide.

@maropu
Member

maropu commented Apr 20, 2020

Hi, @toderesa97. You can use it like this:

scala> Seq("abc", "def").toDF("v").write.saveAsTable("t")
scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND EndsWith(v#18, bc))
                                    ^^^^^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...

scala> sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LikeSimplification")

scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND v#18 LIKE %bc)
                                         ^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...
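
For the programmatic attempt in the question above: Builder.config takes plain string values, so joining the rule names with commas should work (a sketch reusing the question's getExcludeRules):

    // Pass the Seq[String] as a single comma-separated string value.
    val spark = SparkSession.builder
      .config("spark.sql.optimizer.excludedRules", getExcludeRules.mkString(","))
      .master("local[*]")
      .getOrCreate()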
