Conversation

@maryannxue
Contributor

What changes were proposed in this pull request?

Since Spark has provided fairly clear interfaces for adding user-defined optimization rules, it would be nice to have an easy-to-use interface for excluding an optimization rule from the Spark query optimizer as well.

This would make customizing the Spark optimizer easier and could sometimes help with debugging issues too.

  • Add a new config spark.sql.optimizer.excludedRules, whose value is a comma-separated list of rule names (see the example after this list).
  • Modify the current batches method to remove the excluded rules from the default batches. Log the rules that have been excluded.
  • Split the existing default batches into "post-analysis batches" and "optimization batches" so that only rules in the "optimization batches" can be excluded.
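
As a rough illustration of the intended usage (a minimal sketch; the rule name below is arbitrary, but the config expects fully qualified rule names like it):

    // Exclude an optimizer rule for the current session via the new config.
    spark.conf.set(
      "spark.sql.optimizer.excludedRules",
      "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")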

How was this patch tested?

Add a new test suite: OptimizerRuleExclusionSuite

@SparkQA

SparkQA commented Jul 14, 2018

Test build #92992 has finished for PR 21764 at commit eaec2f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 14, 2018

Do you have concrete use cases in your business? Basically, I think the optimizer is a black box for most users, and they can't easily tell whether it still works correctly when some rules are excluded. Are there other database-like systems that implement this kind of interface?

Even for debugging, wouldn't it be enough to define an individual test optimizer for each debugging use? e.g.,

@gatorsmile
Member

gatorsmile commented Jul 14, 2018

Let me give an example. The ticket https://issues.apache.org/jira/browse/SPARK-24624 shows a common case in which our optimizer does not work well. It is a bug, but our users are unable to easily bypass it. The most straightforward workaround is to disable the rule.
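
For illustration, with this change such a user could switch the offending rule off at runtime (a sketch; the rule name here is illustrative, not the one from that ticket):

    // Bypass a misbehaving optimizer rule for the current session.
    spark.sql("SET spark.sql.optimizer.excludedRules=" +
      "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")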

ReplaceDeduplicateWithAggregate) :: Nil
}

protected def optimizationBatches: Seq[Batch] = {
Member

In optimizationBatches, some rules can't be excluded. Without them, the affected queries can't be executed. For example,

     Batch("Replace Operators", fixedPoint,
      ReplaceIntersectWithSemiJoin,
      ReplaceExceptWithFilter,
      ReplaceExceptWithAntiJoin,
      ReplaceDistinctWithAggregate)
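
To make the effect concrete: if ReplaceIntersectWithSemiJoin were excluded, an INTERSECT query would reach physical planning with a logical Intersect node that no strategy handles (illustrative snippet, assuming tables t1 and t2 exist):

    // Would fail at physical planning if the replacement rule were excluded.
    spark.sql("SELECT id FROM t1 INTERSECT SELECT id FROM t2").collect()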

Can we just introduce a blacklist?

Contributor Author

So can I do a blacklist of batches?

Member

yes. We need to exclude Batch("Eliminate Distinct"), Batch("Finish Analysis"), Batch("Replace Operators"), Batch("Pullup Correlated Expressions"), and Batch("RewriteSubquery")

}
!exclude
}
if (batch.rules == filteredRules) {
Member

Maybe the if, else if and else can be removed? Just return the filtered batch?

My understanding is that it is written that way to allow for logging

Contributor Author

It is to (see the sketch after this list):

  1. avoid unnecessary object creation if all rules have been preserved.
  2. avoid empty batches if all rules in the batch have been removed.
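
A minimal sketch of the shape being discussed (names approximate the PR's code rather than quote it):

    // For each default batch, keep only the rules that are not excluded.
    batches.flatMap { batch =>
      val filteredRules = batch.rules.filter { rule =>
        val exclude = excludedRules.contains(rule.ruleName)
        if (exclude) logInfo(s"Optimization rule '${rule.ruleName}' is excluded.")
        !exclude
      }
      if (batch.rules == filteredRules) {
        Some(batch)        // nothing excluded: reuse the existing Batch object
      } else if (filteredRules.nonEmpty) {
        Some(Batch(batch.name, batch.strategy, filteredRules: _*))
      } else {
        logInfo(s"Optimization batch '${batch.name}' is excluded entirely.")
        None               // every rule excluded: drop the now-empty batch
      }
    }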

@maropu
Member

maropu commented Jul 15, 2018

@gatorsmile aha, ok. Do we need to make this option external rather than internal?

BTW, the interfaces to add and remove optimizer rules are different (addition via ExperimentalMethods, removal via SQLConf); is this design OK?

@gatorsmile
Member

@maropu This is for advanced end users or Spark developers. An external conf looks fine, but I have to admit this might be rarely used. BTW, after having this conf, we can deprecate a few internal configurations that are used for disabling specific optimizer rules in Spark 3.0.

@maropu
Member

maropu commented Jul 15, 2018

ok, thx for the kind explanation.

Batch("Eliminate Distinct", Once, EliminateDistinct) ::
// Technically some of the rules in Finish Analysis are not optimizer rules and belong more
// in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
// However, because we also use the analyzer to canonicalized queries (for view definition),

"to canonicalized" -> "to canonicalize" ?

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
.doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
"specified by their rule names and separated by comma. It is not guaranteed that all the " +
"rules in this configuration will eventually be excluded, as some rules are necessary " +

I don't understand the optimizer at a low level (I'd be one of those users for whom it is a black box). Do you think it would be feasible to enumerate the rules that cannot be excluded? Maybe even log a WARNING when validating the config parameters if they contain required rules.

Contributor Author

Nice suggestion! @gatorsmile's other suggestion was to introduce a blacklist, which would make it possible to enumerate the rules that cannot be excluded. I can add a warning as well.

import org.apache.spark.sql.internal.SQLConf.OPTIMIZER_EXCLUDED_RULES


class OptimizerRuleExclusionSuite extends PlanTest {

Any test case for when a required rule is passed as a "to be excluded" rule?

Contributor Author

Added :)
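
For flavor, such a test might look roughly like this (a hypothetical sketch, not the PR's actual code; SimpleTestOptimizer and withSQLConf are assumed to be available in catalyst's test utilities):

    test("Try to exclude a non-excludable rule") {
      // Excluding a required rule should be a silent no-op.
      withSQLConf(
          OPTIMIZER_EXCLUDED_RULES.key -> ReplaceIntersectWithSemiJoin.ruleName) {
        val ruleNames = SimpleTestOptimizer.batches.flatMap(_.rules.map(_.ruleName))
        assert(ruleNames.contains(ReplaceIntersectWithSemiJoin.ruleName))
      }
    }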

}
}

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
Member

If we allow setting this here, it will affect Spark's caching/uncaching of plans and tables inconsistently. For the purpose of this PR, StaticSQLConf.scala would be a perfect place for this.

Contributor Author

Are you talking about the SQL cache? I don't think the optimizer has anything to do with the SQL cache, though, since the logical plans used to match cache entries are "analyzed" plans, not "optimized" plans.

Member

Since the optimizer should not change query semantics (results), it should work well for the case @dongjoon-hyun described. If this is mainly used for debugging, I think it would be nice to be able to set this conf at runtime.

Contributor Author

+1 on the debugging purpose. Still, CacheManager matches the analyzed plan, not the optimized plan.

@SparkQA

SparkQA commented Jul 18, 2018

Test build #93217 has finished for PR 21764 at commit 84f1a6b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 18, 2018

retest this please

@maropu
Member

maropu commented Jul 18, 2018

btw, I feel the title is a little obscure; how about [SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion?

"Finish Analysis" ::
"Replace Operators" ::
"Pullup Correlated Expressions" ::
"RewriteSubquery" :: Nil
Member

We use batch names rather than rule names in this blacklist?

Contributor Author

I'll change it to a rule blacklist.


override def batches: Seq[Batch] = {
val excludedRules =
SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(!_.isEmpty))
Member

nit: !_.isEmpty -> _.nonEmpty

Member

Also, you need to handle case-sensitivity.

Contributor Author

There is an auto-generated field ruleName in Rule, so we do exact name matching (case sensitive).
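
For context, ruleName in catalyst's Rule is derived from the class name, roughly like this (paraphrased from memory, so treat it as a sketch):

    // Name for this rule, automatically inferred from the class name
    // (the trailing '$' of Scala object class names is dropped).
    val ruleName: String = {
      val className = getClass.getName
      if (className.endsWith("$")) className.dropRight(1) else className
    }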

Member

You can use Utils.stringToSeq?
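
For reference, Utils.stringToSeq does essentially the same split-trim-filter as the line above (paraphrased from memory; see org.apache.spark.util.Utils):

    def stringToSeq(str: String): Seq[String] =
      str.split(",").map(_.trim()).filter(_.nonEmpty)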

@SparkQA

SparkQA commented Jul 18, 2018

Test build #93220 has finished for PR 21764 at commit 84f1a6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 22, 2018

Test build #93400 has finished for PR 21764 at commit b154979.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

RewritePredicateSubquery.ruleName ::
ColumnPruning.ruleName ::
CollapseProject.ruleName ::
RemoveRedundantProject.ruleName :: Nil
Member

remove the last three?


override def batches: Seq[Batch] = {
val excludedRulesConf =
SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(_.nonEmpty))
Member

Any reason not to use Utils.stringToSeq?
#21764 (comment)

Member

+1

Contributor Author

No reason. It's just that I didn't know about it. Thank you for pointing this out!

@maropu
Member

maropu commented Jul 23, 2018

Also, can you update the title? You need to at least add [SQL] in the title: #21764 (comment)

@maryannxue maryannxue changed the title [SPARK-24802] Optimization Rule Exclusion SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion Jul 23, 2018
@maryannxue maryannxue changed the title SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion 【SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion Jul 23, 2018
@maryannxue maryannxue changed the title 【SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion [SPARK-24802][SQL] Add a new config for Optimization Rule Exclusion Jul 23, 2018
@gatorsmile
Member

LGTM pending Jenkins

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93419 has finished for PR 21764 at commit 87afe4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 23, 2018

LGTM, too

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93428 has finished for PR 21764 at commit 39b6ce9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jul 23, 2018

Test build #93435 has finished for PR 21764 at commit 39b6ce9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in 434319e Jul 23, 2018
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jul 23, 2018
@toderesa97

Hello,

It is not very clear to me how to exclude some rules. I have been digging into the test file OptimizerRuleExclusionSuite.scala and have not been able to exclude any rules. As the JIRA report for issue SPARK-24802 says:

Add a new config spark.sql.optimizer.excludedRules, with the value being a list of rule names separated by comma

I tried this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.optimizer._

object Main {

  def getExcludeRules: Seq[String] = {
    Seq(
      PushPredicateThroughJoin.ruleName
    )
  }

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .config("spark.sql.optimizer.excludedRules", getExcludeRules)
      .master("local[*]")
      .getOrCreate
    val sc = spark.sparkContext

    // whatever

  }
}

but it is not working, since there is no config method that takes a Seq[String] as its second parameter.

Thank you for any help you can provide.

@maropu
Member

maropu commented Apr 20, 2020

Hi, @toderesa97. You can use it like this:

scala> Seq("abc", "def").toDF("v").write.saveAsTable("t")
scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND EndsWith(v#18, bc))
                                    ^^^^^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...

scala> sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LikeSimplification")

scala> sql("SELECT * FROM t WHERE v LIKE '%bc'").explain()
== Physical Plan ==
*(1) Project [v#18]
+- *(1) Filter (isnotnull(v#18) AND v#18 LIKE %bc)
                                         ^^^^
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t[v#18] ...
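
For the programmatic attempt in the question above: Builder.config takes plain string values, so joining the rule names with commas should work (a sketch reusing the question's getExcludeRules):

    // Pass the Seq[String] as a single comma-separated string value.
    val spark = SparkSession.builder
      .config("spark.sql.optimizer.excludedRules", getExcludeRules.mkString(","))
      .master("local[*]")
      .getOrCreate()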
