[SPARK-17075][SQL][followup] fix filter estimation issues #17148

wzhfy · 2017-03-03T09:21:25Z

What changes were proposed in this pull request?

Support boolean type in binary expression estimation.
Deal with compound Not conditions.
Avoid converting BigInt/BigDecimal directly to Double unless the value is within the range (0, 1).
Reorganize test code.

How was this patch tested?

Modified the related test cases.

wzhfy · 2017-03-03T09:24:09Z

cc @cloud-fan @ron8hu please review

SparkQA · 2017-03-03T11:25:23Z

Test build #73837 has finished for PR 17148 at commit d4787b8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-03-03T19:44:37Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

-        // for not-supported condition, set filter selectivity to a conservative estimate 100%
-        case None => None
-      }
+      case Not(cond) =>


add some comment to explain this case

cloud-fan · 2017-03-03T19:45:02Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

-        case None => None
-      }
+      case Not(cond) =>
+        if (cond.isInstanceOf[And] || cond.isInstanceOf[Or]) {


we can just call calculateSingleCondition, which will return None for And and Or

cloud-fan · 2017-03-03T19:46:48Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

      }

-      Some(1.0 / ndv.toDouble)
+      Some((1.0 / BigDecimal(ndv)).toDouble)


ndv is a BigInt, whose range is larger than that of Double, so calling toDouble on it directly is not safe here. 1/ndv, on the other hand, lies in (0, 1], so converting that to Double is safe.
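
The point about conversion safety can be illustrated with a minimal sketch. The object and method names below are hypothetical and not taken from the patch; the sketch only assumes a positive ndv and shows that a BigInt beyond Double's range converts to infinity, while the reciprocal computed in BigDecimal stays within [0, 1] after conversion.

```scala
object NdvSelectivitySketch {
  // Hypothetical helper (not from the patch): selectivity of an equality
  // predicate under the uniform-distribution assumption, i.e. 1 / ndv.
  def equalitySelectivity(ndv: BigInt): Double = {
    require(ndv > 0, "ndv must be positive")
    // The quotient lies in (0, 1], so the conversion cannot overflow.
    // For an extremely large ndv it may underflow to 0.0.
    (BigDecimal(1) / BigDecimal(ndv)).toDouble
  }

  // A BigInt can exceed Double.MaxValue (roughly 1.8e308).
  val hugeNdv: BigInt = BigInt(10).pow(400)
}
```

Calling `NdvSelectivitySketch.hugeNdv.toDouble` yields positive infinity, which is the kind of value the review comment wants to keep out of later arithmetic.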

cloud-fan · 2017-03-03T19:49:17Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala


    // determine the overlapping degree between predicate range and column's range
-    val literalValueBD = BigDecimal(literal.value.toString)
+    val numericLiteral = if (literal.dataType.isInstanceOf[BooleanType]) {


literal.dataType == BooleanType

cloud-fan · 2017-03-03T19:51:24Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

      // Without advanced statistics like histogram, we assume uniform data distribution.
      // We just prorate the adjusted range over the initial range to compute filter selectivity.
-      // For ease of computation, we convert all relevant numeric values to Double.
+      assert(max > min)


Why? After WHERE a = 1 it is possible that max and min are the same.

This assert is in the partial-overlap case. If max == min, a binary expression must result in either no overlap or complete overlap. See here:

val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
  case _: LessThan => (numericLiteral <= min, numericLiteral > max)
  case _: LessThanOrEqual => (numericLiteral < min, numericLiteral >= max)
  case _: GreaterThan => (numericLiteral >= max, numericLiteral < min)
  case _: GreaterThanOrEqual => (numericLiteral > max, numericLiteral <= min)
}
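
The claim that max == min rules out partial overlap can be checked with a small self-contained sketch. The `Op` hierarchy and the `overlap` and `isPartial` helpers are hypothetical stand-ins for the Catalyst expressions, written only to mirror the match quoted above.

```scala
object OverlapSketch {
  // Hypothetical stand-ins for Catalyst's comparison expressions.
  sealed trait Op
  case object LessThan extends Op
  case object LessThanOrEqual extends Op
  case object GreaterThan extends Op
  case object GreaterThanOrEqual extends Op

  // Returns (noOverlap, completeOverlap), following the quoted match.
  def overlap(op: Op, lit: BigDecimal, min: BigDecimal, max: BigDecimal): (Boolean, Boolean) =
    op match {
      case LessThan => (lit <= min, lit > max)
      case LessThanOrEqual => (lit < min, lit >= max)
      case GreaterThan => (lit >= max, lit < min)
      case GreaterThanOrEqual => (lit > max, lit <= min)
    }

  // Partial overlap is what remains when neither flag is set.
  def isPartial(op: Op, lit: BigDecimal, min: BigDecimal, max: BigDecimal): Boolean = {
    val (none, complete) = overlap(op, lit, min, max)
    !none && !complete
  }
}
```

With min == max, the two flags of every operator cover all possible literals, so `isPartial` is never true and the assert in the partial-overlap branch is not reached with an empty range.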

cloud-fan · 2017-03-03T19:53:32Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

-          case _: LessThan => newMax = newValue
-          case _: LessThanOrEqual => newMax = newValue
+          case _: GreaterThan =>
+            if (newNdv == 1) newMin = newMax else newMin = newValue


what's the logic here?

if the new ndv is 1, then new max and new min must be equal.

please add comments to explain this special case

ron8hu · 2017-03-04T00:06:17Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

-      }
+      case Not(cond) =>
+        if (cond.isInstanceOf[And] || cond.isInstanceOf[Or]) {
+          // Don't support compound Not expression.


I thought we had agreed not to support nested NOT conditions. First we need to clarify what a nested NOT is. Here you allow ((NOT cond1) && (NOT cond2)), but you disallow NOT(cond1 && cond2). Is this right?
What about the case NOT(cond1 && (NOT cond2))? This third case is a truly nested NOT.

Our goal is to avoid under-estimation, so ((NOT cond1) && (NOT cond2)) should be allowed, but NOT(cond1 && cond2) should not be, because we don't know whether cond1 or cond2 is over-estimated.

If we over-estimate for a condition cond1 in calculateSingleCondition, then Not(cond1) becomes under-estimation, even if cond1 is not a compound condition.

Actually I'm thinking whether it's reasonable to differentiate between under-estimation and over-estimation. Since we assume uniform distribution, we really can't be sure if we are over-estimating or not.

E.g. for the condition a=1, we estimate the filter factor as 1/ndv. This can be an over-estimate if 1 is a value in the "long tail", or an under-estimate if 1 is the skewed value of this column.
In fact, what we do now is just use an empirical formula to compute the probability that the condition is satisfied.
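
A small worked example makes the over- versus under-estimation point concrete. The column values below are made up purely for illustration, and the object name is hypothetical.

```scala
object UniformAssumptionSketch {
  // Made-up column: the value 1 is heavily skewed, 4 sits in the long tail.
  val column: Seq[Int] = Seq(1, 1, 1, 1, 1, 1, 1, 2, 3, 4)

  // Estimate for `a = v` under the uniform assumption: 1 / ndv.
  val estimated: Double = 1.0 / column.distinct.size

  // True fraction of rows that satisfy `a = v`.
  def actual(v: Int): Double = column.count(_ == v).toDouble / column.size
}
```

Here the estimate is 0.25 for every value, while the true selectivity is 0.7 for a = 1 (an under-estimate) and 0.1 for a = 4 (an over-estimate).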

So I suggest that we don't care about "nested Not" or "Not", just do what we do for other compound conditions as before:

case Not(cond) =>
  calculateFilterSelectivity(cond, update = false) match {
    case Some(percent) => Some(1.0 - percent)
    case None => None
  }

What do you think? @cloud-fan @ron8hu
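
The suggestion to treat Not uniformly can be sketched outside of Catalyst as follows. The tiny `Cond` tree is a hypothetical stand-in for real expressions, and `Leaf` simply carries a selectivity that is either known or unknown; none of these names come from the patch.

```scala
object NotSelectivitySketch {
  // Hypothetical mini expression tree (not Catalyst's classes).
  sealed trait Cond
  // A leaf predicate whose selectivity may be unknown (None).
  case class Leaf(selectivity: Option[Double]) extends Cond
  case class Not(child: Cond) extends Cond

  // Not(cond) is estimated as 1 - p(cond); an unknown child stays unknown.
  def selectivity(c: Cond): Option[Double] = c match {
    case Leaf(p) => p
    case Not(child) => selectivity(child).map(1.0 - _)
  }
}
```

Under this scheme a nested Not needs no special handling: `selectivity(Not(Not(Leaf(Some(0.25)))))` recurses twice and gives back `Some(0.25)`.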

ron8hu · 2017-03-04T00:09:57Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

+      rowCount = 0)
  }

  test("cint IS NOT NULL") {


Consider adding a nested NOT test case.

ninadvps · 2017-03-04T01:37:48Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

      update: Boolean): Option[Double] = {
+    if (!colStatsMap.contains(attr)) {
+      logDebug("[CBO] No statistics for " + attr)
+      return None


why return?

If we don't have stats, there's no need to go through the logic below.

SparkQA · 2017-03-06T03:33:49Z

Test build #73953 has finished for PR 17148 at commit 7c8d012.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-03-06T12:54:18Z

Updated. I've made the following changes:

deal with compound Not conditions;
reorganize test code;
no need to keep column stats for empty output.

SparkQA · 2017-03-06T14:32:25Z

Test build #74002 has finished for PR 17148 at commit dbfd4c7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-06T14:56:38Z

Test build #74005 has finished for PR 17148 at commit 895662d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-03-06T18:52:53Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+      // then p(And(c1, c2)) = p(c2), and p(Or(c1, c2)) = 1.0.
+      // But once they are wrapped in NOT condition, then after 1 - p, it becomes
+      // under-estimation. So in these cases, we consider them as unsupported.
+      case Not(And(cond1, cond2)) =>


If cond1 is also And, and is over-estimated, then we get a problem here.

We should not handle compound Not at all, just call calculateSingleCondition for Not.

The current code is fine. If we just call calculateSingleCondition for "case Not(And(cond1, cond2))", it is too restrictive. The current code computes selectivity only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, we just return None. I think it is a clean solution.

Here's an idea: move Not into the compound condition, then we can handle any combinations:

Not(And(cond1, cond2)) = Or(Not(cond1), Not(cond2))
Not(Or(cond1, cond2)) = And(Not(cond1), Not(cond2))
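
The rewrite idea above can be sketched as a recursive transformation. The `Cond` classes and `pushNotDown` are hypothetical and only imitate the shape of Catalyst expressions; the double-negation case is an addition of this sketch, not something stated in the thread.

```scala
object DeMorganSketch {
  // Hypothetical mini expression tree (not Catalyst's classes).
  sealed trait Cond
  case class Leaf(name: String) extends Cond
  case class And(left: Cond, right: Cond) extends Cond
  case class Or(left: Cond, right: Cond) extends Cond
  case class Not(child: Cond) extends Cond

  // Push Not inward with De Morgan's laws until it wraps only leaves.
  def pushNotDown(c: Cond): Cond = c match {
    case Not(And(l, r)) => Or(pushNotDown(Not(l)), pushNotDown(Not(r)))
    case Not(Or(l, r)) => And(pushNotDown(Not(l)), pushNotDown(Not(r)))
    // Assumption of this sketch: also cancel double negation.
    case Not(Not(inner)) => pushNotDown(inner)
    case And(l, r) => And(pushNotDown(l), pushNotDown(r))
    case Or(l, r) => Or(pushNotDown(l), pushNotDown(r))
    case other => other
  }
}
```

After the rewrite every Not sits directly on a single condition, so the existing And/Or handling can be reused for any combination, including the NOT(cond1 && (NOT cond2)) case raised earlier in the thread.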

cloud-fan · 2017-03-06T18:58:35Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

      percent = op match {
        case _: LessThan =>
-          (literalDouble - minDouble) / (maxDouble - minDouble)
+          if (numericLiteral == max) {


can you add some comments to explain this special case?

ok. updated.

ron8hu · 2017-03-06T21:29:14Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

+    }
+
+    val testFilters = if (swappedFilter != filterNode) {
+      Seq(swappedFilter, filterNode)


This is a good rewrite of the method validateEstimatedStats. The current code is more readable than the tail recursion.

ron8hu · 2017-03-07T00:03:01Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+        if (p1.isDefined && p2.isDefined) {
+          Some(1 - (p1.get + p2.get - (p1.get * p2.get)))
+        } else {
+          None


This is good. We compute selectivity for "Not(Or(cond1, cond2))" only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, then we just return None. It is a clean solution.
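
The rule described here, returning a value only when both sides are known, can be written compactly with Option. This is a sketch under the same independence assumption as the formula in the diff; the object and method names are hypothetical.

```scala
object NotOrSketch {
  // Selectivity of Not(Or(c1, c2)) given the selectivities of c1 and c2.
  // Returns None as soon as either input is unknown.
  def notOr(p1: Option[Double], p2: Option[Double]): Option[Double] =
    for (a <- p1; b <- p2) yield 1 - (a + b - a * b)
}
```

For example, `NotOrSketch.notOr(Some(0.5), Some(0.5))` gives `Some(0.25)`, which matches (1 - p1) * (1 - p2), while a missing input on either side yields `None`.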

ron8hu · 2017-03-07T00:06:40Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+      // then p(And(c1, c2)) = p(c2), and p(Or(c1, c2)) = 1.0.
+      // But once they are wrapped in NOT condition, then after 1 - p, it becomes
+      // under-estimation. So in these cases, we consider them as unsupported.
+      case Not(And(cond1, cond2)) =>


The current code is fine. If we just call calculateSingleCondition for "case Not(And(cond1, cond2))", it is too restrictive. The current code computes selectivity only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, we just return None. I think it is a clean solution.

SparkQA · 2017-03-07T06:02:06Z

Test build #74067 has finished for PR 17148 at commit 685dff3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-03-07T07:54:12Z

thanks, merging to master!

fix filter estimation issues

d4787b8

cloud-fan reviewed Mar 3, 2017

ron8hu suggested changes Mar 4, 2017

ninadvps reviewed Mar 4, 2017

minor fix

7c8d012

wzhfy and others added 2 commits March 6, 2017 20:30

1. deal with compound Not conditions; 2. reorganize test code; 3. no need to keep column stats for empty output

dbfd4c7

fix import

895662d

cloud-fan reviewed Mar 6, 2017

ron8hu reviewed Mar 6, 2017

ron8hu suggested changes Mar 7, 2017

convert compound Not conditions

685dff3

asfgit closed this in 932196d Mar 7, 2017
