Conversation

@wzhfy
Contributor

@wzhfy wzhfy commented Mar 3, 2017

What changes were proposed in this pull request?

  1. Support boolean type in binary expression estimation.
  2. Deal with compound Not conditions.
  3. Avoid converting BigInt/BigDecimal directly to Double unless the value is within the range (0, 1).
  4. Reorganize test code.

How was this patch tested?

Modified related test cases.

@wzhfy
Contributor Author

wzhfy commented Mar 3, 2017

cc @cloud-fan @ron8hu please review

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73837 has finished for PR 17148 at commit d4787b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// for not-supported condition, set filter selectivity to a conservative estimate 100%
case None => None
}
case Not(cond) =>
Contributor

add some comment to explain this case

case None => None
}
case Not(cond) =>
if (cond.isInstanceOf[And] || cond.isInstanceOf[Or]) {
Contributor

we can just call calculateSingleCondition, which will return None for And and Or

}

Some(1.0 / ndv.toDouble)
Some((1.0 / BigDecimal(ndv)).toDouble)
Contributor

why this?

Contributor Author

ndv is a BigInt, whose range is larger than Double's, so calling toDouble on it directly is not safe here. But 1/ndv lies in (0, 1), so converting that quotient to Double is safe.
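A minimal sketch of why the conversion order matters (not PR code; the value 10^310 is an arbitrary illustration chosen to exceed Double's range):

```scala
// Sketch: BigInt.toDouble overflows for values beyond Double's range,
// while the reciprocal 1/ndv always fits in (0, 1).
val ndv = BigInt(10).pow(310)          // exceeds Double.MaxValue (~1.8e308)
val direct = ndv.toDouble              // overflows to Double.PositiveInfinity
val selectivity = (BigDecimal(1) / BigDecimal(ndv)).toDouble
// selectivity is a small positive Double strictly inside (0, 1)
```

Scala's BigDecimal division uses MathContext.DECIMAL128 by default, so the non-terminating quotient is rounded rather than throwing.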


// determine the overlapping degree between predicate range and column's range
val literalValueBD = BigDecimal(literal.value.toString)
val numericLiteral = if (literal.dataType.isInstanceOf[BooleanType]) {
Contributor

literal.dataType == BooleanType

// Without advanced statistics like histogram, we assume uniform data distribution.
// We just prorate the adjusted range over the initial range to compute filter selectivity.
// For ease of computation, we convert all relevant numeric values to Double.
assert(max > min)
Contributor

why? It's possible that after WHERE a = 1, max and min are the same.

Contributor Author

This is in the partial-overlap case. If max == min, a binary expression must produce either no overlap or a complete overlap. See here:

val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
      case _: LessThan =>
        (numericLiteral <= min, numericLiteral > max)
      case _: LessThanOrEqual =>
        (numericLiteral < min, numericLiteral >= max)
      case _: GreaterThan =>
        (numericLiteral >= max, numericLiteral < min)
      case _: GreaterThanOrEqual =>
        (numericLiteral > max, numericLiteral <= min)
    }
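A quick illustrative check (toy values, not PR code) that with max == min no literal can land in the partial-overlap branch; for every operator above, one of the two flags is always true:

```scala
// Sketch: with max == min, a comparison literal always yields either
// noOverlap or completeOverlap, never a partial overlap.
val (min, max) = (5.0, 5.0)

// (noOverlap, completeOverlap) flags, mirroring the match above
def lessThanFlags(lit: Double) = (lit <= min, lit > max)
def greaterThanFlags(lit: Double) = (lit >= max, lit < min)

for (lit <- Seq(4.0, 5.0, 6.0)) {
  assert(lessThanFlags(lit)._1 || lessThanFlags(lit)._2)
  assert(greaterThanFlags(lit)._1 || greaterThanFlags(lit)._2)
}
```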

case _: LessThan => newMax = newValue
case _: LessThanOrEqual => newMax = newValue
case _: GreaterThan =>
if (newNdv == 1) newMin = newMax else newMin = newValue
Contributor

what's the logic here?

Contributor Author

if the new ndv is 1, then new max and new min must be equal.

Contributor

please add comments to explain this special case

}
case Not(cond) =>
if (cond.isInstanceOf[And] || cond.isInstanceOf[Or]) {
// Don't support compound Not expression.
Contributor

I thought we agreed not to support nested NOT conditions. First we need to clarify what a nested NOT is. Here you allow ((NOT cond1) && (NOT cond2)), but you disallow NOT(cond1 && cond2). Is this right?
How about NOT(cond1 && (NOT cond2))? That third case is a truly nested NOT.

Contributor

Our goal is to not under-estimate, so ((NOT cond1) && (NOT cond2)) should be allowed, while NOT(cond1 && cond2) should not, because we don't know whether cond1 or cond2 is over-estimated.

Contributor Author

If we over-estimate a condition cond1 in calculateSingleCondition, then Not(cond1) becomes an under-estimation, even if cond1 is not a compound condition.

Actually I'm wondering whether it's reasonable to differentiate between under-estimation and over-estimation at all. Since we assume a uniform distribution, we really can't be sure whether we are over-estimating or not.

E.g. for the condition a = 1, we estimate the filter factor as 1/ndv. This can be an over-estimation if 1 is a value in the "long tail", or an under-estimation if 1 is the skewed value of this column.
In fact, what we do now is just use an empirical formula to compute the probability that the condition is satisfied.

So I suggest that we don't care about "nested Not" or "Not", just do what we do for other compound conditions as before:

    case Not(cond) =>
      calculateFilterSelectivity(cond, update = false) match {
        case Some(percent) => Some(1.0 - percent)
        case None => None
      }

What do you think? @cloud-fan @ron8hu

rowCount = 0)
}

test("cint IS NOT NULL") {
Contributor

may add a nested NOT test case.

Contributor Author

ok

update: Boolean): Option[Double] = {
if (!colStatsMap.contains(attr)) {
logDebug("[CBO] No statistics for " + attr)
return None

why return?

Contributor Author

If we don't have stats, there's no need to go through the logic below.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73953 has finished for PR 17148 at commit 7c8d012.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Mar 6, 2017

Updated. I've made the following changes:

  1. Deal with compound Not conditions;
  2. Reorganize test code;
  3. No need to keep column stats for empty output.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74002 has finished for PR 17148 at commit dbfd4c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74005 has finished for PR 17148 at commit 895662d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// then p(And(c1, c2)) = p(c2), and p(Or(c1, c2)) = 1.0.
// But once they are wrapped in NOT condition, then after 1 - p, it becomes
// under-estimation. So in these cases, we consider them as unsupported.
case Not(And(cond1, cond2)) =>
Contributor

If cond1 is also And, and is over-estimated, then we get a problem here.

We should not handle compound Not at all, just call calculateSingleCondition for Not.

Contributor

The current code is fine. If we just call calculateSingleCondition for "case Not(And(cond1, cond2))", it is too restrictive. The current code computes selectivity only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, we just return None. I think it is a clean solution.

Contributor Author

Here's an idea: move Not into the compound condition, then we can handle any combination:

Not(And(cond1, cond2)) = Or(Not(cond1), Not(cond2))
Not(Or(cond1, cond2)) = And(Not(cond1), Not(cond2))
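A hypothetical sketch of that rewrite (a toy ADT for illustration, not Spark's actual Expression classes), pushing Not inward with De Morgan's laws so only leaf conditions carry Not:

```scala
// Toy condition ADT; names mirror the discussion, not Catalyst classes.
sealed trait Cond
case class Leaf(name: String) extends Cond
case class And(l: Cond, r: Cond) extends Cond
case class Or(l: Cond, r: Cond) extends Cond
case class Not(c: Cond) extends Cond

// Push Not down through And/Or via De Morgan, and cancel double negation.
def pushNot(c: Cond): Cond = c match {
  case Not(And(l, r))  => Or(pushNot(Not(l)), pushNot(Not(r)))
  case Not(Or(l, r))   => And(pushNot(Not(l)), pushNot(Not(r)))
  case Not(Not(inner)) => pushNot(inner)
  case And(l, r)       => And(pushNot(l), pushNot(r))
  case Or(l, r)        => Or(pushNot(l), pushNot(r))
  case other           => other   // Leaf, or Not over a leaf
}
```

For example, `pushNot(Not(And(Leaf("a"), Leaf("b"))))` yields `Or(Not(Leaf("a")), Not(Leaf("b")))`, which the existing Or estimation can then handle.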

percent = op match {
case _: LessThan =>
(literalDouble - minDouble) / (maxDouble - minDouble)
if (numericLiteral == max) {
Contributor

can you add some comments to explain this special case?

Contributor Author

ok. updated.

}

val testFilters = if (swappedFilter != filterNode) {
Seq(swappedFilter, filterNode)
Contributor

This is a good rewrite for method validateEstimatedStats. The current code has better readability than tail recursion.

if (p1.isDefined && p2.isDefined) {
Some(1 - (p1.get + p2.get - (p1.get * p2.get)))
} else {
None
Contributor

This is good. We compute selectivity for "Not(Or(cond1, cond2))" only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, then we just return None. It is a clean solution.
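As a numeric sanity check (illustrative values), the expression above agrees with the independence identity: 1 - (p1 + p2 - p1*p2) equals (1 - p1) * (1 - p2).

```scala
// Sketch: selectivity of Not(Or(cond1, cond2)) assuming independent conditions.
val (p1, p2) = (0.3, 0.5)
val notOr = 1 - (p1 + p2 - p1 * p2)
// Equivalent form: both conditions must fail.
assert(math.abs(notOr - (1 - p1) * (1 - p2)) < 1e-9)
```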

@SparkQA

SparkQA commented Mar 7, 2017

Test build #74067 has finished for PR 17148 at commit 685dff3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 932196d Mar 7, 2017