-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17075][SQL][followup] fix filter estimation issues #17148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan @ron8hu please review |
|
Test build #73837 has finished for PR 17148 at commit
|
| // for not-supported condition, set filter selectivity to a conservative estimate 100% | ||
| case None => None | ||
| } | ||
| case Not(cond) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some comment to explain this case
| case None => None | ||
| } | ||
| case Not(cond) => | ||
| if (cond.isInstanceOf[And] || cond.isInstanceOf[Or]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can just call calculateSingleCondition, which will return None for And and Or
| } | ||
|
|
||
| Some(1.0 / ndv.toDouble) | ||
| Some((1.0 / BigDecimal(ndv)).toDouble) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ndv is a BigInt, its range is bigger than double, so toDouble is not safe here, while 1/ndv is in (0, 1), so toDouble is safe
|
|
||
| // determine the overlapping degree between predicate range and column's range | ||
| val literalValueBD = BigDecimal(literal.value.toString) | ||
| val numericLiteral = if (literal.dataType.isInstanceOf[BooleanType]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
literal.dataType == BooleanType
| // Without advanced statistics like histogram, we assume uniform data distribution. | ||
| // We just prorate the adjusted range over the initial range to compute filter selectivity. | ||
| // For ease of computation, we convert all relevant numeric values to Double. | ||
| assert(max > min) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? it's possible after WHERE a = 1 that max and min are same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's in the partial overlap case, if max == min, it must be either no overlap or complete overlap for a binary expression. see here:
val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
case _: LessThan =>
(numericLiteral <= min, numericLiteral > max)
case _: LessThanOrEqual =>
(numericLiteral < min, numericLiteral >= max)
case _: GreaterThan =>
(numericLiteral >= max, numericLiteral < min)
case _: GreaterThanOrEqual =>
(numericLiteral > max, numericLiteral <= min)
}
| case _: LessThan => newMax = newValue | ||
| case _: LessThanOrEqual => newMax = newValue | ||
| case _: GreaterThan => | ||
| if (newNdv == 1) newMin = newMax else newMin = newValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the logic here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the new ndv is 1, then new max and new min must be equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add comments to explain this special case
| } | ||
| case Not(cond) => | ||
| if (cond.isInstanceOf[And] || cond.isInstanceOf[Or]) { | ||
| // Don't support compound Not expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that we agreed not to support the nested NOT condition. First we need to clarify what is nested NOT. Here You allow ((NOT cond1) && (NOT cond2)). But you disallow a condition NOT(cond1 && cond2). Is this right?
How about this case NOT( cond1 && (NOT cond2))? The third case is a truly nested NOT case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our goal is to not under-estimate, so ((NOT cond1) && (NOT cond2)) should be allowed, NOT(cond1 && cond2) should not be, because we don't know if cond1 or cond1 is over-estimated or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we over-estimate for a condition cond1 in calculateSingleCondition, then Not(cond1) becomes under-estimation, even if cond1 is not a compound condition.
Actually I'm thinking whether it's reasonable to differentiate between under-estimation and over-estimation. Since we assume uniform distribution, we really can't be sure if we are over-estimating or not.
E.g. for condition a=1, we estimate filter factor as 1/ndv, it can be over-estimation if 1 is a value in the "long tail", or be under-estimation if 1 is the skew value of this column.
In fact, what we do now is just using some empirical formula to compute the probability that the condition satisfies.
So I suggest that we don't care about "nested Not" or "Not", just do what we do for other compound conditions as before:
case Not(cond) =>
calculateFilterSelectivity(cond, update = false) match {
case Some(percent) => Some(1.0 - percent)
case None => None
}
What do you think? @cloud-fan @ron8hu
| rowCount = 0) | ||
| } | ||
|
|
||
| test("cint IS NOT NULL") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may add a nested NOT test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| update: Boolean): Option[Double] = { | ||
| if (!colStatsMap.contains(attr)) { | ||
| logDebug("[CBO] No statistics for " + attr) | ||
| return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't have stats, there's no need to go through the logic below.
|
Test build #73953 has finished for PR 17148 at commit
|
…d to keep column stats for empty output
|
Updated. I've made the following changes:
|
|
Test build #74002 has finished for PR 17148 at commit
|
|
Test build #74005 has finished for PR 17148 at commit
|
| // then p(And(c1, c2)) = p(c2), and p(Or(c1, c2)) = 1.0. | ||
| // But once they are wrapped in NOT condition, then after 1 - p, it becomes | ||
| // under-estimation. So in these cases, we consider them as unsupported. | ||
| case Not(And(cond1, cond2)) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cond1 is also And, and is over-estimated, then we get a problem here.
We should not handle compound Not at all, just call calculateSingleCondition for Not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code is fine. If we just call calculateSingleCondition for "case Not(And(cond1, cond2))", then it is too restrictive. The current code computes selectivity for only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, then we just return None. I think it is a clean solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an idea: move Not into the compound condition, then we can handle any combinations:
Not(And(cond1, cond2) = Or(Not(cond1), Not(cond2))
Not(Or(cond1, cond2) = And(Not(cond1), Not(cond2))
| percent = op match { | ||
| case _: LessThan => | ||
| (literalDouble - minDouble) / (maxDouble - minDouble) | ||
| if (numericLiteral == max) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some comments to explain this special case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. updated.
| } | ||
|
|
||
| val testFilters = if (swappedFilter != filterNode) { | ||
| Seq(swappedFilter, filterNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good rewrite for method validateEstimatedStats.The current code has better readability than tail recursion.
| if (p1.isDefined && p2.isDefined) { | ||
| Some(1 - (p1.get + p2.get - (p1.get * p2.get))) | ||
| } else { | ||
| None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good. We compute selectivity for "Not(Or(cond1, cond2))" only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, then we just return None. It is a clean solution.
| // then p(And(c1, c2)) = p(c2), and p(Or(c1, c2)) = 1.0. | ||
| // But once they are wrapped in NOT condition, then after 1 - p, it becomes | ||
| // under-estimation. So in these cases, we consider them as unsupported. | ||
| case Not(And(cond1, cond2)) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code is fine. If we just call calculateSingleCondition for "case Not(And(cond1, cond2))", then it is too restrictive. The current code computes selectivity for only when we can get selectivity for both conditions. If we cannot get selectivity for either one or both, then we just return None. I think it is a clean solution.
|
Test build #74067 has finished for PR 17148 at commit
|
|
thanks, merging to master! |
What changes were proposed in this pull request?
How was this patch tested?
modify related test cases.