-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-19408][SQL] filter estimation on two columns of same table #17415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok to test |
|
cc @sameeragarwal @cloud-fan @gatorsmile This Jira is not on Spark 2.2 blocker list. If time permits, we can include it in Spark 2.2. If not, we can wait for a maintenance release. Thanks. |
|
Test build #75186 has finished for PR 17415 at commit
|
|
retest this please |
|
Test build #75223 has finished for PR 17415 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for EqualTo, there is no complete overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just revised the code to handle EqualNullSafe separately from EqualTo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the new ndv only look at ndvLeft?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why comment it out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I should remove the line that has been commented out. This line is replaced by the following code:
if (rowCountValue != 0) {
// Need to check attributeStats one by one because we may have multiple output columns.
// Due to update operation, the output columns may be in different order.
expectedColStats.foreach { kv =>
val filterColumnStat = filterStats.attributeStats.get(kv._1).get
assert(filterColumnStat == kv._2)
}
|
It sounds like we have not supported a very common constant filter. Let me take a quick fix on that. : ) Our optimizer rule |
d6a53ef to
9830a8f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use Equality above, but we did not handle EqualNullSafe. That will cause a strange case mismatch error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullCount might not be simply set to zero if we also support EqualNullSafe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use white list here? It is also easy for us to see which data types are assumed to support in the implementation.
I am afraid we might easily forget updating this if we support new data type in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code is written in such a way that we do not have too deep indentation. Some engineers do not like deep indentation as they often put screen monitor vertically.
Let's handle it when the need occurs. I think, with good test case coverage, we will be able to catch anything we miss.
|
Test build #75288 has finished for PR 17415 at commit
|
|
Test build #75284 has finished for PR 17415 at commit
|
9830a8f to
7abed99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a given column -> two given columns. Both two columns' ColumnStat are updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(maxLeft <= minRight, minLeft > maxRight)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(maxLeft < minRight, minLeft >= maxRight)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minLeft >= maxRight, maxLeft < minRight)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minLeft > maxRight, maxLeft <= minRight)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
|
Test build #75335 has finished for PR 17415 at commit
|
64bf43e to
70ac70c
Compare
|
Test build #75367 has finished for PR 17415 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra space.
70ac70c to
9b98ff1
Compare
|
Test build #75369 has finished for PR 17415 at commit
|
|
Test build #75375 has finished for PR 17415 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once no overlap, is it still meaningful to keep min, max?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need one more condition: minLeft == minRight, like @gatorsmile suggested #17415 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You said "we need one more condition: minLeft == minRight". Note that this condition is already included.
@gatorsmile was suggesting "(minRight == maxRight) && (minLeft == minRight) && (maxLeft == maxRight)". This implies all 4 values (minLeft, maxLeft, minRight, maxRight) are equal. This is not what I mean by complete overlap. I initially defined "complete overlap" as "complete range overlap". For example, we have a test case: test("cint = cint4"). If we use @gatorsmile's definition, then the case "cint = cint4" will become partial overlap with selectivity 0.33, which will under-estimate the selectivity. In order to avoid out-of-memory error, I prefer over-estimating rather than under-estimating.
Also it should be noted that "complete range overlap" should cover "complete point overlap".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Estimation is always hard to be accurate. That is why user-provided hints are very useful for getting the right plan.
I did a search. There are many different estimators. Uniform estimators, length estimator, digram estimator, minimum variance estimators, and histogram estimators. Is that possible we can consider the data distributions when deciding the selectivity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, estimation is always hard to be accurate. We may consider supporting hints in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also compare the NDV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We changed condition to:
(minLeft == minRight) && (maxLeft == maxRight) && allNotNull
&& (colStatLeft.distinctCount == colStatRight.distinctCount)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that the when this condition is true, it is a complete overlapping between two columns.
The complete equality between the values of two columns also depends on the order. E.g., when left values are (1, 2, 3, 4), right values are (4, 3, 2, 1), the condition is true, but no values can pass the filter predicate left_col = right_col.
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is empirical. Without more statistics, it's really hard to do it mathematically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We prefer over estimation to under estimation in order to avoid out-of-memory error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. LGTM.
|
Test build #75470 has finished for PR 17415 at commit
|
|
Test build #75472 has started for PR 17415 at commit |
|
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& (colStatLeft.distinctCount == colStatRight.distinctCount)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indention is wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only checks the expectedColStats is a sub-set of filterStats.attributeStats, shall we also check the size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. fixed.
|
LGTM except some minor comments |
|
Test build #75477 has finished for PR 17415 at commit
|
|
Test build #75484 has finished for PR 17415 at commit
|
|
LGTM |
1 similar comment
|
LGTM |
|
Thanks, Merging to master. |
What changes were proposed in this pull request?
In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.
This PR estimates filter selectivity on two columns of same table. For example, multiple tpc-h queries have this predicate "WHERE l_commitdate < l_receiptdate"
How was this patch tested?
We added 6 new test cases to test various logical predicates involving two columns of same table.
Please review http://spark.apache.org/contributing.html before opening a pull request.