[SPARK-19408][SQL] filter estimation on two columns of same table #17415

ron8hu · 2017-03-24T20:54:09Z

What changes were proposed in this pull request?

In SQL queries, we also see predicate expressions involving two columns, such as "column-1 (op) column-2", where column-1 and column-2 belong to the same table. Note that if column-1 and column-2 belong to different tables, the predicate is a join operator's work, NOT a filter operator's work.

This PR estimates filter selectivity on two columns of the same table. For example, multiple TPC-H queries contain the predicate "WHERE l_commitdate < l_receiptdate".

How was this patch tested?

We added 6 new test cases to test various logical predicates involving two columns of the same table.

gatorsmile · 2017-03-24T23:21:08Z

ok to test

ron8hu · 2017-03-25T00:17:05Z

cc @sameeragarwal @cloud-fan @gatorsmile This Jira is not on Spark 2.2 blocker list. If time permits, we can include it in Spark 2.2. If not, we can wait for a maintenance release. Thanks.

SparkQA · 2017-03-25T00:46:33Z

Test build #75186 has finished for PR 17415 at commit 8930669.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-25T18:50:59Z

retest this please

SparkQA · 2017-03-25T20:59:58Z

Test build #75223 has finished for PR 17415 at commit 8930669.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-03-27T06:48:40Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

I think for EqualTo, there is no complete overlap.

I just revised the code to handle EqualNullSafe separately from EqualTo.

cloud-fan · 2017-03-27T06:51:30Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

Why does the new ndv only look at ndvLeft?

Good point. Fixed.

cloud-fan · 2017-03-27T06:53:13Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

why comment it out?

My bad. I should remove the line that has been commented out. It is replaced by the following code:
if (rowCountValue != 0) {
  // Need to check attributeStats one by one because we may have multiple output columns.
  // Due to update operation, the output columns may be in different order.
  expectedColStats.foreach { kv =>
    val filterColumnStat = filterStats.attributeStats.get(kv._1).get
    assert(filterColumnStat == kv._2)
  }
}

gatorsmile · 2017-03-27T23:36:59Z

It sounds like we have not supported a very common constant filter. Let me take a quick fix on that. : ) Our optimizer rule PruneFilters is not advanced enough to remove all the literal filters.

gatorsmile · 2017-03-28T01:00:08Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

We use Equality above, but we did not handle EqualNullSafe. That will cause a strange case mismatch error.

Good catch. Fixed.

gatorsmile · 2017-03-28T01:02:53Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

nullCount might not be simply set to zero if we also support EqualNullSafe

gatorsmile · 2017-03-28T01:09:25Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

Could we use a whitelist here? That would also make it easy to see which data types the implementation is assumed to support.

I am afraid we might easily forget to update this if we support a new data type in the future.

The current code is written in such a way that we do not have too deep indentation. Some engineers do not like deep indentation because they often orient their monitors vertically.
Let's handle it when the need arises. I think, with good test case coverage, we will be able to catch anything we miss.

SparkQA · 2017-03-28T02:50:04Z

Test build #75288 has finished for PR 17415 at commit 9830a8f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-28T02:53:23Z

Test build #75284 has finished for PR 17415 at commit d6a53ef.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

viirya · 2017-03-29T03:08:36Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

a given column -> two given columns. Both columns' ColumnStat are updated.

viirya · 2017-03-29T03:23:17Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

(maxLeft <= minRight, minLeft > maxRight)?

Good catch. fixed.

viirya · 2017-03-29T03:23:19Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

(maxLeft < minRight, minLeft >= maxRight)?

Good catch. Fixed.

viirya · 2017-03-29T03:24:40Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

(minLeft >= maxRight, maxLeft < minRight)?

Good catch. Fixed.

viirya · 2017-03-29T03:25:32Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

(minLeft > maxRight, maxLeft <= minRight)?

Good catch. Fixed.
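The four boundary-condition pairs suggested above can be read as (noOverlap, completeOverlap) for the four inequality operators. The following is only a minimal sketch of that reading, not the code in FilterEstimation.scala: the `Op` hierarchy and the `classify` function are hypothetical stand-ins, the mapping of each pair to an operator is inferred from the logic of the pair, and null handling is ignored.

```scala
object RangeOverlapSketch {
  // Hypothetical stand-ins for the comparison expressions; not Catalyst classes.
  sealed trait Op
  case object Lt extends Op   // left <  right
  case object LtEq extends Op // left <= right
  case object Gt extends Op   // left >  right
  case object GtEq extends Op // left >= right

  /**
   * Returns (noOverlap, completeOverlap) for the predicate `left op right`.
   * noOverlap: the min/max ranges imply no row can satisfy the predicate.
   * completeOverlap: the ranges imply every non-null row satisfies it.
   */
  def classify(
      op: Op,
      minLeft: Double,
      maxLeft: Double,
      minRight: Double,
      maxRight: Double): (Boolean, Boolean) = op match {
    case Lt   => (minLeft >= maxRight, maxLeft < minRight)
    case LtEq => (minLeft > maxRight, maxLeft <= minRight)
    case Gt   => (maxLeft <= minRight, minLeft > maxRight)
    case GtEq => (maxLeft < minRight, minLeft >= maxRight)
  }
}
```

The strict and non-strict variants differ only at the point where the two ranges touch: with left in [1, 5] and right in [5, 9], `left <= right` always holds, while `left < right` is only a partial overlap.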

SparkQA · 2017-03-29T05:06:17Z

Test build #75335 has finished for PR 17415 at commit 7abed99.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-29T22:16:42Z

Test build #75367 has finished for PR 17415 at commit 64bf43e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-03-29T23:17:55Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

nit: extra space.

SparkQA · 2017-03-29T23:54:24Z

Test build #75369 has finished for PR 17415 at commit 70ac70c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-30T02:01:16Z

Test build #75375 has finished for PR 17415 at commit 9b98ff1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-03-30T02:44:26Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

When there is no overlap, is it still meaningful to keep min and max?

cloud-fan · 2017-04-02T01:47:03Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

we need one more condition: minLeft == minRight, like @gatorsmile suggested #17415 (comment)

You said "we need one more condition: minLeft == minRight". Note that this condition is already included.

@gatorsmile was suggesting "(minRight == maxRight) && (minLeft == minRight) && (maxLeft == maxRight)". This implies all 4 values (minLeft, maxLeft, minRight, maxRight) are equal. This is not what I mean by complete overlap. I initially defined "complete overlap" as "complete range overlap". For example, we have a test case: test("cint = cint4"). If we use @gatorsmile's definition, then the case "cint = cint4" will become partial overlap with selectivity 0.33, which will under-estimate the selectivity. In order to avoid out-of-memory error, I prefer over-estimating rather than under-estimating.

Also it should be noted that "complete range overlap" should cover "complete point overlap".

Estimation is always hard to be accurate. That is why user-provided hints are very useful for getting the right plan.

I did a search. There are many different estimators: uniform estimators, length estimators, digram estimators, minimum variance estimators, and histogram estimators. Is it possible for us to consider the data distribution when deciding the selectivity?

Yes, estimation is always hard to be accurate. We may consider supporting hints in the future.
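The trade-off discussed in this thread can be sketched as a small function. The 0.33 figure for partial overlap comes from the discussion above; treating no overlap as 0.0 and complete overlap as 1.0, and rounding the row estimate up, are assumptions made for illustration. The names `selectivity` and `estimatedRows` are hypothetical and do not come from the patch.

```scala
object SelectivitySketch {
  /** Maps the overlap classification to a selectivity in [0, 1]. */
  def selectivity(noOverlap: Boolean, completeOverlap: Boolean): Double =
    if (noOverlap) 0.0           // assumed: no row can pass the filter
    else if (completeOverlap) 1.0 // assumed: every row passes the filter
    else 1.0 / 3.0               // partial overlap, the 0.33 mentioned above

  /** Estimated output rows, rounded up so the estimate errs on the high side. */
  def estimatedRows(rowCount: Long, sel: Double): Long =
    math.ceil(rowCount * sel).toLong
}
```

With 10 input rows, a partial overlap yields an estimate of 4 rows, whereas classifying the same predicate as a complete overlap yields 10. This is why the stricter definition of complete overlap lowers the estimate for a case like "cint = cint4".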

gatorsmile · 2017-04-03T00:00:44Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

Also compare the NDV?

Good point. We changed condition to:
(minLeft == minRight) && (maxLeft == maxRight) && allNotNull
&& (colStatLeft.distinctCount == colStatRight.distinctCount)

I doubt that, when this condition is true, there is a complete overlap between the two columns.

The complete equality between the values of two columns also depends on the order. E.g., when left values are (1, 2, 3, 4), right values are (4, 3, 2, 1), the condition is true, but no values can pass the filter predicate left_col = right_col.

Am I missing something?

This is empirical. Without more statistics, it's really hard to do it mathematically.

Agreed. We prefer over-estimation to under-estimation in order to avoid out-of-memory errors.
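The counterexample raised in this thread is easy to reproduce. The sketch below uses plain Scala collections rather than Spark column statistics, and the object and value names are hypothetical. It shows that min, max, and distinct count can all match while no row satisfies left_col = right_col.

```scala
object OverlapCounterexample {
  // Values from the review comment: same set of values, opposite row order.
  val left: Seq[Int] = Seq(1, 2, 3, 4)
  val right: Seq[Int] = Seq(4, 3, 2, 1)

  // The column-level statistics compared by the complete-overlap condition.
  val sameStats: Boolean =
    left.min == right.min &&
      left.max == right.max &&
      left.distinct.size == right.distinct.size

  // Row-by-row evaluation of left_col = right_col.
  val matchingRows: Int = left.zip(right).count { case (l, r) => l == r }
}
```

Here `sameStats` is true while `matchingRows` is 0, so a heuristic based only on these statistics would over-estimate this filter. That is the empirical trade-off accepted in the replies above.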

SparkQA · 2017-04-03T01:47:01Z

Test build #75470 has finished for PR 17415 at commit 4f0b68f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-03T05:02:33Z

Test build #75472 has started for PR 17415 at commit 5a02705.

cloud-fan · 2017-04-03T09:03:08Z

retest this please

cloud-fan · 2017-04-03T09:05:34Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

&& (colStatLeft.distinctCount == colStatRight.distinctCount)

cloud-fan · 2017-04-03T09:10:38Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

The indentation is wrong here.

cloud-fan · 2017-04-03T09:15:07Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

This only checks that expectedColStats is a subset of filterStats.attributeStats; shall we also check the size?

Good point. fixed.
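The difference between the two checks can be illustrated with plain maps. This is a simplified sketch: column statistics are reduced to a hypothetical column-name-to-distinct-count map, and `subsetOnly` and `subsetAndSize` are illustrative names, not helpers from FilterEstimationSuite.

```scala
object StatsCheckSketch {
  // Hypothetical simplification: column name -> distinct count.
  type Stats = Map[String, Int]

  /** Passes when every expected entry appears in actual with the same value. */
  def subsetOnly(expected: Stats, actual: Stats): Boolean =
    expected.forall { case (col, stat) => actual.get(col).contains(stat) }

  /** Also requires the same number of entries, so extra columns are caught. */
  def subsetAndSize(expected: Stats, actual: Stats): Boolean =
    expected.size == actual.size && subsetOnly(expected, actual)
}
```

For example, with expected `Map("cint" -> 10)` and actual `Map("cint" -> 10, "cint2" -> 7)`, the subset check passes even though the filter produced statistics for an unexpected column; the combined check fails as intended.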

cloud-fan · 2017-04-03T09:26:31Z

LGTM except some minor comments

SparkQA · 2017-04-03T14:32:28Z

Test build #75477 has finished for PR 17415 at commit 5a02705.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-03T19:55:34Z

Test build #75484 has finished for PR 17415 at commit 3f3d30d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-04-03T20:59:09Z

LGTM

viirya · 2017-04-04T00:25:02Z

LGTM

gatorsmile · 2017-04-04T00:27:35Z

Thanks, Merging to master.

cloud-fan reviewed Mar 27, 2017

ron8hu force-pushed the filterTwoColumns branch from d6a53ef to 9830a8f Compare March 28, 2017 00:39

gatorsmile reviewed Mar 28, 2017

ron8hu force-pushed the filterTwoColumns branch from 9830a8f to 7abed99 Compare March 29, 2017 02:53

viirya reviewed Mar 29, 2017

ron8hu force-pushed the filterTwoColumns branch 2 times, most recently from 64bf43e to 70ac70c Compare March 29, 2017 21:34

viirya reviewed Mar 29, 2017

ron8hu force-pushed the filterTwoColumns branch from 70ac70c to 9b98ff1 Compare March 29, 2017 23:43

viirya reviewed Mar 30, 2017

cloud-fan reviewed Apr 2, 2017

ron8hu force-pushed the filterTwoColumns branch from bf440db to 4f0b68f Compare April 2, 2017 23:30

gatorsmile reviewed Apr 3, 2017

cloud-fan reviewed Apr 3, 2017

ron8hu and others added 11 commits April 3, 2017 10:37

filter estimation on two columns of same table

7980cd1

revise ndv for both tables in predicate

150108e

handle EqualNullSafe separately

9bfd35d

revise boundary conditions in evaluating range overlap

426c8f3

handled EqualNullSafe if stats update is needed

8faf790

set expectedColStats to Nil when there is no overlap

e2699e4

added a couple of complete-overlap test cases

b760bf8

need to consider null for complete overlap case

64d796e

use allNotNull Boolean condition

8bc1be2

added condition to check distinctCount for EqualTo operator

881096c

check if expectedColStats.size == filterStats.attributeStats.size

3f3d30d

ron8hu force-pushed the filterTwoColumns branch from 5a02705 to 3f3d30d Compare April 3, 2017 17:41

asfgit closed this in e7877fd Apr 4, 2017

ron8hu deleted the filterTwoColumns branch April 4, 2017 00:43
