Skip to content

Conversation

@ron8hu
Copy link
Contributor

@ron8hu ron8hu commented Mar 24, 2017

What changes were proposed in this pull request?

In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.

This PR estimates filter selectivity on two columns of same table. For example, multiple tpc-h queries have this predicate "WHERE l_commitdate < l_receiptdate"

How was this patch tested?

We added 6 new test cases to test various logical predicates involving two columns of same table.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@gatorsmile
Copy link
Member

ok to test

@ron8hu
Copy link
Contributor Author

ron8hu commented Mar 25, 2017

cc @sameeragarwal @cloud-fan @gatorsmile This Jira is not on Spark 2.2 blocker list. If time permits, we can include it in Spark 2.2. If not, we can wait for a maintenance release. Thanks.

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75186 has finished for PR 17415 at commit 8930669.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75223 has finished for PR 17415 at commit 8930669.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for EqualTo, there is no complete overlap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just revised the code to handle EqualNullSafe separately from EqualTo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the new ndv only look at ndvLeft?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why comment it out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I should remove the line that has been commented out. This line is replaced by the following code:
if (rowCountValue != 0) {
// Need to check attributeStats one by one because we may have multiple output columns.
// Due to update operation, the output columns may be in different order.
expectedColStats.foreach { kv =>
val filterColumnStat = filterStats.attributeStats.get(kv._1).get
assert(filterColumnStat == kv._2)
}

@gatorsmile
Copy link
Member

gatorsmile commented Mar 27, 2017

It sounds like we have not supported a very common constant filter. Let me take a quick fix on that. : ) Our optimizer rule PruneFilters is not advanced enough to remove all the literal filters.

@ron8hu ron8hu force-pushed the filterTwoColumns branch from d6a53ef to 9830a8f Compare March 28, 2017 00:39
Copy link
Member

@gatorsmile gatorsmile Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use Equality above, but we did not handle EqualNullSafe. That will cause a strange case mismatch error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullCount might not be simply set to zero if we also support EqualNullSafe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use white list here? It is also easy for us to see which data types are assumed to support in the implementation.

I am afraid we might easily forget updating this if we support new data type in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code is written in such a way that we do not have too deep indentation. Some engineers do not like deep indentation as they often put screen monitor vertically.
Let's handle it when the need occurs. I think, with good test case coverage, we will be able to catch anything we miss.

@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #75288 has finished for PR 17415 at commit 9830a8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #75284 has finished for PR 17415 at commit d6a53ef.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@ron8hu ron8hu force-pushed the filterTwoColumns branch from 9830a8f to 7abed99 Compare March 29, 2017 02:53
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a given column -> two given columns. Both two columns' ColumnStat are updated.

Copy link
Member

@viirya viirya Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(maxLeft <= minRight, minLeft > maxRight)?

Copy link
Contributor Author

@ron8hu ron8hu Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(maxLeft < minRight, minLeft >= maxRight)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minLeft >= maxRight, maxLeft < minRight)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

@ron8hu ron8hu Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minLeft > maxRight, maxLeft <= minRight)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed.

@SparkQA
Copy link

SparkQA commented Mar 29, 2017

Test build #75335 has finished for PR 17415 at commit 7abed99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ron8hu ron8hu force-pushed the filterTwoColumns branch 2 times, most recently from 64bf43e to 70ac70c Compare March 29, 2017 21:34
@SparkQA
Copy link

SparkQA commented Mar 29, 2017

Test build #75367 has finished for PR 17415 at commit 64bf43e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space.

@ron8hu ron8hu force-pushed the filterTwoColumns branch from 70ac70c to 9b98ff1 Compare March 29, 2017 23:43
@SparkQA
Copy link

SparkQA commented Mar 29, 2017

Test build #75369 has finished for PR 17415 at commit 70ac70c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 30, 2017

Test build #75375 has finished for PR 17415 at commit 9b98ff1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once no overlap, is it still meaningful to keep min, max?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need one more condition: minLeft == minRight, like @gatorsmile suggested #17415 (comment)

Copy link
Contributor Author

@ron8hu ron8hu Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said "we need one more condition: minLeft == minRight". Note that this condition is already included.

@gatorsmile was suggesting "(minRight == maxRight) && (minLeft == minRight) && (maxLeft == maxRight)". This implies all 4 values (minLeft, maxLeft, minRight, maxRight) are equal. This is not what I mean by complete overlap. I initially defined "complete overlap" as "complete range overlap". For example, we have a test case: test("cint = cint4"). If we use @gatorsmile's definition, then the case "cint = cint4" will become partial overlap with selectivity 0.33, which will under-estimate the selectivity. In order to avoid out-of-memory error, I prefer over-estimating rather than under-estimating.

Also it should be noted that "complete range overlap" should cover "complete point overlap".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Estimation is always hard to be accurate. That is why user-provided hints are very useful for getting the right plan.

I did a search. There are many different estimators. Uniform estimators, length estimator, digram estimator, minimum variance estimators, and histogram estimators. Is that possible we can consider the data distributions when deciding the selectivity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, estimation is always hard to be accurate. We may consider supporting hints in the future.

@ron8hu ron8hu force-pushed the filterTwoColumns branch from bf440db to 4f0b68f Compare April 2, 2017 23:30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also compare the NDV?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We changed condition to:
(minLeft == minRight) && (maxLeft == maxRight) && allNotNull
&& (colStatLeft.distinctCount == colStatRight.distinctCount)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that the when this condition is true, it is a complete overlapping between two columns.

The complete equality between the values of two columns also depends on the order. E.g., when left values are (1, 2, 3, 4), right values are (4, 3, 2, 1), the condition is true, but no values can pass the filter predicate left_col = right_col.

Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is empirical. Without more statistics, it's really hard to do it mathematically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. We prefer over estimation to under estimation in order to avoid out-of-memory error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. LGTM.

@SparkQA
Copy link

SparkQA commented Apr 3, 2017

Test build #75470 has finished for PR 17415 at commit 4f0b68f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 3, 2017

Test build #75472 has started for PR 17415 at commit 5a02705.

@cloud-fan
Copy link
Contributor

retest this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&& (colStatLeft.distinctCount == colStatRight.distinctCount)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indention is wrong here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only checks the expectedColStats is a sub-set of filterStats.attributeStats, shall we also check the size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. fixed.

@cloud-fan
Copy link
Contributor

LGTM except some minor comments

@SparkQA
Copy link

SparkQA commented Apr 3, 2017

Test build #75477 has finished for PR 17415 at commit 5a02705.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ron8hu ron8hu force-pushed the filterTwoColumns branch from 5a02705 to 3f3d30d Compare April 3, 2017 17:41
@SparkQA
Copy link

SparkQA commented Apr 3, 2017

Test build #75484 has finished for PR 17415 at commit 3f3d30d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

1 similar comment
@viirya
Copy link
Member

viirya commented Apr 4, 2017

LGTM

@gatorsmile
Copy link
Member

Thanks, Merging to master.

@asfgit asfgit closed this in e7877fd Apr 4, 2017
@ron8hu ron8hu deleted the filterTwoColumns branch April 4, 2017 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants