[SPARK-6624][WIP] Draft of another alternative version of CNF normalization #10444
Conversation
Apparently, the expansion threshold can be made a configuration option.
When the threshold is exceeded, the original predicate rather than the intermediate converted predicate is returned. This is because the intermediate result may not be in CNF, thus:
- It doesn't bring much benefit for filter push-down, and
- It's much larger than the original predicate and brings extra evaluation cost.
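The blow-up the threshold guards against is easy to quantify with a toy sketch (a hypothetical `Expr` AST for illustration, not Spark's Catalyst classes): distributing Or over And multiplies clause counts, so a disjunction of n conjunction pairs expands to 2^n clauses.

```scala
sealed trait Expr
case class Atom(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Number of clauses a full CNF expansion of `e` would produce:
// And adds clause counts, Or multiplies them.
def cnfSize(e: Expr): Long = e match {
  case And(l, r) => cnfSize(l) + cnfSize(r)
  case Or(l, r)  => cnfSize(l) * cnfSize(r)
  case _         => 1L
}

// (a1 AND b1) OR (a2 AND b2) OR ... OR (an AND bn)
def blowup(n: Int): Expr =
  (1 to n).map(i => And(Atom(s"a$i"), Atom(s"b$i")): Expr).reduce(Or)
```

For example, `cnfSize(blowup(10))` is 1024, which is why a fixed clause threshold (or the configuration option suggested above) is needed before expanding.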
I disagree with 1. I don't see why it matters whether it is all CNF or none. I think the heuristic we want is something like "maximize the number of simple predicates that are in CNF form". Simple here means a predicate that references just one attribute, or a binary predicate between two. These are candidates for benefiting from further optimization.
We could try cost-basing this, or just stop the expansion after some amount.
Maximizing the number of simple predicates sounds reasonable. We may do the conversion in a depth-first manner, i.e. always convert the left branch of an And and then its right branch, until either no more predicates can be converted or we reach the size limit. In this way the intermediate result is still useful.
BTW, I searched for CNF conversion in Hive and found HIVE-9166, which also tries to put an upper limit on ORC SARG CNF conversion. @nongli Any clues about how Impala does this?
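The depth-first idea above might be sketched as follows (a toy AST and a per-subtree clause budget are assumptions for illustration, not the PR's actual code): under an And, each branch is converted independently, and a subtree whose full expansion would exceed the budget is left untouched, so the intermediate result is always a valid conjunction and stays useful for filter push-down.

```scala
sealed trait Expr
case class Atom(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Number of clauses a full CNF expansion of `e` would produce.
def cnfSize(e: Expr): Long = e match {
  case And(l, r) => cnfSize(l) + cnfSize(r)
  case Or(l, r)  => cnfSize(l) * cnfSize(r)
  case _         => 1L
}

// Full CNF expansion of `e` as a list of clauses (disjunctions).
def expand(e: Expr): Seq[Expr] = e match {
  case And(l, r) => expand(l) ++ expand(r)
  case Or(l, r)  => for (x <- expand(l); y <- expand(r)) yield Or(x, y)
  case atom      => Seq(atom)
}

// Depth-first partial conversion: recurse through And nodes so the left
// branch is converted before the right; expand a non-And subtree only if
// it fits the budget, otherwise keep it as-is. The result is always a
// well-formed conjunction, even when some subtrees were left unconverted.
def toCNFDepthFirst(e: Expr, budget: Int): Expr = e match {
  case And(l, r) =>
    And(toCNFDepthFirst(l, budget), toCNFDepthFirst(r, budget))
  case other if cnfSize(other) <= budget =>
    expand(other).reduce(And)
  case other => other // too big to expand: keep the original subtree
}
```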
Test build #48236 has finished for PR 10444 at commit
Test build #48238 has finished for PR 10444 at commit
This PR is a draft of another alternative version of CNF normalization, based on a comment in PR #8200. This PR doesn't include test cases and is only for further discussion.
In this version, CNF normalization is implemented as a separate function, `Predicate.toCNF`, which accepts an optional expansion threshold to prevent exponential explosion. The motivation behind this design is that CNF normalization itself can be useful in other use cases (e.g., eliminating common predicate factors), so it would be convenient if we could call it from anywhere without involving the optimizer.

Another consideration is that, if no expansion threshold is provided, `toCNF` should always return a predicate that is really in CNF. That's why a new `RuleExecutor` strategy, `FixedPoint.Unlimited`, is added.

A major compromise here is that we may not convert all predicates to CNF, and we can't guarantee that filters passed to data sources are in CNF, thus we may lose potential optimization opportunities in front of complicated filter predicates.
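The described threshold semantics might be sketched like this (a hypothetical toy AST; this `toCNF` illustrates the stated behaviour, not the PR's actual implementation): passing no threshold expands without limit, so the result is guaranteed to be in CNF, while an exceeded threshold falls back to the original predicate.

```scala
sealed trait Expr
case class Atom(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// maxClauses = None: no limit, result guaranteed to be in CNF.
// maxClauses = Some(n): abort past n clauses, return the original predicate.
def toCNF(e: Expr, maxClauses: Option[Int] = None): Expr = {
  // Build the CNF as a list of clauses; None signals the limit was hit.
  def clauses(e: Expr): Option[Seq[Expr]] = e match {
    case And(l, r) =>
      for {
        a <- clauses(l)
        b <- clauses(r)
        if maxClauses.forall(limit => a.size + b.size <= limit)
      } yield a ++ b
    case Or(l, r) =>
      for {
        a <- clauses(l)
        b <- clauses(r)
        // Or distributes multiplicatively over the clause lists.
        if maxClauses.forall(limit => a.size * b.size <= limit)
      } yield for (x <- a; y <- b) yield Or(x, y)
    case atom => Some(Seq(atom))
  }
  clauses(e).map(_.reduce(And)).getOrElse(e) // fall back to the original
}
```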