Skip to content

Conversation

@liancheng
Copy link
Contributor

This PR is a draft version of another alternative of CNF normalization based on comment in PR #8200. This PR doesn't include test cases, and is only for further discussion.

In this version, CNF normalization is implemented as a separate function Predicate.toCNF, which accepts an optional expansion threshold to prevent exponential explosion. The motivation behind this design is that, CNF normalization itself can be useful in other use cases (e.g., eliminating common predicate factors). It would be convenient if we can call it from anywhere without involving the optimizer.

Another consideration is that, if no expansion threshold is provided, toCNF should always return a predicate that is really in CNF. That's why a new RuleExecutor strategy FixedPoint.Unlimited is added.

A major compromise here is that, we may not convert all predicates to CNF, and can't guarantee that filters passed to data sources are in CNF, thus we may lose potential optimization opportunities in front of complicated filter predicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, the expansion threshold can be made a configuration option.

@liancheng liancheng changed the title [SPARK-6624][WIP] Another alternative version of CNF normalization [SPARK-6624][WIP] Draft of another alternative version of CNF normalization Dec 23, 2015
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the threshold is exceeded, the original predicate rather than the intermediate converted predicate is returned. This because the intermediate result may not be in CNF, thus:

  1. It doesn't bring much benefit for filter push-down, and
  2. It's much larger than the original predicate and brings extra evaluation cost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with 1. I don't see why it matters if it is all CNF or none. I think the heuristic we want is something like "maximize the number of simple predicates that are in CNF form". Simple here means contains just 1 attribute or binary predicate between two. These are candidates for benefiting from further optimization.

We could try cost basing this or just stopping the expansion after some amount.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maximizing the number of simple predicates sounds reasonable. We may do the conversion in a depth-first manner, i.e. always convert the left branch of an And and then its right branch, until either no more predicates can be converted or we reach the size limit. In this way the intermediate result is still useful.

BTW, searched for CNF conversion in Hive and found HIVE-9166, which also tries to put an upper limit for ORC SARG CNF conversion. @nongli Any clues about how Impala does this?

@liancheng
Copy link
Contributor Author

cc @nongli @rxin @marmbrus @yjshen

@SparkQA
Copy link

SparkQA commented Dec 23, 2015

Test build #48236 has finished for PR 10444 at commit 031278f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 23, 2015

Test build #48238 has finished for PR 10444 at commit 6e2c052.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants