[SPARK-6624][WIP] Draft of another alternative version of CNF normalization #10444
Conversation
Apparently, the expansion threshold can be made a configuration option.
When the threshold is exceeded, the original predicate rather than the intermediate converted predicate is returned. This is because the intermediate result may not be in CNF, thus:
- It doesn't bring much benefit for filter push-down, and
- It's much larger than the original predicate and brings extra evaluation cost.
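The blow-up the threshold guards against is easy to quantify with a toy sketch (a hypothetical `Expr` AST for illustration, not Spark's Catalyst classes): distributing Or over And multiplies clause counts, so a disjunction of n conjunction pairs expands to 2^n clauses.

```scala
sealed trait Expr
case class Atom(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Number of clauses a full CNF expansion of `e` would produce:
// And adds clause counts, Or multiplies them.
def cnfSize(e: Expr): Long = e match {
  case And(l, r) => cnfSize(l) + cnfSize(r)
  case Or(l, r)  => cnfSize(l) * cnfSize(r)
  case _         => 1L
}

// (a1 AND b1) OR (a2 AND b2) OR ... OR (an AND bn)
def blowup(n: Int): Expr =
  (1 to n).map(i => And(Atom(s"a$i"), Atom(s"b$i")): Expr).reduce(Or)
```

For example, `cnfSize(blowup(10))` is 1024, which is why a fixed clause threshold (or the configuration option suggested above) is needed before expanding.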
I disagree with 1. I don't see why it matters whether it is all CNF or none. I think the heuristic we want is something like "maximize the number of simple predicates that are in CNF form". Simple here means a predicate that references just one attribute, or a binary predicate between two. These are candidates for benefiting from further optimization.
We could try cost-basing this, or just stop the expansion after some amount.
Maximizing the number of simple predicates sounds reasonable. We may do the conversion in a depth-first manner, i.e. always convert the left branch of an And and then its right branch, until either no more predicates can be converted or we reach the size limit. In this way the intermediate result is still useful.
BTW, I searched for CNF conversion in Hive and found HIVE-9166, which also tries to put an upper limit on ORC SARG CNF conversion. @nongli Any clues about how Impala does this?
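The depth-first idea above might be sketched as follows (a toy AST and a per-subtree clause budget are assumptions for illustration, not the PR's actual code): under an And, each branch is converted independently, and a subtree whose full expansion would exceed the budget is left untouched, so the intermediate result is always a valid conjunction and stays useful for filter push-down.

```scala
sealed trait Expr
case class Atom(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Number of clauses a full CNF expansion of `e` would produce.
def cnfSize(e: Expr): Long = e match {
  case And(l, r) => cnfSize(l) + cnfSize(r)
  case Or(l, r)  => cnfSize(l) * cnfSize(r)
  case _         => 1L
}

// Full CNF expansion of `e` as a list of clauses (disjunctions).
def expand(e: Expr): Seq[Expr] = e match {
  case And(l, r) => expand(l) ++ expand(r)
  case Or(l, r)  => for (x <- expand(l); y <- expand(r)) yield Or(x, y)
  case atom      => Seq(atom)
}

// Depth-first partial conversion: recurse through And nodes so the left
// branch is converted before the right; expand a non-And subtree only if
// it fits the budget, otherwise keep it as-is. The result is always a
// well-formed conjunction, even when some subtrees were left unconverted.
def toCNFDepthFirst(e: Expr, budget: Int): Expr = e match {
  case And(l, r) =>
    And(toCNFDepthFirst(l, budget), toCNFDepthFirst(r, budget))
  case other if cnfSize(other) <= budget =>
    expand(other).reduce(And)
  case other => other // too big to expand: keep the original subtree
}
```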
Test build #48236 has finished for PR 10444 at commit
Test build #48238 has finished for PR 10444 at commit
This PR is a draft of another alternative version of CNF normalization, based on a comment in PR #8200. This PR doesn't include test cases and is only for further discussion.
In this version, CNF normalization is implemented as a separate function, `Predicate.toCNF`, which accepts an optional expansion threshold to prevent exponential explosion. The motivation behind this design is that CNF normalization itself can be useful in other use cases (e.g., eliminating common predicate factors), so it would be convenient if we could call it from anywhere without involving the optimizer.

Another consideration is that, if no expansion threshold is provided, `toCNF` should always return a predicate that is really in CNF. That's why a new `RuleExecutor` strategy, `FixedPoint.Unlimited`, is added.

A major compromise here is that we may not convert all predicates to CNF, and we can't guarantee that filters passed to data sources are in CNF, thus we may lose potential optimization opportunities in front of complicated filter predicates.
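The described threshold semantics might be sketched like this (a hypothetical toy AST; this `toCNF` illustrates the stated behaviour, not the PR's actual implementation): passing no threshold expands without limit, so the result is guaranteed to be in CNF, while an exceeded threshold falls back to the original predicate.

```scala
sealed trait Expr
case class Atom(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// maxClauses = None: no limit, result guaranteed to be in CNF.
// maxClauses = Some(n): abort past n clauses, return the original predicate.
def toCNF(e: Expr, maxClauses: Option[Int] = None): Expr = {
  // Build the CNF as a list of clauses; None signals the limit was hit.
  def clauses(e: Expr): Option[Seq[Expr]] = e match {
    case And(l, r) =>
      for {
        a <- clauses(l)
        b <- clauses(r)
        if maxClauses.forall(limit => a.size + b.size <= limit)
      } yield a ++ b
    case Or(l, r) =>
      for {
        a <- clauses(l)
        b <- clauses(r)
        // Or distributes multiplicatively over the clause lists.
        if maxClauses.forall(limit => a.size * b.size <= limit)
      } yield for (x <- a; y <- b) yield Or(x, y)
    case atom => Some(Seq(atom))
  }
  clauses(e).map(_.reduce(And)).getOrElse(e) // fall back to the original
}
```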