-
Notifications
You must be signed in to change notification settings - Fork 0
POC Rewrite bloom filter pruning predicate logic in terms of intervals #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
Outdated
Show resolved
Hide resolved
| fn prune(&self, column_sbbf: &HashMap<String, Sbbf>) -> bool { | ||
| // rewrite any <col = literal> exprs to to `false` if we can prove they | ||
| // are always false using the bloom filter. | ||
| let rewritten = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite pleased with this logic. What it does is to rewrite the predicate String@0 = Hello_Not_Exists AND 1 = 1 is rewritten to false AND false and the interval analysis determines that expression is always false:
Evaluating. self.predicate_expr: String@0 = Hello_Not_Exists AND 1 = 1
rewritten false AND 1 = 1
interval result: [false, false]
lower: IntervalBound { value: Boolean(false), open: false }
upper: IntervalBound { value: Boolean(false), open: false }
However it seems I need to update the interval analysis so that OR is handled as OR gives a "unknown" for the bounds:
Evaluating. self.predicate_expr: String@0 = Hello_Not_Exists OR String@0 = Hello_Not_Exists2
rewritten false OR false
interval result: (NULL, NULL)
lower: IntervalBound { value: NULL, open: true }
upper: IntervalBound { value: NULL, open: true }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added support for OR here: apache#7884
|
I agree w/ you that pruning analysis and interval arithmetic should be implemented ONCE, ideally using an interface that both arrow and datafusion can use. |
I filed apache#7887 to track this idea |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
This is related to apache#7821 and the hope of a high level unification of PruningPredicate and the interval arithmetic (I thought there was a ticket, but I can't find it)
Rationale for this change
Basically DataFusion has two types of range analysis -- the
Intervalbased analysis used in selectivity analysis and thePruningPrediatebased rewrite used for pruning row groups in parquet.apache#7821 adds a
BloomFilterPredicatein the style ofPruningPredicate. However, this only supports binary expressions (col = <const>) andAND/ORand is entirely separate from statistics based pruning. This means that predicates likecol IN (...)andIS NOT NULLmay not be completely supported.I am pretty sure I could express the same analysis in terms of
Intervals, which:What changes are included in this PR?
Are these changes tested?
Yes, existing tests
Are there any user-facing changes?
Not really.