-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Is your feature request related to a problem or challenge?
The interval arithmetic library in DataFusion works via two fundamental methods: evaluate_bounds for bottom-up evaluation, and propagate_constraints for top-down propagation. We apply each traversal once to update bounds on columns. However, when there are complex expressions where symbols (e.g. columns) appear more than once, this will give overly pessimistic bounds, resulting in missed opportunities in data pruning and optimizations.
Describe the solution you'd like
If we use Timothy Hickey's interval narrowing approach and utilize a single update_bounds API (instead of the two APIs we have today), we can utilize a queue of nodes to only narrow intervals of nodes that keep shrinking. Simply stated, the update_bounds function is a simultaneous computation of both evaluation and propagation logics -- it updates intervals of the parent and the children nodes at the same time.
This will allow us to arrive at more precise bounds without unnecessarily traversing the expression DAG too many times.
Describe alternatives you've considered
We can apply evaluate_bounds and propagate_constraints in a loop until column bounds converge to a fixed point (to some tolerance), but that would be inefficient (albeit easy).