-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Describe the bug
Rule push_down_projection negates the optimization results of common_sub_expression_eliminate, but leaves the useless aliases introduced by common_sub_expression_eliminate. Those added aliases change the signature of the logical plan, also causing the optimizer to never reach the fixed point.
In summary, there are two problems:
- The optimization of
common_sub_expression_eliminationdid not work. - Ineffective optimization leads to the optimizer not being able to exit early until it reaches the limit of
datafusion.optimizer.max_passes.
To Reproduce
Run explain verbose select a/2, a/2 + 1 from t in CLI
DataFusion CLI v33.0.0
❯ create table t(a bigint);
0 rows in set. Query took 0.006 seconds.
❯ explain verbose select a/2, a/2 + 1 from t;
| logical_plan after common_sub_expression_eliminate | Projection: t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2), t.a / Int64(2)Int64(2)t.a AS t.a / Int64(2) + Int64(1) |
| | Projection: t.a / Int64(2) AS t.a / Int64(2)Int64(2)t.a, t.a |
| | TableScan: t
| ...
|
| logical_plan after push_down_projection | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1) |
| | TableScan: t projection=[a]
❯ explain select a/2, a/2 + 1 from t;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1) |
| | TableScan: t projection=[a] |
| physical_plan | ProjectionExec: expr=[a@0 / 2 as t.a / Int64(2), a@0 / 2 + 1 as t.a / Int64(2) + Int64(1)] |
| | MemoryExec: partitions=1, partition_sizes=[0] |
| | |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+common_sub_expression_eliminate generates a new child projection which includes common expressions under the original projection, and push_down_projection merges them into one.
The final projection is:
a/2: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2)a/2+1: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1)
Three duplicate aliases appeared, corresponding to datafusion.optimizer.max_passes=3.
If I set datafusion.optimizer.max_passes=10
❯ set datafusion.optimizer.max_passes=10;
0 rows in set. Query took 0.002 seconds.
❯ explain select a/2, a/2 + 1 from t;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2), t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) + Int64(1) |
| | TableScan: t projection=[a] |
| physical_plan | ProjectionExec: expr=[a@0 / 2 as t.a / Int64(2), a@0 / 2 + 1 as t.a / Int64(2) + Int64(1)] |
| | MemoryExec: partitions=1, partition_sizes=[0] |
| | |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.025 seconds.One of the expr in the final projection will be t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2) AS t.a / Int64(2).
There are too many unnecessary aliases.
Expected behavior
The logic merge_projection inside the rule push_down_projection should not undo common_sub_expression_eliminate.
Additional context
I plan to work on this in the next few days.
My initial thought is to fix it in the push_down_projection side.
Do not execute merge_projection if an expression in the child projection satisfies the following conditions:
- It has been referenced by the parent projection two or more times. Keep the optimization effect of
common_sub_expression_eliminate. - Its evaluation is non-trivial. Its type is not
ColumnorLiteral.