Skip to content

Conversation

@haohuaijin
Copy link
Contributor

Which issue does this PR close?

Closes #8735

Rationale for this change

when debug #8735, I find not only the case in #8735 don't push down, we have other case don't push down
for example

DataFusion CLI v34.0.0
❯ create table t(a int, b int, c int);
0 rows in set. Query took 0.014 seconds.

❯ explain select -a from t;
+---------------+-------------------------------------------------+
| plan_type     | plan                                            |
+---------------+-------------------------------------------------+
| logical_plan  | Projection: (- t.a)                             |
|               |   TableScan: t projection=[a, b, c]             |
| physical_plan | ProjectionExec: expr=[(- a@0) as (- t.a)]       |
|               |   MemoryExec: partitions=1, partition_sizes=[0] |
|               |                                                 |
+---------------+-------------------------------------------------+
2 rows in set. Query took 0.016 seconds.

What changes are included in this PR?

add more case to outer_columns_helper.

Are these changes tested?

yes, add more tests

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 6, 2024

query RRR
select min(f5), max(f5), avg(f5) from m2 where tag_id = '1000' and time < '2024-01-03T14:46:35+01:00' group by type;
select min(f5), max(f5), avg(f5) from m2 where tag_id = '1000' and time < '2024-01-03T14:46:35+01:00' group by type order by min(f5);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just make result stable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different fix is here: #8769

@haohuaijin
Copy link
Contributor Author

haohuaijin commented Jan 6, 2024

the root reason maybe in below code
https://github.com/apache/arrow-datafusion/blob/4e4059a68455fbc14f04902c76acbcd258b7f2ef/datafusion/optimizer/src/optimize_projections.rs#L771-L775

I'm not sure why we handle it this way. I tried commenting out this line, and all the tests still passed(the only test don't pass is not related to this, and I comment above in dictionary.slt.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @haohuaijin -- given this PR fixes the bug and includes good test coverage ❤️ I think we could merge it as is.

However, I had some suggestions to improve the code even more -- do you think we should try to do those as part of this PR or maybe as a follow on PR?

| Expr::IsNull(expr)
| Expr::Negative(expr) => outer_columns_helper(expr, columns),
Expr::Column(_) | Expr::Literal(_) | Expr::Wildcard { .. } => true,
_ => false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about removing this catchall (and and thus explicitly enumerating all Expr types so this type of bug can't happen again)?

Suggested change
_ => false,

An alternate idea would be to remove this explicit walk down the tree entirely and use https://docs.rs/datafusion/latest/datafusion/logical_expr/utils/fn.expr_to_columns.html instead?

This pushdown code may predate the more generaic tree walks

Copy link
Contributor Author

@haohuaijin haohuaijin Jan 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Already removed the catchall.

An alternate idea would be to remove this explicit walk down the tree entirely and use https://docs.rs/datafusion/latest/datafusion/logical_expr/utils/fn.expr_to_columns.html instead?

the outer_columns_helper function is used to collect all outer_ref_column that expr_to_columns function doesn't return these columns. Because expr_to_columns use TreeNode to walks the expr tree, but TreeNode don't walk the outer_ref_column
https://github.com/apache/arrow-datafusion/blob/dd4263f843e093c807d63edf73a571b1ba2669b5/datafusion/expr/src/tree_node/expr.rs#L69-L77

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the outer_columns_helper function is used to collect all outer_ref_column that expr_to_columns function doesn't return these columns. Because expr_to_columns use TreeNode to walks the expr tree, but TreeNode don't walk the outer_ref_column

I see -- thank you for the explanation, which makes sense. It fiddled a little with the code, and I think I have a cleanup here that uses the TreeNode walking to avoid having to enumerate Expr variants #8787

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @haohuaijin

| Expr::IsNotNull(expr)
| Expr::IsNull(expr)
| Expr::Negative(expr) => outer_columns_helper(expr, columns),
Expr::Column(_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression: Unneeded fields pushed to TableProvider if struct field is part of query

3 participants