Support wider range of Subquery, handle the Count bug #6457

mingmwang · 2023-05-26T10:00:55Z

Which issue does this PR close?

Closes #6428
Closes #5808
Closes #6497

Rationale for this change

The existing de-correlation rule DecorrelatePredicateSubquery only supports Projection or Distinct as the top-level plan of the Subquery, with the Filter as the child of that top plan. The rule ScalarSubqueryToJoin likewise assumes that the top-level plan of the Subquery is a Projection whose child is an Aggregation.

In practice, though, Subquery plans can take many shapes, and the Filters that contain the correlated expressions (outer reference columns) can be nested deep inside the plan.

This PR re-implements the correlated expression pull-up process and supports as many cases as possible, regardless of the shape of the Subquery plan. It covers almost all cases in which the subquery plan can be converted to a Left Semi/Left Anti join (for In/Exists Subqueries) or to a Left Join or Cross Join (for Scalar Subqueries).
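To make the target shapes concrete, here is a minimal sketch of the semi/anti join equivalence for Exists subqueries. It uses Python's sqlite3 rather than DataFusion, and since SQLite has no SEMI/ANTI JOIN syntax, the joins are emulated (a join against distinct keys, and a left join with an IS NULL filter). The table contents and the helper name `semi_anti_demo` are made up for illustration.

```python
import sqlite3


def semi_anti_demo():
    """Compare correlated EXISTS / NOT EXISTS with emulated semi / anti joins."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE t1(t1_id INTEGER);
        CREATE TABLE t2(t2_id INTEGER);
        INSERT INTO t1 VALUES (1), (2), (3);
        INSERT INTO t2 VALUES (1), (1), (3);
        """
    )

    def ids(sql):
        return [row[0] for row in conn.execute(sql)]

    # Correlated forms: the subquery references the outer column t1.t1_id.
    exists = ids(
        "SELECT t1_id FROM t1 "
        "WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.t2_id = t1.t1_id) "
        "ORDER BY t1_id"
    )
    not_exists = ids(
        "SELECT t1_id FROM t1 "
        "WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.t2_id = t1.t1_id) "
        "ORDER BY t1_id"
    )

    # Emulated left semi join: join against the distinct keys so that
    # duplicate matches in t2 do not duplicate t1 rows.
    semi = ids(
        "SELECT t1.t1_id FROM t1 "
        "JOIN (SELECT DISTINCT t2_id FROM t2) k ON k.t2_id = t1.t1_id "
        "ORDER BY t1.t1_id"
    )
    # Emulated left anti join: keep only t1 rows without any match.
    anti = ids(
        "SELECT t1.t1_id FROM t1 "
        "LEFT JOIN t2 ON t2.t2_id = t1.t1_id "
        "WHERE t2.t2_id IS NULL "
        "ORDER BY t1.t1_id"
    )
    conn.close()
    return exists, semi, not_exists, anti


if __name__ == "__main__":
    print(semi_anti_demo())
```

In both pairs the correlated predicate `t2.t2_id = t1.t1_id` ends up as the join condition, which is the effect of pulling the correlated expression out of the subquery.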

After this PR, the remaining unsupported Subquery cases:

1) Uncorrelated Exists Subquery
2) Correlated In Subquery that contains a Limit or Order By clause (the correlated expressions cannot be pulled up)
3) Correlated Scalar Subquery that contains a Limit or Order By clause (the correlated expressions cannot be pulled up)
4) A Union in the Subquery plan whose children contain correlated expressions
5) Correlated expressions that appear not in a Filter clause but in Join conditions, aggregation expressions, window expressions, etc.

Cases 2), 3), 4) and 5) above cannot be converted to simple joins; de-correlating them needs a new rule that uses a different approach.
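As an illustration of why a Limit blocks the pull-up (cases 2 and 3), the sketch below compares a correlated scalar subquery containing Order By + Limit with a naive rewrite that moves the correlated filter above the Limit and turns it into a join condition. It runs on Python's sqlite3, not on DataFusion; the data and the helper name `limit_pullup_demo` are assumptions made for the example.

```python
import sqlite3


def limit_pullup_demo():
    """Show that moving a correlated filter above a LIMIT changes the result."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE t1(t1_id INTEGER);
        CREATE TABLE t2(t2_id INTEGER, t2_int INTEGER);
        INSERT INTO t1 VALUES (1), (2);
        INSERT INTO t2 VALUES (1, 10), (1, 20), (2, 30);
        """
    )

    # Correlated form: the LIMIT applies per outer row, after the filter.
    correlated = conn.execute(
        """
        SELECT t1_id,
               (SELECT t2_int FROM t2
                WHERE t2.t2_id = t1.t1_id
                ORDER BY t2_int LIMIT 1)
        FROM t1
        ORDER BY t1_id
        """
    ).fetchall()

    # Naive pull-up: the filter becomes a join condition, so the LIMIT
    # now applies once to the whole of t2, before the join.
    naive = conn.execute(
        """
        SELECT t1.t1_id, sq.t2_int
        FROM t1
        LEFT JOIN (SELECT t2_id, t2_int FROM t2 ORDER BY t2_int LIMIT 1) sq
          ON sq.t2_id = t1.t1_id
        ORDER BY t1.t1_id
        """
    ).fetchall()
    conn.close()
    return correlated, naive


if __name__ == "__main__":
    print(limit_pullup_demo())
```

With this data the correlated query finds a value for both outer rows, while the naive rewrite only finds one for `t1_id = 1`, because the single row kept by the Limit belongs to that key.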

Some TPC-DS queries are also affected and can now generate runnable physical plans.

What changes are included in this PR?

The alias logic has also changed. We cannot simply alias the output Exprs of the Subquery's Projections or Aggregations to an internal alias column like __scalar_value, because the Expr might be used by a higher-level plan, for example in a Having clause or as a Join condition.

Are these changes tested?

Yes. Added several UTs.
I will add more UTs tomorrow to cover the count aggregation cases (there is a bug here).

Are there any user-facing changes?

jackwener · 2023-05-26T16:06:20Z

Thanks @mingmwang .
I'm going away for the weekend and won't have time to review the code.
I'm going to review it next week.

datafusion/optimizer/src/utils.rs

datafusion/expr/src/logical_plan/plan.rs

datafusion/core/tests/tpcds_planning.rs

jackwener · 2023-05-30T05:30:51Z

datafusion/core/tests/sql/joins.rs

        "    TableScan: t1 projection=[t1_id, t1_name, t1_int] [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
        "    SubqueryAlias: __correlated_sq_1 [CAST(t2_id AS Int64) + Int64(1):Int64;N]",
-        "      Projection: CAST(t2.t2_id AS Int64) + Int64(1) AS CAST(t2_id AS Int64) + Int64(1) [CAST(t2_id AS Int64) + Int64(1):Int64;N]",
+        "      Projection: CAST(t2.t2_id AS Int64) + Int64(1) AS t2.t2_id + Int64(1) AS CAST(t2_id AS Int64) + Int64(1) [CAST(t2_id AS Int64) + Int64(1):Int64;N]",


I can't figure out why the double alias occurs.

The double alias is not caused by the decorrelate rules. It's caused by another logical optimization rule:

new_plan
LeftSemi Join: Filter: CAST(t1.t1_id AS Int64) + Int64(12) = __correlated_sq_3.CAST(t2_id AS Int64) + Int64(1)
  TableScan: t1
  SubqueryAlias: __correlated_sq_3
    Projection: CAST(t2.t2_id AS Int64) + Int64(1) AS CAST(t2_id AS Int64) + Int64(1), CAST(t2.t2_id AS Int64) + Int64(1)
      TableScan: t2

I can't figure out why the double alias occurs.

Fixed the alias problem.

mingmwang · 2023-05-30T10:15:40Z

I added additional logic to handle count() aggregations in the subquery; the logic is quite ugly.
I'm not sure whether there is a better way to handle this.

@jackwener
Could you please help to check how this SQL was rewritten in Apache Doris?

SELECT t1_id, (SELECT count(*) + 2 as _cnt FROM t2 WHERE t2.t2_int = t1.t1_int) from t1
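For reference, a small sketch of what goes wrong with this query under a plain left-join rewrite. It runs the SQL on Python's sqlite3 instead of DataFusion. The marker column `always_true` is named by analogy with the `__always_true` column in the plan posted later in this thread; the data, the helper name `projection_count_demo` and the exact shape of the fixed query are illustrative assumptions, not the PR's implementation.

```python
import sqlite3


def projection_count_demo():
    """count(*) + 2 in a projected scalar subquery: original vs. join rewrites."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE t1(t1_id INTEGER, t1_int INTEGER);
        CREATE TABLE t2(t2_id INTEGER, t2_int INTEGER);
        INSERT INTO t1 VALUES (1, 10), (2, 20);
        INSERT INTO t2 VALUES (7, 10), (8, 10);
        """
    )

    original = conn.execute(
        """
        SELECT t1_id,
               (SELECT count(*) + 2 AS _cnt FROM t2 WHERE t2.t2_int = t1.t1_int)
        FROM t1 ORDER BY t1_id
        """
    ).fetchall()

    # Naive rewrite: outer rows without a match get NULL instead of 0 + 2.
    naive = conn.execute(
        """
        SELECT t1.t1_id, sq._cnt
        FROM t1
        LEFT JOIN (SELECT t2_int, count(*) + 2 AS _cnt
                   FROM t2 GROUP BY t2_int) sq
          ON sq.t2_int = t1.t1_int
        ORDER BY t1.t1_id
        """
    ).fetchall()

    # Hedged fix: a constant marker column tells "no match" apart from a
    # real value, and the expression is evaluated with count = 0 in that case.
    fixed = conn.execute(
        """
        SELECT t1.t1_id,
               CASE WHEN sq.always_true IS NULL THEN 0 + 2 ELSE sq._cnt END
        FROM t1
        LEFT JOIN (SELECT t2_int, count(*) + 2 AS _cnt, 1 AS always_true
                   FROM t2 GROUP BY t2_int) sq
          ON sq.t2_int = t1.t1_int
        ORDER BY t1.t1_id
        """
    ).fetchall()
    conn.close()
    return original, naive, fixed


if __name__ == "__main__":
    print(projection_count_demo())
```

Note that the fallback for the unmatched row is the whole expression evaluated on an empty input (`0 + 2`), not just `0`, which is part of what makes a generic treatment awkward.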

jackwener · 2023-05-30T13:41:25Z

@mingmwang Doris doesn't support correlated subqueries in the projection

alamb · 2023-05-30T14:13:55Z

I am planning to focus my review efforts on other parts of DataFusion where I have more to add -- I don't plan to review this PR unless someone thinks that is important.

Thank you @mingmwang and @jackwener for pushing this forward

mingmwang · 2023-05-31T01:27:20Z

@alamb @jackwener
I'm going to mark this PR as a draft; I need more time to think about how to make the count() bug handling logic more generic. The current correlated expression pull-up and subquery rewrite are more generic, but the count aggregation handling is ugly.

The work should be related to reordering outer joins and aggregations.

liurenjie1024 · 2023-06-01T13:19:00Z

Why don't we implement the unnesting arbitrary subquery paper? I think it's state of the art.🤔

mingmwang · 2023-06-01T15:43:20Z

Why don't we implement the unnesting arbitrary subquery paper? I think it's state of the art.🤔

@liurenjie1024

This PR and the previous PRs I implemented/refactored still belong to the simple unnesting method. They cover the Predicate (In/Exists) Subquery and Scalar Subquery cases in which the correlated expressions can be pulled up and the correlation can be converted to outer joins or semi/anti joins.

Other, more complex cases can be de-correlated using the methods described in the unnesting arbitrary subquery paper. I will try to implement it later this year. The reason not to implement the paper directly is that its method might introduce additional joins compared to the simple unnesting method. The additional join comes from joining the inner table with the distinct set (magic set).
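A loose sketch of where that extra join comes from, written as plain SQL on Python's sqlite3. This is not the paper's algorithm and not what Hyper emits; it only mimics the general idea: compute the distinct set of outer correlation values, evaluate the subquery once per value by joining that set with the inner table, then join the result back to the outer table. The data, the helper name `magic_set_demo` and the column alias `k` are assumptions, and NULL correlation keys are ignored.

```python
import sqlite3


def magic_set_demo():
    """Correlated count(*) vs. a rewrite that goes through a distinct key set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE t1(t1_id INTEGER, t1_int INTEGER);
        CREATE TABLE t2(t2_id INTEGER, t2_int INTEGER);
        INSERT INTO t1 VALUES (1, 10), (2, 10), (3, 20);
        INSERT INTO t2 VALUES (7, 10), (8, 10);
        """
    )

    correlated = conn.execute(
        """
        SELECT t1_id,
               (SELECT count(*) FROM t2 WHERE t2.t2_int = t1.t1_int)
        FROM t1 ORDER BY t1_id
        """
    ).fetchall()

    # d is the distinct set of outer keys. Joining d with t2 is the
    # additional join; the final join maps each result back to its t1 rows.
    via_distinct_set = conn.execute(
        """
        SELECT t1.t1_id, sq.cnt
        FROM t1
        JOIN (SELECT d.k AS k, count(t2.t2_id) AS cnt
              FROM (SELECT DISTINCT t1_int AS k FROM t1) d
              LEFT JOIN t2 ON t2.t2_int = d.k
              GROUP BY d.k) sq
          ON sq.k = t1.t1_int
        ORDER BY t1.t1_id
        """
    ).fetchall()
    conn.close()
    return correlated, via_distinct_set


if __name__ == "__main__":
    print(magic_set_demo())
```

Compared with a simple left-join rewrite, this form reads t1 twice and performs two joins, which is the overhead described above.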

You can play with the Hyper web interface (Hyper implements this unnesting arbitrary subquery paper) and check the plan.

Hyper:
https://hyper-db.de/interface.html#

mingmwang · 2023-06-01T15:51:13Z

@mingmwang Doris doesn't support correlated subqueries in the projection

@jackwener
Could you please check this query in Apache Doris?
select t1.t1_int from t1 where (select count(*) from t2 where t1.t1_id = t2.t2_id) < t1.t1_int

        "Projection: t1.t1_int [t1_int:UInt32;N]",
        "  Filter: CASE WHEN __scalar_sq_1.__always_true IS NULL THEN Int64(0) ELSE __scalar_sq_1.COUNT(UInt8(1)) END < CAST(t1.t1_int AS Int64) [t1_int:UInt32;N, COUNT(UInt8(1)):Int64;N, __always_true:Boolean;N]",
        "    Projection: t1.t1_int, __scalar_sq_1.COUNT(UInt8(1)), __scalar_sq_1.__always_true [t1_int:UInt32;N, COUNT(UInt8(1)):Int64;N, __always_true:Boolean;N]",
        "      Left Join: t1.t1_id = __scalar_sq_1.t2_id [t1_id:UInt32;N, t1_int:UInt32;N, COUNT(UInt8(1)):Int64;N, t2_id:UInt32;N, __always_true:Boolean;N]",
        "        TableScan: t1 projection=[t1_id, t1_int] [t1_id:UInt32;N, t1_int:UInt32;N]",
        "        SubqueryAlias: __scalar_sq_1 [COUNT(UInt8(1)):Int64;N, t2_id:UInt32;N, __always_true:Boolean]",
        "          Projection: COUNT(UInt8(1)), t2.t2_id, __always_true [COUNT(UInt8(1)):Int64;N, t2_id:UInt32;N, __always_true:Boolean]",
        "            Aggregate: groupBy=[[t2.t2_id, Boolean(true) AS __always_true]], aggr=[[COUNT(UInt8(1))]] [t2_id:UInt32;N, __always_true:Boolean, COUNT(UInt8(1)):Int64;N]",
        "              TableScan: t2 projection=[t2_id] [t2_id:UInt32;N]",

I think this might also lead to the count bug if not handled correctly, even though the scalar subquery is in a predicate.
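A minimal reproduction of the concern, using Python's sqlite3 instead of DataFusion. The fixed query mirrors the CASE WHEN ... IS NULL pattern and the `__always_true` marker from the plan above, but it is hand-written SQL under assumed data, not output of the optimizer; the helper name `predicate_count_demo` is made up.

```python
import sqlite3


def predicate_count_demo():
    """count(*) scalar subquery in a WHERE predicate: original vs. rewrites."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE t1(t1_id INTEGER, t1_int INTEGER);
        CREATE TABLE t2(t2_id INTEGER);
        INSERT INTO t1 VALUES (1, 1), (2, 5);
        INSERT INTO t2 VALUES (1);
        """
    )

    def ints(sql):
        return [row[0] for row in conn.execute(sql)]

    original = ints(
        """
        SELECT t1.t1_int FROM t1
        WHERE (SELECT count(*) FROM t2 WHERE t1.t1_id = t2.t2_id) < t1.t1_int
        """
    )

    # Naive rewrite: for t1_id = 2 there is no group, so the count is NULL,
    # the comparison is NULL, and the row is filtered out.
    naive = ints(
        """
        SELECT t1.t1_int FROM t1
        LEFT JOIN (SELECT t2_id, count(*) AS cnt FROM t2 GROUP BY t2_id) sq
          ON t1.t1_id = sq.t2_id
        WHERE sq.cnt < t1.t1_int
        """
    )

    # Marker-based rewrite: a NULL marker means "no matching group", so the
    # count is replaced by 0 before the comparison.
    fixed = ints(
        """
        SELECT t1.t1_int FROM t1
        LEFT JOIN (SELECT t2_id, count(*) AS cnt, 1 AS always_true
                   FROM t2 GROUP BY t2_id) sq
          ON t1.t1_id = sq.t2_id
        WHERE CASE WHEN sq.always_true IS NULL THEN 0 ELSE sq.cnt END < t1.t1_int
        """
    )
    conn.close()
    return original, naive, fixed


if __name__ == "__main__":
    print(predicate_count_demo())
```

The row with `t1_id = 2` has no match in t2, so its count is 0 and it passes the original predicate; the naive join rewrite silently drops it.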

mingmwang · 2023-06-02T02:30:31Z

@alamb
Do you have time to review this PR next week?

I think subquery unnesting is important for TPC-DS and OLAP workloads, and supporting a wider range of subqueries will make DataFusion more competitive with other products.

alamb · 2023-06-02T16:22:28Z

Do you have time to review this PR next week?

Yes, I will find time. Thank you for the contribution -- I just don't have enough review bandwidth to keep up with everything!

alamb · 2023-06-06T10:53:56Z

This is on my list for review today

alamb

Thank you @mingmwang -- I realistically can't review a 3000+ line PR in fine-grained detail, but I did review most of the plan changes in this PR and I didn't see any issues with merging it

Likewise, I reviewed the code and I found it easy to follow.

In addition, given that a substantial amount of this PR is tests and that a bunch of the tpc-ds queries now run, I think we should merge it in and continue improvements on main.

alamb · 2023-06-06T18:21:24Z

datafusion/core/tests/sql/joins.rs

        "Explain [plan_type:Utf8, plan:Utf8]",
-        "  LeftSemi Join: CAST(t1.t1_int AS Int64) = __correlated_sq_2.CAST(t2_int AS Int64) + Int64(1) [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
-        "    LeftSemi Join: CAST(t1.t1_id AS Int64) + Int64(12) = __correlated_sq_1.CAST(t2_id AS Int64) + Int64(1) [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+        "  LeftSemi Join: CAST(t1.t1_int AS Int64) = __correlated_sq_2.t2.t2_int + Int64(1) [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",


Do these plans actually have fewer CASTs, or is it just an improvement in ALIAS generation?

It just removes the unnecessary ALIAS generation.

alamb · 2023-06-06T18:22:26Z