-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Closed
Labels
enhancementNew feature or requestNew feature or request
Description
Is your feature request related to a problem or challenge?
In current single_distinct_to_group_by rule, we only support all aggregate function are distinct aggregate.
But if the no distinct aggregate is count/min/max/sum, we can also do same transform like in single_distinct_to_group_by
before
select a, count(distinct b), count(c)
from t
group by aafter
select a, count(alias1), sum(alias2)
from (
select a, b as alias1, count(c) as alias2
from t
group by a, b
)
group by aDescribe the solution you'd like
By write, we can improve the perfmance of distinct aggregate
❯ SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") FROM '../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" LIMIT 10;
+----------+--------------------------------------------------+--------------------------------------------------------+
| RegionID | SUM(../benchmarks/data/hits.parquet.AdvEngineID) | COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID) |
+----------+--------------------------------------------------+--------------------------------------------------------+
| 0 | 0 | 8 |
| 1 | 147946 | 239380 |
| 2 | 441662 | 1081016 |
| 3 | 39724 | 131195 |
| 4 | 34557 | 79500 |
| 5 | 13502 | 40914 |
| 6 | 24338 | 55768 |
| 7 | 28417 | 64989 |
| 8 | 34483 | 65472 |
| 9 | 38047 | 91576 |
+----------+--------------------------------------------------+--------------------------------------------------------+
10 rows in set. Query took 1.357 seconds.
❯ SELECT "RegionID", SUM(t1), count("UserID") from (select "UserID", "RegionID", sum("AdvEngineID") as t1 from '../benchmarks/data/hits.parquet' group by "UserID", "RegionID") group by "RegionID" order by "RegionID" limit 10;
+----------+---------+-----------------------------------------------+
| RegionID | SUM(t1) | COUNT(../benchmarks/data/hits.parquet.UserID) |
+----------+---------+-----------------------------------------------+
| 0 | 0 | 8 |
| 1 | 147946 | 239380 |
| 2 | 441662 | 1081016 |
| 3 | 39724 | 131195 |
| 4 | 34557 | 79500 |
| 5 | 13502 | 40914 |
| 6 | 24338 | 55768 |
| 7 | 28417 | 64989 |
| 8 | 34483 | 65472 |
| 9 | 38047 | 91576 |
+----------+---------+-----------------------------------------------+
10 rows in set. Query took 0.919 seconds.
Describe alternatives you've considered
No response
Additional context
https://www.querifylabs.com/blog/distinct-aggregation-optimization-in-apache-calcite-and-trino
https://github.com/apache/calcite/blob/96b05ee12f936ed057265072ff6a2de8ea0a249e/core/src/main/java/org/apache/calcite/rel/rules/AggregateExpandDistinctAggregatesRule.java#L286-L298
alamb
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request