-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Is your feature request related to a problem or challenge?
While looking at the results of the most recent clickbench run
Here is the ClickBench page (link)
I see there are a few queries where DataFusion is significantly slower

The queries are:
Q18:
| SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10; |
- Improve vectorized operations of
GroupColumn#13275 - Potential performance regression for TPCH q18 #13188
- Optimize date_part
Q35:
| SELECT "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3, COUNT(*) AS c FROM hits GROUP BY "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3 ORDER BY c DESC LIMIT 10; |
Describe the solution you'd like
I would like the queries to go faster
Describe alternatives you've considered
Both queries look like
SELECT COUNT(...) cnt ... ORDER BY cnt DESC LIMIT 10In other words they are "top 10 count" style queries
By default, DataFusion will compute the counts for all groups, and then pick only the top 10.
I suspect there is some fancier way to do this, perhaps by finding the top 10 values of count when emitting from the group operator or something. It would be interesting to see if we can see what other engines like DuckDB do with this query
Additional context
No response
tlm365, Dimchikkk, jonathanc-n and Rachelint
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request