Skip to content

Optimize "LIMIT" queries for speed / memory with special TopK operator #7196

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

This pattern is common:

SELECT c1, c2
FROM t
ORDER BY c3
LIMIT 10

For example we have queries in IOx like the following (this is the same pattern @NGA-TRAN describes on #7162)

SELECT tag, value1, ...
FROM t
WHERE other_column = 'foo'
ORDER BY time
LIMIT 10

Describe the solution you'd like

If the data IS NOT already sorted, what happens today is a plan like

LIMIT(fetch=10)
  SORT(sort_exprs=[c3] fetch=10)
    SCAN(...)

And the Sort can take partial advantage of the fetch -- and it will be better after @gruuya 's change in #7180

We can probably do better still with a special operator like the following that uses some specialized structure (perhaps some type of heap)

TOPK(fetch=10, sort_exprs=[c3])
    SCAN(...)

Describe alternatives you've considered

If the data is already sorted the right way, DataFusion can just read first N values and stop as described on #7162

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions