-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Open
Labels
bugSomething isn't workingSomething isn't workingperformanceMake DataFusion fasterMake DataFusion faster
Description
Describe the bug
When data is sorted on a prefix, but not all, of the input columns I expect DataFusion to use the faster / more memory efficient operator PartialSortExec
:
pub struct PartialSortExec { |
However, this does not appear to be happening
To Reproduce
> copy (values(1, 'a'), (2,'b'), (3,'d'), (4,'a')) to '/tmp/order.csv';
+-------+
| count |
+-------+
| 4 |
+-------+
1 row(s) fetched.
Elapsed 0.005 seconds.
> create external table order stored as csv location '/tmp/order.csv' with order (column1 asc);
0 row(s) fetched.
Elapsed 0.002 seconds.
When ordering by just column1
(which is the declared table order) we can see the plan correctly avoids sorting 🎉
> explain select * from order ORDER BY column1;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ files: 1 │ |
| | │ format: csv │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.
However, when ordering by column1, column2
(the table is a prefix of the declared table order) we can see the plan uses a SortExec (which resorts the entire input, rather than just sorting within batches):
> explain select * from order ORDER BY column1, column2;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ SortExec │ |
| | │ -------------------- │ |
| | │ column1@0 ASC NULLS LAST, │ |
| | │ column2@1 ASC NULLS LAST │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ files: 1 │ |
| | │ format: csv │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
Expected behavior
I expect the existing sort order to be used
Specifically, in the query above I expect that PartialSortExec
is used instead:
> explain select * from order ORDER BY column1, column2;
Additional context
No response
zhuqi-lucas
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't workingperformanceMake DataFusion fasterMake DataFusion faster