Skip to content

Entire input is resorted when the data is partially sorted (not using PartialSortExec) #16899

@alamb

Description

@alamb

Describe the bug

When data is sorted on a prefix, but not all, of the input columns I expect DataFusion to use the faster / more memory efficient operator PartialSortExec:

However, this does not appear to be happening

To Reproduce

> copy (values(1, 'a'), (2,'b'), (3,'d'), (4,'a')) to '/tmp/order.csv';
+-------+
| count |
+-------+
| 4     |
+-------+
1 row(s) fetched.
Elapsed 0.005 seconds.


> create external table order stored as csv location '/tmp/order.csv' with order (column1 asc);
0 row(s) fetched.
Elapsed 0.002 seconds.

When ordering by just column1 (which is the declared table order) we can see the plan correctly avoids sorting 🎉

> explain select * from order ORDER BY column1;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │          files: 1         │ |
|               | │        format: csv        │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.

However, when ordering by column1, column2 (the table is a prefix of the declared table order) we can see the plan uses a SortExec (which resorts the entire input, rather than just sorting within batches):

> explain select * from order ORDER BY column1, column2;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │          SortExec         │ |
|               | │    --------------------   │ |
|               | │ column1@0 ASC NULLS LAST, │ |
|               | │  column2@1 ASC NULLS LAST │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │          files: 1         │ |
|               | │        format: csv        │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

Expected behavior

I expect the existing sort order to be used

Specifically, in the query above I expect that PartialSortExec is used instead:

> explain select * from order ORDER BY column1, column2;

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingperformanceMake DataFusion faster

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions