Skip to content

Use Arrow Row Format in SortExec to improve performance #5230

@tustvold

Description

@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

SortPreservingMerge now makes use of the arrow row format and this has yielded significant performance improvements over the prior DynComparator based approach. We can likely signifcantly improve the performance of SortExec by modifying sort_batch to also make use of the row format when performing multi-column sorts, instead of lexsort_to_indices which internally uses DynComparator.

For single-column sorts lexsort_to_indices will call through to sort_to_indices which will be faster than the row format, we should make sure to keep this special case.

Describe the solution you'd like

A first iteration could simply modify sort_batch to use the row format for multi-column sorts, as demonstrated here, falling back to sort_to_indices if only a single column.

A second iteration could then look to find a way to convert to the row format once, and preserve this encoding when feeding sorted batches into SortPreservingMerge.

Describe alternatives you've considered
We could not do this

Additional context

FYI @alamb @mustafasrepo @ozankabak

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requesthelp wantedExtra attention is needed

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions