Labels: PROPOSAL EPIC (a proposal being discussed that is not yet fully underway), enhancement (new feature or request)
Description
Usecase
Many analytic systems store their data in a particular sort order, and the query engine can often take advantage of this sort order to both reduce memory usage and improve performance.
Specific examples in DataFusion include:
- Emitting from GroupBy early with partially sorted stream
- `SortMergeJoin`
- Sort removal via `EnforceSorting` and `replace_with_order_preserving_variants`
This information is currently encoded in ExecutionPlan::maintains_input_order, ExecutionPlan::required_input_ordering, and PlanProperties.
The same underlying analysis is often required for streaming (where determining what to emit is modeled as a sorted stream, for example on date_trunc(ts) of a stream sorted by timestamp).
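To make the streaming case concrete, here is a minimal sketch in plain Rust of a per-day count over a timestamp-sorted input. It does not use DataFusion's APIs. `trunc_to_day` is a hypothetical stand-in for `date_trunc('day', ts)`, and `count_per_day_sorted` is an illustrative name, not an existing operator. Truncation is monotonic, so the day keys arrive in order as well, and a group can be emitted as soon as the key changes.

```rust
/// Hypothetical stand-in for `date_trunc('day', ts)`: truncates a Unix
/// timestamp (seconds) to the start of its UTC day.
fn trunc_to_day(ts_secs: i64) -> i64 {
    ts_secs.div_euclid(86_400) * 86_400
}

/// Counts rows per day, assuming `timestamps` is sorted ascending.
/// Only the state of the current group is held in memory; each finished
/// group is handed to `emit` as soon as a new day key is seen.
fn count_per_day_sorted<I, F>(timestamps: I, mut emit: F)
where
    I: IntoIterator<Item = i64>,
    F: FnMut(i64, u64),
{
    let mut current: Option<(i64, u64)> = None;
    for ts in timestamps {
        let day = trunc_to_day(ts);
        current = match current {
            // Same day as the open group: just bump the count.
            Some((key, count)) if key == day => Some((key, count + 1)),
            // Key changed: the previous group is complete, emit it early.
            Some((key, count)) => {
                emit(key, count);
                Some((day, 1))
            }
            // First row of the stream.
            None => Some((day, 1)),
        };
    }
    // Flush the last open group at end of input.
    if let Some((key, count)) = current {
        emit(key, count);
    }
}
```

A hash-based aggregation would have to keep every group until the end of input. The sketch above relies entirely on the sortedness assumption and produces split groups if the input is not actually sorted.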
Describe the solution you'd like
This epic has a list of optimizations / improvements that further take sortedness into account. Here are some related issues:
- Add ability to specify external sort information for ParquetExec #4169
- Enable `split_file_groups_by_statistics` by default #10336
- Add `ProgressiveEval` operator for optimize `SortPreservingMerge` #10488
- [EPIC] Avoid sort for already sorted Parquet files that do not overlap values on condition #6672
- Use file statistics in query planning to avoid sorting when unnecessary #7490
- Automatically detect and use "is the data sorted" information in parquet file metadata #4177
- Add an option to avoid grouping hash-partitioning #10257
- Implement a way to preserve partitioning through `UnionExec` without losing ordering #10314
- Optimized version of `SortPreservingMerge` that doesn't actually compare sort keys if the key ranges are ordered #10316
- Support sort pushdown #7871
- Optimize SortPreservingMergeExec for single-column merge #13642
- Use Row Format in SortExec #7053
- More accurate memory accounting in external sort #14748 (depends on Use Row Format in SortExec #7053)
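Several of the issues above (for example #6672 and #10316) rest on the same observation: if sorted inputs cover non-overlapping key ranges, a globally sorted output can be produced by ordering the inputs by their ranges and concatenating them, with no per-row key comparison. The following is a rough sketch of that idea in plain Rust, not DataFusion code. `SortedRun` and `concat_if_non_overlapping` are hypothetical names; the `min`/`max` fields stand in for file-level statistics, and each run is assumed to be non-empty and already sorted.

```rust
/// Hypothetical stand-in for a sorted file plus its min/max statistics.
/// Assumes `values` is non-empty, sorted ascending, and bounded by min/max.
#[derive(Debug, Clone)]
struct SortedRun {
    min: i64,
    max: i64,
    values: Vec<i64>,
}

/// Returns all values in global sorted order by concatenating runs, or
/// `None` if any two runs overlap (a real merge would then be needed).
fn concat_if_non_overlapping(mut runs: Vec<SortedRun>) -> Option<Vec<i64>> {
    // Order the runs by their statistics only; no row values are compared.
    runs.sort_by_key(|r| (r.min, r.max));

    // Adjacent runs overlap if one ends after the next one starts.
    let overlaps = runs.windows(2).any(|w| w[0].max > w[1].min);
    if overlaps {
        return None;
    }

    // Ranges are disjoint and ordered, so plain concatenation is sorted.
    Some(runs.into_iter().flat_map(|r| r.values).collect())
}
```

When the function returns `None`, the caller would fall back to an ordinary sort-preserving merge. The sketch is only as correct as the statistics it is given.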