
Add ability to specify external sort information for ParquetExec #4169

@alamb


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
IOx stores Parquet files in a particular sort order and then relies on that order for a variety of sort-related optimizations.

The new BasicEnforcement rule added in #4122 by @mingmwang is (correctly) deciding that, since the ParquetExec declares its output as unsorted, it needs to add a SortExec. That sort is unnecessary in our case and will slow performance dramatically.

I think the way to avoid this is to teach DataFusion that the output of the ParquetExec is actually sorted (which it is), and then everything will work out.

Describe the solution you'd like
I would like a way for someone constructing a ParquetExec manually to be able to specify that the data is already sorted.
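To make the request concrete, here is a minimal, std-only sketch of the idea. It does not use DataFusion's actual types: `Plan`, `SortKey`, `output_ordering`, and `enforce_sort` are all hypothetical names invented for illustration. The point is only that once the scan can declare an ordering, an enforcement rule can compare it with the required ordering and skip the extra sort.

```rust
/// One sort key: a column name plus direction (hypothetical, simplified).
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    column: String,
    descending: bool,
}

/// A toy physical plan with only the two node kinds discussed in this issue.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Parquet scan; `output_ordering` is whatever the caller declares
    /// (None means "not known to be sorted").
    ParquetScan {
        files: Vec<String>,
        output_ordering: Option<Vec<SortKey>>,
    },
    /// Explicit sort of the input by `keys`.
    Sort { keys: Vec<SortKey>, input: Box<Plan> },
}

impl Plan {
    /// The ordering this node claims for its output, if any.
    fn output_ordering(&self) -> Option<&[SortKey]> {
        match self {
            Plan::ParquetScan { output_ordering, .. } => output_ordering.as_deref(),
            Plan::Sort { keys, .. } => Some(keys),
        }
    }
}

/// Wrap `plan` in a Sort only when its declared ordering does not
/// already begin with the required keys.
fn enforce_sort(plan: Plan, required: &[SortKey]) -> Plan {
    let satisfied = plan
        .output_ordering()
        .map_or(false, |have| have.starts_with(required));
    if satisfied {
        plan
    } else {
        Plan::Sort {
            keys: required.to_vec(),
            input: Box::new(plan),
        }
    }
}
```

In this sketch, a scan built with `output_ordering: None` gets wrapped in a `Sort`, while a scan that declares the required keys is returned unchanged.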

Describe alternatives you've considered
It might be possible to figure out the sort order of the data from the Parquet metadata, but I haven't looked into that carefully.
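As a rough illustration of what this alternative could involve: the Parquet format lets each row group optionally record its `sorting_columns`. The sketch below is std-only and assumes that metadata has already been read; the `SortingColumn` struct here is a hand-written mirror of the format's fields, and `common_row_group_ordering` is a hypothetical helper. Writers are not required to populate this field, and it only describes the order of rows within a row group, so agreement across row groups is a necessary condition for a sorted file, not a sufficient one.

```rust
/// Hand-written mirror of the Parquet format's per-row-group sort key
/// (illustrative; not a type from any Parquet crate).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

/// Returns the ordering declared identically by every row group, or None
/// if there are no row groups, any row group declares nothing, or the
/// declarations disagree.
fn common_row_group_ordering(
    row_groups: &[Option<Vec<SortingColumn>>],
) -> Option<Vec<SortingColumn>> {
    let (first, rest) = row_groups.split_first()?;
    let first = first.as_ref()?;
    if first.is_empty() {
        return None;
    }
    for rg in rest {
        if rg.as_ref() != Some(first) {
            return None;
        }
    }
    Some(first.clone())
}
```

Even when this returns an ordering, a real implementation would still need to confirm that the row groups themselves appear in order (for example, using column statistics) before treating the whole file as sorted.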

Additional context

As a bonus, I think at least some of the plan construction logic in IOx that adds SortExecs to sort the data could be removed, since that is now covered by the DataFusion optimizer.

See more detail at https://github.com/influxdata/influxdb_iox/pull/6108#discussion_r1019387151

Labels

enhancement (New feature or request)
