Reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream #6798

@alamb

Is your feature request related to a problem or challenge?

We are trying to make hash-based aggregation significantly faster (see #4973).

This will require some non-trivial changes to how hash aggregation is organized. At the moment, BoundedAggregateStream and GroupedHashAggregateStream share a significant amount of code, so we will either have to duplicate the work of making hash aggregation faster, or BoundedAggregateStream will not get the benefits.

Here is a visual depiction of the common code:

 meld datafusion/core/src/physical_plan/aggregates/bounded_aggregate_stream.rs datafusion/core/src/physical_plan/aggregates/row_hash.rs 

(Screenshot of the meld diff between the two files, 2023-06-29)

Describe the solution you'd like

Reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream

The major differences are:

  1. Choice of when output can be emitted
  2. Clearing previous group state when groups have been emitted
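
To make the two differences concrete, here is a minimal sketch of how the shared grouping logic might be parameterized by an emission strategy. It is not DataFusion code: `EmitStrategy`, `GroupedSum`, `push_batch`, and `finish` are hypothetical names, the aggregate is a plain `SUM` over `i64`, and a `BTreeMap` stands in for the real hash table so that "all groups before the newest key" is easy to split off. It also assumes the ordered case receives input sorted ascending by group key.

```rust
use std::collections::BTreeMap;

/// Hypothetical strategy: when may finished groups leave the stream?
enum EmitStrategy {
    /// Unordered input: no group is final until the input ends.
    AtEnd,
    /// Input sorted by group key: every group below the newest key is final.
    OnKeyAdvance,
}

/// Shared grouping logic (a running SUM per key), parameterized by strategy.
struct GroupedSum {
    strategy: EmitStrategy,
    groups: BTreeMap<i64, i64>,
}

impl GroupedSum {
    fn new(strategy: EmitStrategy) -> Self {
        Self { strategy, groups: BTreeMap::new() }
    }

    /// Accumulate one batch of (key, value) rows, then return whatever
    /// the strategy allows emitting right now.
    fn push_batch(&mut self, batch: &[(i64, i64)]) -> Vec<(i64, i64)> {
        // Common code path: update per-group state.
        for &(key, value) in batch {
            *self.groups.entry(key).or_insert(0) += value;
        }
        match self.strategy {
            // Difference 1: nothing can be emitted early.
            EmitStrategy::AtEnd => Vec::new(),
            EmitStrategy::OnKeyAdvance => {
                // The largest key may still receive rows; all others are final.
                let Some(&last) = self.groups.keys().next_back() else {
                    return Vec::new();
                };
                // Difference 2: emitted groups are removed from the state.
                let still_open = self.groups.split_off(&last);
                let done = std::mem::replace(&mut self.groups, still_open);
                done.into_iter().collect()
            }
        }
    }

    /// Input is exhausted: emit every remaining group and clear the state.
    fn finish(&mut self) -> Vec<(i64, i64)> {
        std::mem::take(&mut self.groups).into_iter().collect()
    }
}
```

With `OnKeyAdvance`, pushing `[(1, 10), (1, 5), (2, 7)]` emits group `1` immediately and keeps only group `2` in memory; with `AtEnd`, the same input emits nothing until `finish` is called. Everything else (the accumulation loop) is shared, which is the kind of duplication this issue proposes to remove.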

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)