[Epic] A collection of items related to processing larger than memory datasets (via spilling, externalized algorithm, etc) #14077

@alamb

Is your feature request related to a problem or challenge?

This epic organizes efforts to improve DataFusion's ability to process datasets that are larger than the configured memory budget.

Some of DataFusion's "pipeline blocking" operations (SortExec and HashGroupBy) already work with datasets that do not fit in memory, but their performance and usability could be improved.
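
For readers unfamiliar with spilling, here is a minimal sketch of the general idea behind an external sort: buffer rows up to a budget, write each full buffer to disk as a sorted run, then merge the runs lazily. This is not DataFusion's SortExec implementation. The function name `external_sort` and the `max_in_memory` parameter are hypothetical, and the budget is counted in items rather than bytes to keep the example short.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory):
    """Yield the ints in `values` in sorted order, buffering at most
    `max_in_memory` items at a time (hypothetical helper for illustration)."""
    run_paths = []
    buffer = []

    def spill(buf):
        # Write one sorted run to a temporary file, one value per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for v in sorted(buf):
                f.write(f"{v}\n")
        run_paths.append(path)

    # Phase 1: consume the input, spilling a sorted run whenever the
    # in-memory buffer reaches its budget.
    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)

    def read_run(path):
        # Stream a run back from disk one value at a time.
        with open(path) as f:
            for line in f:
                yield int(line)

    # Phase 2: k-way merge of the sorted runs. heapq.merge is lazy and
    # keeps only one pending item per run in memory.
    try:
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3))` spills three runs and merges them back into one sorted sequence. A real engine would track bytes against a memory pool and spill in a columnar format, which is where the performance and usability work comes in.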

Note: Joins are another operation that can run out of memory; today they return an error rather than falling back to another strategy (Sort-Merge-Join, for example). If people are interested in improving this, I think we could organize another project.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
