Skip to content

Consider persist=False as default when creating table from DataFrame #218

@randerzander

Description

@randerzander

By default, when creating a table from a Dask Dataframe, the create_table function persists the Dask DataFrame in memory.

This was surprising (partly because I didn't see it in the doc examples) when working on a large ETL job. I suspect for most users (especially those running larger jobs on systems where data is larger than memory) this will be a surprising realization. In my case, I spent time trying to debug which compute tasks might be using more memory than expected only to eventually realize it was just the persist call which failed. Once I re-created the tables w/ persist=False, my workflow completed successfully.

Is there a rationale for persisting by default?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions